{ "cells": [ { "cell_type": "markdown", "id": "77149e57", "metadata": {}, "source": [ "# Overview of User Defined Functions with cuDF" ] }, { "cell_type": "code", "execution_count": 1, "id": "0c6b65ce", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from numba import config\n", "\n", "config.CUDA_LOW_OCCUPANCY_WARNINGS = 0\n", "\n", "import cudf\n", "from cudf.datasets import randomdata" ] }, { "cell_type": "markdown", "id": "8826af13", "metadata": {}, "source": [ "Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.\n", "\n", "In conjunction with the broader GPU PyData ecosystem, cuDF provides interfaces to run UDFs on a variety of data structures. Currently, we can only execute UDFs on numeric, boolean, datetime, and timedelta typed data, with partial support for strings in some APIs. This guide covers writing and executing UDFs on the following data structures:\n", "\n", "- Series\n", "- DataFrame\n", "- Rolling Windows Series\n", "- Groupby DataFrames\n", "- CuPy NDArrays\n", "- Numba DeviceNDArrays\n", "\n", "It also demonstrates cuDF's default null handling behavior, and how to write UDFs that can interact with null values." ] }, { "cell_type": "markdown", "id": "32a8f4fb", "metadata": {}, "source": [ "## Series UDFs\n", "\n", "You can execute UDFs on Series in two ways:\n", "\n", "- Writing a standard Python function and using `cudf.Series.apply`\n", "- Writing a Numba kernel and using Numba's `forall` syntax\n", "\n", "Using `apply` is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we'll be writing only simple kernels in this guide)."
] }, { "cell_type": "markdown", "id": "49399a84", "metadata": {}, "source": [ "### `cudf.Series.apply`" ] }, { "cell_type": "markdown", "id": "0a209ea2", "metadata": {}, "source": [ "cuDF provides a similar API to `pandas.Series.apply` for applying scalar UDFs to series objects. Here is a very basic example." ] }, { "cell_type": "code", "execution_count": 2, "id": "e28d5b82", "metadata": {}, "outputs": [], "source": [ "# Create a cuDF series\n", "sr = cudf.Series([1, 2, 3])" ] }, { "cell_type": "markdown", "id": "48a9fa5e", "metadata": {}, "source": [ "UDFs destined for `cudf.Series.apply` might look something like this:" ] }, { "cell_type": "code", "execution_count": 3, "id": "96aeb19f", "metadata": {}, "outputs": [], "source": [ "# define a scalar function\n", "def f(x):\n", "    return x + 1" ] }, { "cell_type": "markdown", "id": "e61d0169", "metadata": {}, "source": [ "`cudf.Series.apply` is called like `pd.Series.apply` and returns a new `Series` object:" ] }, { "cell_type": "code", "execution_count": 4, "id": "8ca08834", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 3\n", "2 4\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr.apply(f)" ] }, { "cell_type": "markdown", "id": "c98dab03", "metadata": {}, "source": [ "### Functions with Additional Scalar Arguments" ] }, { "cell_type": "markdown", "id": "2aa3df6f", "metadata": {}, "source": [ "In addition, `cudf.Series.apply` supports `args=` just like pandas, allowing you to write UDFs that accept an arbitrary number of scalar arguments. 
Here is an example of such a function and its API call in cuDF:" ] }, { "cell_type": "code", "execution_count": 5, "id": "8d156d01", "metadata": {}, "outputs": [], "source": [ "def g(x, const):\n", "    return x + const" ] }, { "cell_type": "code", "execution_count": 6, "id": "1dee82d7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 43\n", "1 44\n", "2 45\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cuDF apply\n", "sr.apply(g, args=(42,))" ] }, { "cell_type": "markdown", "id": "22739e28", "metadata": {}, "source": [ "As a final note, `**kwargs` is not yet supported." ] }, { "cell_type": "markdown", "id": "afbf33dc", "metadata": {}, "source": [ "### Nullable Data" ] }, { "cell_type": "markdown", "id": "5dc06e8c", "metadata": {}, "source": [ "The null value `NA` propagates through unary and binary operations. Thus, `NA + 1`, `abs(NA)`, and `NA == NA` all return `NA`. To make this concrete, let's look at the same example from above, this time using nullable data:" ] }, { "cell_type": "code", "execution_count": 7, "id": "bda261dd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 <NA>\n", "2 3\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a cuDF series with nulls\n", "sr = cudf.Series([1, cudf.NA, 3])\n", "sr" ] }, { "cell_type": "code", "execution_count": 8, "id": "0123ae07", "metadata": {}, "outputs": [], "source": [ "# redefine the same function from above\n", "def f(x):\n", "    return x + 1" ] }, { "cell_type": "code", "execution_count": 9, "id": "e95868dd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 <NA>\n", "2 4\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cuDF result\n", "sr.apply(f)" ] }, { "cell_type": "markdown", "id": "97372e15", "metadata": {}, "source": [ "Often, however, you 
want explicit null handling behavior inside the function. cuDF exposes this capability the same way as pandas, by interacting directly with the `NA` singleton object. Here's an example of a function with explicit null handling:" ] }, { "cell_type": "code", "execution_count": 10, "id": "6c65241b", "metadata": {}, "outputs": [], "source": [ "def f_null_sensitive(x):\n", "    # do something if the input is null\n", "    if x is cudf.NA:\n", "        return 42\n", "    else:\n", "        return x + 1" ] }, { "cell_type": "code", "execution_count": 11, "id": "ab0f4dbf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 42\n", "2 4\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cuDF result\n", "sr.apply(f_null_sensitive)" ] }, { "cell_type": "markdown", "id": "bdddc4e8", "metadata": {}, "source": [ "In addition, `cudf.NA` can be returned from a function directly or conditionally. This capability should allow you to implement custom null handling in a wide variety of cases." ] }, { "cell_type": "markdown", "id": "cc7c7e67", "metadata": {}, "source": [ "### String Data" ] }, { "cell_type": "markdown", "id": "81762aea", "metadata": {}, "source": [ "Experimental support for a subset of string functionality is available for `apply`. The following string operations are currently supported:\n", "\n", "- `str.count`\n", "- `str.startswith`\n", "- `str.endswith`\n", "- `str.find`\n", "- `str.rfind`\n", "- `str.isalnum`\n", "- `str.isdecimal`\n", "- `str.isdigit`\n", "- `str.islower`\n", "- `str.isupper`\n", "- `str.isalpha`\n", "- `str.istitle`\n", "- `str.isspace`\n", "- `==`, `!=`, `>=`, `<=`, `>`, `<` (between two strings)\n", "- `len` (e.g. 
`len(some_string)`)\n", "- `in` (e.g., `'abc' in some_string`)\n", "- `strip`\n", "- `lstrip`\n", "- `rstrip`\n", "- `upper`\n", "- `lower`\n", "- `+` (string concatenation)\n", "- `replace`" ] }, { "cell_type": "code", "execution_count": 12, "id": "d7d1abd7", "metadata": {}, "outputs": [], "source": [ "sr = cudf.Series([\"\", \"abc\", \"some_example\"])" ] }, { "cell_type": "code", "execution_count": 13, "id": "e8538ba0", "metadata": {}, "outputs": [], "source": [ "def f(st):\n", "    if len(st) > 0:\n", "        if st.startswith(\"a\"):\n", "            return 1\n", "        elif \"example\" in st:\n", "            return 2\n", "        else:\n", "            return -1\n", "    else:\n", "        return 42" ] }, { "cell_type": "code", "execution_count": 14, "id": "23524fd8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 42\n", "1 1\n", "2 2\n", "dtype: int64\n" ] } ], "source": [ "result = sr.apply(f)\n", "print(result)" ] }, { "cell_type": "markdown", "id": "71c4cb83", "metadata": {}, "source": [ "### String UDF Memory Considerations" ] }, { "cell_type": "markdown", "id": "e41a82a1", "metadata": {}, "source": [ "UDFs that create intermediate strings as part of the computation may require memory tuning. An API is provided for convenience to accomplish this:" ] }, { "cell_type": "code", "execution_count": 15, "id": "b26ec6dc", "metadata": {}, "outputs": [], "source": [ "from cudf.core.udf.utils import set_malloc_heap_size\n", "\n", "set_malloc_heap_size(int(2e9))" ] }, { "cell_type": "markdown", "id": "54cafbc0", "metadata": {}, "source": [ "### Lower level control with custom `numba` kernels" ] }, { "cell_type": "markdown", "id": "00914f2a", "metadata": {}, "source": [ "In addition to the `Series.apply()` method for performing custom operations, you can also pass Series objects directly into [CUDA kernels written with Numba](https://numba.readthedocs.io/en/stable/cuda/kernels.html).\n", "Note that this section requires basic CUDA knowledge. 
Refer to [Numba's CUDA documentation](https://numba.readthedocs.io/en/stable/cuda/index.html) for details.\n", "\n", "The easiest way to write a Numba kernel is to use `cuda.grid(1)` to manage thread indices, and then leverage Numba's `forall` method to configure the kernel for us. Below, we define a basic multiplication kernel as an example and use `@cuda.jit` to compile it." ] }, { "cell_type": "code", "execution_count": 16, "id": "732434f6", "metadata": {}, "outputs": [], "source": [ "df = randomdata(nrows=5, dtypes={\"a\": int, \"b\": int, \"c\": int}, seed=12)" ] }, { "cell_type": "code", "execution_count": 17, "id": "4f5997e5", "metadata": {}, "outputs": [], "source": [ "from numba import cuda\n", "\n", "\n", "@cuda.jit\n", "def multiply(in_col, out_col, multiplier):\n", "    i = cuda.grid(1)\n", "    if i < in_col.size:  # boundary guard\n", "        out_col[i] = in_col[i] * multiplier" ] }, { "cell_type": "markdown", "id": "d9667a55", "metadata": {}, "source": [ "This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an `if` statement. Because we can launch more threads than the size of our array, we need to make sure that we don't use threads with an index that would be out of bounds. Leaving this out can result in undefined behavior.\n", "\n", "To execute our kernel, we must pre-allocate an output array and leverage the `forall` method mentioned above. First, we create a Series of all `0.0` in our DataFrame, since we want `float64` output. Next, we run the kernel with `forall`. `forall` requires us to specify our desired number of tasks, so we'll supply the length of our Series (which we store in `size`). The [`__cuda_array_interface__`](https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html) is what allows us to directly call our Numba kernel on our Series."
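, "\n",
"\n",
"As an aside on the boundary guard, here is a minimal CPU-only sketch of why a launch can contain more threads than tasks: threads are launched in whole blocks, so the thread count is rounded up to a multiple of the block size. The block size of 128 and the helper name `launched_threads` are assumptions made for illustration; `forall` picks its own launch configuration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def launched_threads(n_tasks, threads_per_block=128):\n",
"    # Hypothetical helper: round the task count up to whole blocks.\n",
"    blocks = math.ceil(n_tasks / threads_per_block)\n",
"    return blocks * threads_per_block\n",
"\n",
"\n",
"n = 5  # the Series in this example has 5 rows\n",
"total = launched_threads(n)\n",
"# Thread indices at or beyond n have no element to process.\n",
"idle = [i for i in range(total) if not i < n]\n",
"print(total, len(idle))\n",
"```\n",
"\n",
"Under these assumed numbers, most of the launched threads have no row to work on, which is the case the `if` statement in the kernel handles."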
] }, { "cell_type": "code", "execution_count": 18, "id": "ea6008a6", "metadata": {}, "outputs": [], "source": [ "size = len(df[\"a\"])\n", "df[\"e\"] = 0.0\n", "multiply.forall(size)(df[\"a\"], df[\"e\"], 10.0)" ] }, { "cell_type": "markdown", "id": "3fb69909", "metadata": {}, "source": [ "After calling our kernel, our DataFrame is now populated with the result." ] }, { "cell_type": "code", "execution_count": 19, "id": "183a82ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>963</td>\n", " <td>1005</td>\n", " <td>997</td>\n", " <td>9630.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>977</td>\n", " <td>1026</td>\n", " <td>980</td>\n", " <td>9770.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1048</td>\n", " <td>1026</td>\n", " <td>1019</td>\n", " <td>10480.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1078</td>\n", " <td>960</td>\n", " <td>985</td>\n", " <td>10780.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>979</td>\n", " <td>982</td>\n", " <td>1011</td>\n", " <td>9790.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c e\n", "0 963 1005 997 9630.0\n", "1 977 1026 980 9770.0\n", "2 1048 1026 1019 10480.0\n", "3 1078 960 985 10780.0\n", "4 979 982 1011 9790.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "ab9c305e", "metadata": {}, "source": [ "This API allows you 
to write, in principle, arbitrary kernel logic that reads elements of the series at arbitrary indices and runs on cuDF data structures. Advanced developers with some CUDA experience can often use this capability to implement iterative transformations, or spot-treat problem areas of a data pipeline with a custom kernel that does the same job faster." ] }, { "cell_type": "markdown", "id": "0acc6ef2", "metadata": {}, "source": [ "## DataFrame UDFs\n", "\n", "As with `cudf.Series`, there are multiple ways of using UDFs on DataFrames, which essentially amount to UDFs that expect multiple columns as input:\n", "\n", "- `cudf.DataFrame.apply`, which functions like `pd.DataFrame.apply` and expects a row UDF\n", "- `cudf.DataFrame.apply_rows`, which is a thin wrapper around Numba and expects a Numba kernel\n", "- `cudf.DataFrame.apply_chunks`, which is similar to `cudf.DataFrame.apply_rows` but offers lower level control." ] }, { "cell_type": "markdown", "id": "2102c3ed", "metadata": {}, "source": [ "### `cudf.DataFrame.apply`" ] }, { "cell_type": "markdown", "id": "238bec41", "metadata": {}, "source": [ "`cudf.DataFrame.apply` is the main entrypoint for UDFs that expect multiple columns as input and produce a single output column. Functions intended to be consumed by this API are written in terms of a \"row\" argument. The \"row\" is considered to be like a dictionary and contains all of the column values at a certain `iloc` in a `DataFrame`. The function accesses these values by key, where each key is the name of the column holding the desired value. Below is an example function that would be used to add column `A` and column `B` together inside a UDF."
] }, { "cell_type": "code", "execution_count": 20, "id": "73653918", "metadata": {}, "outputs": [], "source": [ "def f(row):\n", "    return row[\"A\"] + row[\"B\"]" ] }, { "cell_type": "markdown", "id": "b5eb32dd", "metadata": {}, "source": [ "Let's create some very basic toy data containing at least one null." ] }, { "cell_type": "code", "execution_count": 21, "id": "077feb75", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B\n", "0 1 4\n", "1 2 <NA>\n", "2 3 6" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.DataFrame({\"A\": [1, 2, 3], \"B\": [4, cudf.NA, 6]})\n", "df" ] }, { "cell_type": "markdown", "id": "609a3da5", "metadata": {}, "source": [ "Finally, call the function as you would in pandas, by passing it to `apply` with `axis=1` so that it is mapped onto the \"rows\" of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 22, "id": "091e39e1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5\n", "1 <NA>\n", "2 9\n", "dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "44e54c31", "metadata": {}, "source": [ "The same function should produce the same 
result in pandas:" ] }, { "cell_type": "code", "execution_count": 23, "id": "bd345fab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5\n", "1 <NA>\n", "2 9\n", "dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_pandas(nullable=True).apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "004fbbba", "metadata": {}, "source": [ "Notice that pandas returns `object` dtype; see the notes on this in the caveats section." ] }, { "cell_type": "markdown", "id": "0b11c172", "metadata": {}, "source": [ "Like `cudf.Series.apply`, these functions support generalized null handling. Here's a function that conditionally returns a different value if a certain input is null:" ] }, { "cell_type": "code", "execution_count": 24, "id": "b70f4b3b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a\n", "0 1\n", "1 <NA>\n", "2 3" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row):\n", "    x = row[\"a\"]\n", "    if x is cudf.NA:\n", "        return 0\n", "    else:\n", "        return x + 1\n", "\n", "\n", "df = cudf.DataFrame({\"a\": [1, cudf.NA, 3]})\n", "df" ] }, { "cell_type": "code", "execution_count": 25, "id": "0313c8df", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 0\n", "2 4\n", 
"dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "313c77f3", "metadata": {}, "source": [ "`cudf.NA` can also be directly returned from a function resulting in data that has the correct nulls in the end, just as if it were run in Pandas. For the following data, the last row fulfills the condition that `1 + 3 > 3` and returns `NA` for that row:" ] }, { "cell_type": "code", "execution_count": 26, "id": "96a7952a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 2\n", "1 2 1\n", "2 3 1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row):\n", " x = row[\"a\"]\n", " y = row[\"b\"]\n", " if x + y > 3:\n", " return cudf.NA\n", " else:\n", " return x + y\n", "\n", "\n", "df = cudf.DataFrame({\"a\": [1, 2, 3], \"b\": [2, 1, 1]})\n", "df" ] }, { "cell_type": "code", "execution_count": 27, "id": "e0815f60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3\n", "1 3\n", "2 <NA>\n", "dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": 
"b9c674f4", "metadata": {}, "source": [ "Mixed types are allowed, but will return the common type, rather than object as in Pandas. Here's a null aware op between an int and a float column:" ] }, { "cell_type": "code", "execution_count": 28, "id": "495efd14", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0.5</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>3.14</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 0.5\n", "1 2 <NA>\n", "2 3 3.14" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row):\n", " return row[\"a\"] + row[\"b\"]\n", "\n", "\n", "df = cudf.DataFrame({\"a\": [1, 2, 3], \"b\": [0.5, cudf.NA, 3.14]})\n", "df" ] }, { "cell_type": "code", "execution_count": 29, "id": "678b0b5a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.5\n", "1 <NA>\n", "2 6.14\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "ce0897c0", "metadata": {}, "source": [ "Functions may also return scalar values, however the result will be promoted to a safe type regardless of the data. 
This means that even if you have a function like:\n", "\n", "```python\n", "def f(x):\n", "    if x > 1000:\n", "        return 1.5\n", "    else:\n", "        return 2\n", "```\n", "And your data is:\n", "```python\n", "[1, 2, 3, 4, 5]\n", "```\n", "You will get floats in the final data even though a float is never returned. This is because Numba ultimately needs to produce one function that can handle any data, which means if there's any possibility a float could result, you must always assume it will happen. Here's an example of a function that returns a scalar in some cases:" ] }, { "cell_type": "code", "execution_count": 30, "id": "acf48d56", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a\n", "0 1\n", "1 3\n", "2 5" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row):\n", "    x = row[\"a\"]\n", "    if x > 3:\n", "        return x\n", "    else:\n", "        return 1.5\n", "\n", "\n", "df = cudf.DataFrame({\"a\": [1, 3, 5]})\n", "df" ] }, { "cell_type": "code", "execution_count": 31, "id": "78a98172", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.5\n", "1 1.5\n", "2 5.0\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "2ceaece4", "metadata": {}, 
"source": [ "Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:" ] }, { "cell_type": "code", "execution_count": 32, "id": "142c30a9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>d</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>4</td>\n", " <td><NA></td>\n", " <td>8</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>7</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>6</td>\n", " <td>4</td>\n", " <td>8</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c d e\n", "0 1 4 <NA> 8 7\n", "1 2 5 4 7 1\n", "2 3 6 4 8 6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(row):\n", " return row[\"a\"] + (row[\"b\"] - (row[\"c\"] / row[\"d\"])) % row[\"e\"]\n", "\n", "\n", "df = cudf.DataFrame(\n", " {\n", " \"a\": [1, 2, 3],\n", " \"b\": [4, 5, 6],\n", " \"c\": [cudf.NA, 4, 4],\n", " \"d\": [8, 7, 8],\n", " \"e\": [7, 1, 6],\n", " }\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 33, "id": "fee9198a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 <NA>\n", "1 2.428571429\n", "2 8.5\n", "dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(f, axis=1)" ] }, { "cell_type": "markdown", "id": "71e30d33", "metadata": {}, 
"source": [ "### String Data" ] }, { "cell_type": "markdown", "id": "1a3694ea", "metadata": {}, "source": [ "String data may be used inside `DataFrame.apply` UDFs, subject to the same constraints as those for `Series.apply`. See the section on string handling for `Series` UDFs above for details. Below is a simple example extending the row UDF logic from above in the case of a string column:" ] }, { "cell_type": "code", "execution_count": 34, "id": "cccd59f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>str_col</th>\n", " <th>scale</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>abc</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ABC</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Example</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " str_col scale\n", "0 abc 1\n", "1 ABC 2\n", "2 Example 3" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df = cudf.DataFrame(\n", " {\"str_col\": [\"abc\", \"ABC\", \"Example\"], \"scale\": [1, 2, 3]}\n", ")\n", "str_df" ] }, { "cell_type": "code", "execution_count": 35, "id": "35737fd9", "metadata": {}, "outputs": [], "source": [ "def f(row):\n", " st = row[\"str_col\"]\n", " scale = row[\"scale\"]\n", "\n", " if len(st) > 5:\n", " return len(st) + scale\n", " else:\n", " return len(st)" ] }, { "cell_type": "code", "execution_count": 36, "id": "4ede4d5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 3\n", "1 3\n", "2 
10\n", "dtype: int64\n" ] } ], "source": [ "result = str_df.apply(f, axis=1)\n", "print(result)" ] }, { "cell_type": "markdown", "id": "5e2d6f15-b2a0-4075-8fe4-db3e99927fe6", "metadata": {}, "source": [ "### User Defined Aggregation Functions (UDAFs) and `GroupBy.apply`" ] }, { "cell_type": "markdown", "id": "22687948-6201-4d7a-ab18-82c888281e6a", "metadata": {}, "source": [ "cuDF provides support for accelerating a subset of user defined aggregations through the GroupBy apply `jit` engine, which is based on numba. Aggregations meeting the criteria necessary for execution through the `jit` engine shall be run as such automatically. Users wishing to develop aggregation functions for the `jit` engine may call it explicitly by passing `engine='jit'` to `apply`:" ] }, { "cell_type": "code", "execution_count": 37, "id": "7f0c172a-94a6-4f84-8a94-c868610875bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2</td>\n", " <td>5</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>2</td>\n", " <td>6</td>\n", " <td>8</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ 
" a b c\n", "0 1 1 3\n", "1 1 2 4\n", "2 1 3 5\n", "3 2 4 6\n", "4 2 5 7\n", "5 2 6 8" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a dataframe with two groups\n", "df = cudf.DataFrame(\n", " {\n", " \"a\": [1, 1, 1, 2, 2, 2],\n", " \"b\": [1, 2, 3, 4, 5, 6],\n", " \"c\": [3, 4, 5, 6, 7, 8],\n", " }\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 38, "id": "7ed9ddd1-a9b8-4154-9234-f01c57181fa9", "metadata": {}, "outputs": [], "source": [ "# a user defined aggregation function.\n", "def udaf(df):\n", " return df[\"b\"].max() - df[\"b\"].min() / 2" ] }, { "cell_type": "code", "execution_count": 39, "id": "cea4a080-4c6e-4393-831a-270a14cb79ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a\n", "1 2.5\n", "2 4.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = df.groupby(\"a\").apply(udaf, engine=\"jit\")\n", "result" ] }, { "cell_type": "markdown", "id": "74a1453e-3af0-43f8-805c-8fc5c63b5a28", "metadata": {}, "source": [ "### GroupBy JIT Engine Supported Features\n", "For cuDF to execute a UDAF through the JIT engine, several criteria must be met for the input data and UDAF itself. It is expected that many restrictions may be lifted as development proceeds." ] }, { "cell_type": "markdown", "id": "b05a3a17-abb5-487b-987e-4fcf92209ed1", "metadata": {}, "source": [ "#### Restrictions\n", "- Data containing nulls is not yet permitted. Attempting to use data containing nulls with `engine='jit'` will raise.\n", "- Broadly speaking, only 4 or 8 byte integer and float dtypes are permitted: `[\"int32, \"int64\" ,\"float32\", \"float64\"]`.\n", "- Some functions have additional dtype restrictions, such as `corr`, which does not yet support floating point dtypes. 
Calling `corr` with such a missing overload will raise.\n", "- If a column of an unsupported dtype is accessed and used inside a UDAF, cuDF will raise.\n", "- Operations that return new columns are not permitted within the UDAF, such as a binary operation between columns:\n", " ```python\n", " df['a'] + df['b']\n", " ```\n", " Doing so will raise.\n", "- Functions that return `Series` or `DataFrame` objects are not yet available; only functions that return scalars are permitted.\n", "- The following reductions are supported:\n", " - `max()`\n", " - `min()`\n", " - `sum()`\n", " - `mean()`\n", " - `var()`\n", " - `std()`\n", " - `idxmax()`\n", " - `idxmin()`\n", " - `corr()` (integer data only)\n", "\n", " " ] }, { "cell_type": "markdown", "id": "9c587bd2", "metadata": {}, "source": [ "### Numba kernels for DataFrames" ] }, { "cell_type": "markdown", "id": "adc6a459", "metadata": {}, "source": [ "We could apply a UDF on a DataFrame like we did above with `forall`. We'd need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: `apply_rows` and `apply_chunks`. Below, we walk through an example of using `apply_rows`. `apply_chunks` works in a similar way, but also offers more control over low-level kernel behavior." 
] }, { "cell_type": "code", "execution_count": 40, "id": "37c93c69-22d1-430f-8f02-67eef27094f8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.691674</td>\n", " <td>True</td>\n", " <td>Dan</td>\n", " <td>-0.958380</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.480099</td>\n", " <td>False</td>\n", " <td>Bob</td>\n", " <td>-0.729580</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.473370</td>\n", " <td>True</td>\n", " <td>Xavier</td>\n", " <td>-0.767454</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.067479</td>\n", " <td>True</td>\n", " <td>Alice</td>\n", " <td>-0.380205</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.970850</td>\n", " <td>False</td>\n", " <td>Sarah</td>\n", " <td>0.342905</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c e\n", "0 -0.691674 True Dan -0.958380\n", "1 0.480099 False Bob -0.729580\n", "2 -0.473370 True Xavier -0.767454\n", "3 0.067479 True Alice -0.380205\n", "4 -0.970850 False Sarah 0.342905" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = randomdata(\n", " nrows=10, dtypes={\"a\": float, \"b\": bool, \"c\": str, \"e\": float}, seed=12\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "caef6aac-7d58-41da-9e14-1eb2020f051d", "metadata": {}, "source": [ "Now that we have two numeric columns in our DataFrame, let's write a kernel that uses both of them." 
] }, { "cell_type": "code", "execution_count": 41, "id": "90cbcd85", "metadata": {}, "outputs": [], "source": [ "def conditional_add(x, y, out):\n", " for i, (a, e) in enumerate(zip(x, y)):\n", " if a > 0:\n", " out[i] = a + e\n", " else:\n", " out[i] = a" ] }, { "cell_type": "markdown", "id": "bce045f2", "metadata": {}, "source": [ "Notice that we need to `enumerate` through our `zipped` function arguments (which either match or are mapped to our input column names). We can pass this kernel to `apply_rows`. We'll need to specify a few arguments:\n", "- incols\n", " - A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as `{'col1': 'arg1'}`.\n", "- outcols\n", " - A dictionary defining our output column names and their data types. These names must match our function arguments.\n", "- kwargs (optional)\n", " - We can optionally pass keyword arguments as a dictionary. Since we don't need any, we pass an empty one.\n", " \n", "While it looks like our function is looping sequentially through our columns, it actually executes in parallel in multiple threads on the GPU. This parallelism is the heart of GPU-accelerated computing. With that background, we're ready to use our UDF." 
] }, { "cell_type": "code", "execution_count": 42, "id": "e782daff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>e</th>\n", " <th>out</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.691674</td>\n", " <td>True</td>\n", " <td>Dan</td>\n", " <td>-0.958380</td>\n", " <td>-0.691674</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.480099</td>\n", " <td>False</td>\n", " <td>Bob</td>\n", " <td>-0.729580</td>\n", " <td>-0.249480</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.473370</td>\n", " <td>True</td>\n", " <td>Xavier</td>\n", " <td>-0.767454</td>\n", " <td>-0.473370</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.067479</td>\n", " <td>True</td>\n", " <td>Alice</td>\n", " <td>-0.380205</td>\n", " <td>-0.312726</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.970850</td>\n", " <td>False</td>\n", " <td>Sarah</td>\n", " <td>0.342905</td>\n", " <td>-0.970850</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c e out\n", "0 -0.691674 True Dan -0.958380 -0.691674\n", "1 0.480099 False Bob -0.729580 -0.249480\n", "2 -0.473370 True Xavier -0.767454 -0.473370\n", "3 0.067479 True Alice -0.380205 -0.312726\n", "4 -0.970850 False Sarah 0.342905 -0.970850" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.apply_rows(\n", " conditional_add,\n", " incols={\"a\": \"x\", \"e\": \"y\"},\n", " outcols={\"out\": np.float64},\n", " kwargs={},\n", ")\n", "df.head()" ] }, { 
"cell_type": 
"markdown", "id": "6b838b89", "metadata": {}, "source": [ "As expected, our conditional addition worked. At this point, we've successfully executed UDFs on the core data structures of cuDF." ] }, { "cell_type": "markdown", "id": "fca97003", "metadata": {}, "source": [ "### Null Handling in `apply_rows` and `apply_chunks`\n", "\n", "By default, DataFrame methods for applying UDFs like `apply_rows` handle nulls pessimistically: any row with a null value in a column used by the kernel produces a null in the output. Exploring how non-pessimistic null handling can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below." ] }, { "cell_type": "code", "execution_count": 43, "id": "befd8333", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>963</td>\n", " <td>1005</td>\n", " <td>997</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>977</td>\n", " <td>1026</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td><NA></td>\n", " <td>1026</td>\n", " <td>1019</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1078</td>\n", " <td><NA></td>\n", " <td>985</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>979</td>\n", " <td>982</td>\n", " <td>1011</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 963 1005 997\n", "1 977 1026 <NA>\n", "2 <NA> 1026 1019\n", "3 1078 <NA> 
985\n", "4 979 982 1011" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def gpu_add(a, b, out):\n", " for i, (x, y) in enumerate(zip(a, b)):\n", " out[i] = x + y\n", "\n", "\n", "df = randomdata(nrows=5, dtypes={\"a\": int, \"b\": int, \"c\": int}, seed=12)\n", "df.loc[2, \"a\"] = None\n", "df.loc[3, \"b\"] = None\n", "df.loc[1, \"c\"] = None\n", "df.head()" ] }, { "cell_type": "markdown", "id": "c710ce86", "metadata": {}, "source": [ "In the dataframe above, there are three null values. Each column has a null in a different row. When we use our UDF with `apply_rows`, our output should have two nulls due to pessimistic null handling (because we're not using column `c`, the null value there does not matter to us)." ] }, { "cell_type": "code", "execution_count": 44, "id": "d1f3dcaf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>out</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>963</td>\n", " <td>1005</td>\n", " <td>997</td>\n", " <td>1968.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>977</td>\n", " <td>1026</td>\n", " <td><NA></td>\n", " <td>2003.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td><NA></td>\n", " <td>1026</td>\n", " <td>1019</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1078</td>\n", " <td><NA></td>\n", " <td>985</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>979</td>\n", " <td>982</td>\n", " <td>1011</td>\n", " <td>1961.0</td>\n", " </tr>\n", " 
</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c out\n", "0 963 1005 997 1968.0\n", "1 977 1026 <NA> 2003.0\n", "2 <NA> 1026 1019 <NA>\n", "3 1078 <NA> 985 <NA>\n", "4 979 982 1011 1961.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.apply_rows(\n", " gpu_add, incols=[\"a\", \"b\"], outcols={\"out\": np.float64}, kwargs={}\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "53b9a2f8", "metadata": {}, "source": [ "As expected, we end up with two nulls in our output. The null values from the columns we used propagated to our output, but the null from the column we ignored did not." ] }, { "cell_type": "markdown", "id": "4bbefa67", "metadata": {}, "source": [ "## Rolling Window UDFs\n", "\n", "For time-series data, we may need to operate on a small \\\"window\\\" of our column at a time, processing each portion independently. We could slide (\\\"roll\\\") this window over the entire column to answer questions like \\\"What is the 3-day moving average of a stock price over the past year?\"\n", "\n", "We can apply more complex functions to rolling windows of Series and DataFrames using `apply`. This example is adapted from cuDF's [API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.rolling.html). First, we'll create an example Series and then create a `rolling` object from the Series." 
] }, { "cell_type": "code", "execution_count": 45, "id": "6bc6aea3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 16.0\n", "1 25.0\n", "2 36.0\n", "3 49.0\n", "4 64.0\n", "5 81.0\n", "dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser = cudf.Series([16, 25, 36, 49, 64, 81], dtype=\"float64\")\n", "ser" ] }, { "cell_type": "code", "execution_count": 46, "id": "a4c31df1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Rolling [window=3,min_periods=3,center=False]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rolling = ser.rolling(window=3, min_periods=3, center=False)\n", "rolling" ] }, { "cell_type": "markdown", "id": "ff40d863", "metadata": {}, "source": [ "Next, we'll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values." ] }, { "cell_type": "code", "execution_count": 47, "id": "eb5a081b", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "\n", "def example_func(window):\n", " b = 0\n", " for a in window:\n", " b = max(b, math.sqrt(a))\n", " if b == 8:\n", " return 100\n", " return b" ] }, { "cell_type": "markdown", "id": "df8ba31d", "metadata": {}, "source": [ "We can execute the function by passing it to `apply`. With `window=3`, `min_periods=3`, and `center=False`, our first two values are `null`." 
] }, { "cell_type": "code", "execution_count": 48, "id": "ddec3263", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 <NA>\n", "1 <NA>\n", "2 6.0\n", "3 7.0\n", "4 100.0\n", "5 9.0\n", "dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rolling.apply(example_func)" ] }, { "cell_type": "markdown", "id": "187478db", "metadata": {}, "source": [ "We can apply this function to every column in a DataFrame, too." ] }, { "cell_type": "code", "execution_count": 49, "id": "8b61094a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>55.0</td>\n", " <td>55.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>56.0</td>\n", " <td>56.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>57.0</td>\n", " <td>57.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>58.0</td>\n", " <td>58.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>59.0</td>\n", " <td>59.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 55.0 55.0\n", "1 56.0 56.0\n", "2 57.0 57.0\n", "3 58.0 58.0\n", "4 59.0 59.0" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = cudf.DataFrame()\n", "df2[\"a\"] = np.arange(55, 65, dtype=\"float64\")\n", "df2[\"b\"] = np.arange(55, 65, dtype=\"float64\")\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 50, "id": "bb8c3019", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style 
scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>7.549834435</td>\n", " <td>7.549834435</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>7.615773106</td>\n", " <td>7.615773106</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>7.681145748</td>\n", " <td>7.681145748</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>7.745966692</td>\n", " <td>7.745966692</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>7.810249676</td>\n", " <td>7.810249676</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>7.874007874</td>\n", " <td>7.874007874</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>7.937253933</td>\n", " <td>7.937253933</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>100.0</td>\n", " <td>100.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 <NA> <NA>\n", "1 <NA> <NA>\n", "2 7.549834435 7.549834435\n", "3 7.615773106 7.615773106\n", "4 7.681145748 7.681145748\n", "5 7.745966692 7.745966692\n", "6 7.810249676 7.810249676\n", "7 7.874007874 7.874007874\n", "8 7.937253933 7.937253933\n", "9 100.0 100.0" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rolling = df2.rolling(window=3, min_periods=3, center=False)\n", "rolling.apply(example_func)" ] }, { "cell_type": "markdown", "id": "d4785060", "metadata": {}, "source": [ "## GroupBy DataFrame UDFs\n", "\n", "We can also apply UDFs 
to grouped DataFrames using `apply_grouped`.\n", "\n", "First, we'll group our DataFrame based on column `b`, which is either True or False." ] }, { "cell_type": "code", "execution_count": 51, "id": "3dc272ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.691674</td>\n", " <td>True</td>\n", " <td>Dan</td>\n", " <td>-0.958380</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.480099</td>\n", " <td>False</td>\n", " <td>Bob</td>\n", " <td>-0.729580</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.473370</td>\n", " <td>True</td>\n", " <td>Xavier</td>\n", " <td>-0.767454</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.067479</td>\n", " <td>True</td>\n", " <td>Alice</td>\n", " <td>-0.380205</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.970850</td>\n", " <td>False</td>\n", " <td>Sarah</td>\n", " <td>0.342905</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c e\n", "0 -0.691674 True Dan -0.958380\n", "1 0.480099 False Bob -0.729580\n", "2 -0.473370 True Xavier -0.767454\n", "3 0.067479 True Alice -0.380205\n", "4 -0.970850 False Sarah 0.342905" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = randomdata(\n", " nrows=10, dtypes={\"a\": float, \"b\": bool, \"c\": str, \"e\": float}, seed=12\n", ")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 52, "id": "c0578e0a", "metadata": {}, "outputs": [], "source": [ 
"grouped = df.groupby([\"b\"])" ] }, { "cell_type": "markdown", "id": "4808726f", "metadata": {}, "source": [ "Next we'll define a function to apply to each group independently. In this case, we'll take the rolling average of column `e`, and call that new column `rolling_avg_e`." ] }, { "cell_type": "code", "execution_count": 53, "id": "19f0f7fe", "metadata": {}, "outputs": [], "source": [ "def rolling_avg(e, rolling_avg_e):\n", " win_size = 3\n", " for i in range(cuda.threadIdx.x, len(e), cuda.blockDim.x):\n", " if i < win_size - 1:\n", " # If there is not enough data to fill the window,\n", " # take the average to be NaN\n", " rolling_avg_e[i] = np.nan\n", " else:\n", " total = 0\n", " for j in range(i - win_size + 1, i + 1):\n", " total += e[j]\n", " rolling_avg_e[i] = total / win_size" ] }, { "cell_type": "markdown", "id": "7566f359", "metadata": {}, "source": [ "We can execute this with a very similar API to `apply_rows`. This time, though, it's going to execute independently for each group." 
] }, { "cell_type": "code", "execution_count": 54, "id": "c43426c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " <th>e</th>\n", " <th>rolling_avg_e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>0.480099</td>\n", " <td>False</td>\n", " <td>Bob</td>\n", " <td>-0.729580</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.970850</td>\n", " <td>False</td>\n", " <td>Sarah</td>\n", " <td>0.342905</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.801430</td>\n", " <td>False</td>\n", " <td>Sarah</td>\n", " <td>0.632337</td>\n", " <td>0.081887</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>-0.933157</td>\n", " <td>False</td>\n", " <td>Quinn</td>\n", " <td>-0.420826</td>\n", " <td>0.184805</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.691674</td>\n", " <td>True</td>\n", " <td>Dan</td>\n", " <td>-0.958380</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.473370</td>\n", " <td>True</td>\n", " <td>Xavier</td>\n", " <td>-0.767454</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.067479</td>\n", " <td>True</td>\n", " <td>Alice</td>\n", " <td>-0.380205</td>\n", " <td>-0.702013</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.837494</td>\n", " <td>True</td>\n", " <td>Wendy</td>\n", " <td>-0.057540</td>\n", " <td>-0.401733</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>0.913899</td>\n", " <td>True</td>\n", " <td>Ursula</td>\n", " <td>0.466252</td>\n", " 
<td>0.009502</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>-0.725581</td>\n", " <td>True</td>\n", " <td>George</td>\n", " <td>0.405245</td>\n", " <td>0.271319</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c e rolling_avg_e\n", "1 0.480099 False Bob -0.729580 NaN\n", "4 -0.970850 False Sarah 0.342905 NaN\n", "6 0.801430 False Sarah 0.632337 0.081887\n", "7 -0.933157 False Quinn -0.420826 0.184805\n", "0 -0.691674 True Dan -0.958380 NaN\n", "2 -0.473370 True Xavier -0.767454 NaN\n", "3 0.067479 True Alice -0.380205 -0.702013\n", "5 0.837494 True Wendy -0.057540 -0.401733\n", "8 0.913899 True Ursula 0.466252 0.009502\n", "9 -0.725581 True George 0.405245 0.271319" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = grouped.apply_grouped(\n", " rolling_avg, incols=[\"e\"], outcols=dict(rolling_avg_e=np.float64)\n", ")\n", "results" ] }, { "cell_type": "markdown", "id": "c8511306", "metadata": {}, "source": [ "Notice how, with a window size of three in the kernel, the first two values in each group of our output column are NaN, because those positions do not have enough data to fill the window." ] }, { "cell_type": "markdown", "id": "0060678c", "metadata": {}, "source": [ "## Numba Kernels on CuPy Arrays\n", "\n", "We can also execute Numba kernels on CuPy NDArrays, again thanks to the `__cuda_array_interface__`. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series." ] }, { "cell_type": "code", "execution_count": 55, "id": "aa6a8509", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1., 2., 3., 4., 10.])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import cupy as cp\n", "\n", "s = cudf.Series([1.0, 2, 3, 4, 10])\n", "arr = cp.asarray(s)\n", "arr" ] }, { "cell_type": "markdown", "id": "0fed556f", "metadata": {}, "source": [ "Next, we define a UDF and execute it on our Series. 
We need to allocate a Series of the same size for our output, which we'll call `out`." ] }, { "cell_type": "code", "execution_count": 56, "id": "0bb8bf93", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5\n", "1 10\n", "2 15\n", "3 20\n", "4 50\n", "dtype: int32" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@cuda.jit\n", "def multiply_by_5(x, out):\n", " i = cuda.grid(1)\n", " if i < x.size:\n", " out[i] = x[i] * 5\n", "\n", "\n", "out = cudf.Series(cp.zeros(len(s), dtype=\"int32\"))\n", "multiply_by_5.forall(s.shape[0])(s, out)\n", "out" ] }, { "cell_type": "markdown", "id": "a857b169", "metadata": {}, "source": [ "Finally, we execute the same function on our array. We allocate an empty array `out` to store our results." ] }, { "cell_type": "code", "execution_count": 57, "id": "ce60b639", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5., 10., 15., 20., 50.])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = cp.empty_like(arr)\n", "multiply_by_5.forall(arr.size)(arr, out)\n", "out" ] }, { "cell_type": "markdown", "id": "b899d51c", "metadata": {}, "source": [ "## Caveats" ] }, { "cell_type": "markdown", "id": "fe7eb68b", "metadata": {}, "source": [ "- UDFs are currently only supported for numeric non-decimal scalar types (full support) and strings in `Series.apply` and `DataFrame.apply` (partial support, subject to the caveats outlined above). Attempting to use this API with unsupported types will raise a `TypeError`.\n", "- We do not yet fully support all arithmetic operators. Certain operations, such as bitwise operations, are not currently implemented but are planned for future releases. If you need an operator, please raise a GitHub issue so that it can be properly prioritized and implemented." ] }, { "cell_type": "markdown", "id": "c690563b", "metadata": {}, "source": [ "## Summary\n", "\n", "This guide has covered a lot of content. 
At this point, you should feel comfortable writing UDFs (with or without null values) that operate on\n", "\n", "- Series\n", "- DataFrame\n", "- Rolling Windows\n", "- GroupBy DataFrames\n", "- CuPy NDArrays\n", "- Numba DeviceNDArrays\n", "- Generalized NA UDFs\n", "- String UDFs\n", "\n", "\n", "For more information, please see the [cuDF](https://docs.rapids.ai/api/cudf/nightly/), [Numba.cuda](https://numba.readthedocs.io/en/stable/cuda/index.html), and [CuPy](https://docs-cupy.chainer.org/en/stable/) documentation." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }