Perform a chunked "pack" operation of the input table_view using a user provided buffer of size user_buffer_size.
This operation is intended to be used in a streamed fashion when the GPU is out of memory, where we want to minimize both the number of small cudaMemcpy calls and the tracking of all the metadata associated with cudf tables. Because of these memory constraints, all thrust and scratch memory allocations use the passed-in memory resource exclusively, not the per-device memory resource.
This class defines two methods that must be used in concert to carry out the chunked_pack: has_next and next. Here is an example:
// Create a table_view
cudf::table_view tv = ...;
// Choose a memory resource (optional). This memory resource is used for scratch/thrust temporary
// data. In memory constrained cases, this can be used to set aside scratch memory
// for `chunked_pack` at the beginning of a program.
auto mr = rmm::mr::get_current_device_resource();
// Define a buffer size for each chunk: the larger the buffer is, the more SMs can be
// occupied by this algorithm.
//
// Internally, the GPU unit of work is a 1MB batch. When we instantiate `cudf::chunked_pack`,
// all the 1MB batches for the source table_view are computed up front. Additionally,
// chunked_pack calculates the number of iterations that are required to go through all those
// batches given a `user_buffer_size` buffer. The number of 1MB batches in each iteration (chunk)
// equals the number of CUDA blocks that will be used for the main kernel launch.
//
std::size_t user_buffer_size = 128*1024*1024;
auto chunked_packer = cudf::chunked_pack::create(tv, user_buffer_size, mr);
std::size_t host_offset = 0;
auto host_buffer = ...; // obtain a host buffer you would like to copy to
while (chunked_packer->has_next()) {
  // get a user buffer of size `user_buffer_size`
  cudf::device_span<uint8_t> user_buffer = ...;
  std::size_t bytes_copied = chunked_packer->next(user_buffer);
  // `user_buffer` now holds `bytes_copied` bytes (at most `user_buffer_size`)
  // of the contiguously packed input `table_view`. You are now free to copy
  // this memory somewhere else, for example, to host.
  // `stream` is a CUDA stream provided by the caller.
  cudaMemcpyAsync(
    host_buffer.data() + host_offset,
    user_buffer.data(),
    bytes_copied,
    cudaMemcpyDefault,
    stream);
  host_offset += bytes_copied;
}
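The has_next/next contract shown above can be exercised without a GPU. The following is a minimal host-only sketch, not the cudf implementation: `mock_chunked_pack`, `drain_result`, and `drain` are hypothetical names introduced purely for illustration, and the mock assumes that every chunk except the last fills the buffer completely, which the real `chunked_pack` does not promise. It only shows how `bytes_copied` and `host_offset` interact across iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-only stand-in for cudf::chunked_pack. It mimics the
// has_next()/next() contract only; it does no packing and uses no GPU memory.
class mock_chunked_pack {
 public:
  mock_chunked_pack(std::vector<std::uint8_t> packed, std::size_t user_buffer_size)
    : packed_(std::move(packed)), user_buffer_size_(user_buffer_size)
  {
    assert(user_buffer_size_ > 0);
  }

  // True while some of the packed bytes have not been handed out yet.
  bool has_next() const { return offset_ < packed_.size(); }

  // Total number of packed bytes this mock will produce.
  std::size_t total_size() const { return packed_.size(); }

  // Copies the next chunk into `user_buffer` and returns the bytes written.
  // Assumption of this mock: the buffer holds at least `user_buffer_size` bytes.
  std::size_t next(std::vector<std::uint8_t>& user_buffer)
  {
    assert(user_buffer.size() >= user_buffer_size_);
    std::size_t const n = std::min(user_buffer_size_, packed_.size() - offset_);
    std::copy_n(packed_.data() + offset_, n, user_buffer.data());
    offset_ += n;
    return n;
  }

 private:
  std::vector<std::uint8_t> packed_;
  std::size_t user_buffer_size_;
  std::size_t offset_ = 0;
};

struct drain_result {
  std::vector<std::uint8_t> host;  // everything copied out, in order
  std::size_t iterations;          // number of next() calls made
};

// Same loop shape as the example above: reuse one bounce buffer and advance
// the destination offset by the number of bytes each next() call reports.
drain_result drain(mock_chunked_pack& packer, std::size_t user_buffer_size)
{
  std::vector<std::uint8_t> host_buffer(packer.total_size());
  std::vector<std::uint8_t> user_buffer(user_buffer_size);
  std::size_t host_offset = 0;
  std::size_t iterations  = 0;
  while (packer.has_next()) {
    std::size_t const bytes_copied = packer.next(user_buffer);
    std::copy_n(user_buffer.data(), bytes_copied, host_buffer.data() + host_offset);
    host_offset += bytes_copied;
    ++iterations;
  }
  host_buffer.resize(host_offset);
  return {std::move(host_buffer), iterations};
}
```

With 10 packed bytes and a 4-byte buffer, this mock yields chunks of 4, 4, and 2 bytes. The key point carried over from the real API is that the destination offset advances by the value returned from `next`, not by the buffer size, because the final chunk is usually shorter.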
Definition at line 185 of file contiguous_split.hpp.