cudf.DataFrame.to_parquet#
- DataFrame.to_parquet(path, engine='cudf', compression='snappy', index=None, partition_cols=None, partition_file_name=None, partition_offsets=None, statistics='ROWGROUP', metadata_file_path=None, int96_timestamps=False, row_group_size_bytes=134217728, row_group_size_rows=None, max_page_size_bytes=None, max_page_size_rows=None, storage_options=None, return_metadata=False, *args, **kwargs)#
Write a DataFrame to the parquet format.
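A minimal usage sketch, assuming a working cudf installation; the file name example.parquet is illustrative. It writes a small dataframe with the defaults shown in the signature above and reads it back:

>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.to_parquet("example.parquet")  # snappy compression, ROWGROUP statistics
>>> cudf.read_parquet("example.parquet")  # round-trip check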
- Parameters
- path : str or list of str
File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset. Use list of str with partition_offsets to write parts of the dataframe to different files.
- compression : {‘snappy’, ‘ZSTD’, None}, default ‘snappy’
Name of the compression to use. Use None for no compression.
- index : bool, default None
If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True, the dataframe’s index(es) will be saved; however, instead of being saved as values, any RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.
- partition_cols : list, optional, default None
Column names by which to partition the dataset. Columns are partitioned in the order they are given. See the partitioned-write sketch after this parameter list.
- partition_file_name : str, optional, default None
File name to use for partitioned datasets. Different partitions will be written to different directories, but all files will have this name. If nothing is specified, a random uuid4 hex string will be used for each file.
- partition_offsets : list, optional, default None
Offsets to partition the dataframe by. Should be used when path is a list of str. Should be a list of integers of size len(path) + 1. See the offsets sketch after this parameter list.
- statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’
Level at which column statistics should be included in the file.
- metadata_file_path : str, optional, default None
If specified, this function will return a binary blob containing the footer metadata of the written parquet file. The returned blob will have the chunk.file_path field set to the metadata_file_path for each chunk. When used with partition_offsets, should be the same size as len(path). See the footer-metadata sketch after this parameter list.
- int96_timestamps : bool, default False
If True, write timestamps in int96 format. This will convert timestamps from timestamp[ns], timestamp[ms], timestamp[s], and timestamp[us] to the int96 format, which is the number of Julian days and the number of nanoseconds since midnight of 1970-01-01. If False, timestamps will not be altered.
- row_group_size_bytes : integer, default 134217728
Maximum size, in bytes, of each row group of the output. If None, 134217728 (128 MB) will be used. See the size-tuning sketch after this parameter list.
- row_group_size_rows : integer or None, default None
Maximum number of rows of each row group of the output. If None, 1000000 will be used.
- max_page_size_bytes : integer or None, default None
Maximum uncompressed size of each page of the output. If None, 524288 (512 KB) will be used.
- max_page_size_rows : integer or None, default None
Maximum number of rows of each page of the output. If None, 20000 will be used.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” or “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
- return_metadata : bool, default False
Return parquet metadata for written data. The returned metadata will include the file path metadata (relative to root_path). To request the metadata binary blob when using partition_cols, pass return_metadata=True instead of specifying metadata_file_path.
- **kwargs
Additional parameters will be passed to execution engines other than cudf.
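Examples

A hedged sketch of a partitioned write, based on the partition_cols and partition_file_name descriptions above; dataset_root and chunk.parquet are illustrative names. path is treated as the root directory, and one subdirectory per distinct key is created (e.g. dataset_root/year=2024/):

>>> import cudf
>>> df = cudf.DataFrame({"year": [2023, 2023, 2024], "value": [1.0, 2.0, 3.0]})
>>> df.to_parquet(
...     "dataset_root",
...     partition_cols=["year"],
...     partition_file_name="chunk.parquet",  # otherwise each file gets a random uuid4 hex name
... )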
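A sketch of splitting one dataframe across explicit files by row offset, following the documented contract that partition_offsets holds len(path) + 1 integers; the file names are illustrative. Rows [0, 2) go to the first file and rows [2, 4) to the second:

>>> import cudf
>>> df = cudf.DataFrame({"a": [10, 20, 30, 40]})
>>> df.to_parquet(
...     ["part-0.parquet", "part-1.parquet"],  # two files, so three offsets
...     partition_offsets=[0, 2, 4],
... )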
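A sketch of collecting the footer-metadata blob, assuming metadata_file_path behaves exactly as documented above; data.parquet is an illustrative name:

>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3]})
>>> footer = df.to_parquet("data.parquet", metadata_file_path="data.parquet")
>>> # footer is the binary blob described above; each chunk's file_path is "data.parquet"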
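A sketch of overriding the sizing defaults listed above (128 MB row groups, 1000000 rows per row group, 512 KB pages); the values here are illustrative, not tuning advice:

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({"a": np.arange(1_000_000)})
>>> df.to_parquet(
...     "tuned.parquet",
...     row_group_size_rows=100_000,  # cap rows per row group
...     max_page_size_bytes=262_144,  # 256 KB uncompressed pages
...     statistics="PAGE",  # per-page column statistics
... )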
See also