Parallelize Serialization of Columns within Parquet RowGroups (#7655)
* merge main

* fixes and cmt

* review comments, tuning parameters, updating docs

* cargo fmt

* reduce default buffer size to 2 and update docs
devinjdangelo authored Oct 25, 2023
1 parent b16cd93 commit 148f890
Showing 4 changed files with 439 additions and 278 deletions.
32 changes: 26 additions & 6 deletions datafusion/common/src/config.rs
@@ -377,12 +377,32 @@ config_namespace! {
 pub bloom_filter_ndv: Option<u64>, default = None

 /// Controls whether DataFusion will attempt to speed up writing
-/// large parquet files by first writing multiple smaller files
-/// and then stitching them together into a single large file.
-/// This will result in faster write speeds, but higher memory usage.
-/// Also currently unsupported are bloom filters and column indexes
-/// when single_file_parallelism is enabled.
-pub allow_single_file_parallelism: bool, default = false
+/// parquet files by serializing them in parallel. Each column
+/// in each row group in each output file are serialized in parallel
+/// leveraging a maximum possible core count of n_files*n_row_groups*n_columns.
+pub allow_single_file_parallelism: bool, default = true
+
+/// By default parallel parquet writer is tuned for minimum
+/// memory usage in a streaming execution plan. You may see
+/// a performance benefit when writing large parquet files
+/// by increasing maximum_parallel_row_group_writers and
+/// maximum_buffered_record_batches_per_stream if your system
+/// has idle cores and can tolerate additional memory usage.
+/// Boosting these values is likely worthwhile when
+/// writing out already in-memory data, such as from a cached
+/// data frame.
+pub maximum_parallel_row_group_writers: usize, default = 1
+
+/// By default parallel parquet writer is tuned for minimum
+/// memory usage in a streaming execution plan. You may see
+/// a performance benefit when writing large parquet files
+/// by increasing maximum_parallel_row_group_writers and
+/// maximum_buffered_record_batches_per_stream if your system
+/// has idle cores and can tolerate additional memory usage.
+/// Boosting these values is likely worthwhile when
+/// writing out already in-memory data, such as from a cached
+/// data frame.
+pub maximum_buffered_record_batches_per_stream: usize, default = 2

 }
 }
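
To make the trade-off described in the doc comments concrete, here is a minimal standalone sketch of how the three options might bound parallelism and buffering. It does not use DataFusion's real types: `ParallelWriteOptions`, `max_column_tasks`, and `max_buffered_batches` are hypothetical names, and the formulas assume that at most `maximum_parallel_row_group_writers` row groups per file are serialized at once and that each column task has its own bounded buffer. Only the default values are taken from the diff.

```rust
/// Hypothetical mirror of the three options touched by this commit.
/// This is NOT DataFusion's actual `ParquetOptions` type.
#[derive(Debug, Clone, Copy)]
struct ParallelWriteOptions {
    allow_single_file_parallelism: bool,
    maximum_parallel_row_group_writers: usize,
    maximum_buffered_record_batches_per_stream: usize,
}

impl Default for ParallelWriteOptions {
    fn default() -> Self {
        // Default values as they appear in the diff.
        Self {
            allow_single_file_parallelism: true,
            maximum_parallel_row_group_writers: 1,
            maximum_buffered_record_batches_per_stream: 2,
        }
    }
}

/// Upper bound on column serialization tasks that can run at once.
/// Assumption: the number of concurrently serialized row groups per file
/// is capped by `maximum_parallel_row_group_writers`.
fn max_column_tasks(
    opts: &ParallelWriteOptions,
    n_files: usize,
    n_row_groups: usize,
    n_columns: usize,
) -> usize {
    if !opts.allow_single_file_parallelism {
        // Assumption: with the flag off, each file is written by one serial task.
        return n_files;
    }
    let active_row_groups = n_row_groups.min(opts.maximum_parallel_row_group_writers);
    n_files * active_row_groups * n_columns
}

/// Rough upper bound on buffered items in flight.
/// Assumption: one bounded buffer per column task.
fn max_buffered_batches(
    opts: &ParallelWriteOptions,
    n_files: usize,
    n_row_groups: usize,
    n_columns: usize,
) -> usize {
    max_column_tasks(opts, n_files, n_row_groups, n_columns)
        * opts.maximum_buffered_record_batches_per_stream
}

fn main() {
    let defaults = ParallelWriteOptions::default();
    // Illustrative tuning for already in-memory data; the values are arbitrary.
    let tuned = ParallelWriteOptions {
        maximum_parallel_row_group_writers: 4,
        maximum_buffered_record_batches_per_stream: 8,
        ..defaults
    };

    // Example workload: 1 output file, 10 row groups, 20 columns.
    for (name, opts) in [("default", defaults), ("tuned", tuned)] {
        println!(
            "{name}: up to {} column tasks, up to {} buffered batches",
            max_column_tasks(&opts, 1, 10, 20),
            max_buffered_batches(&opts, 1, 10, 20),
        );
    }
}
```

Under these assumptions, raising either option multiplies the memory bound along with the usable core count. That is consistent with the doc comments' advice to increase them only when cores are idle and extra memory use is acceptable.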