From e47509abb3b94b2db40d5d9b8080657839609488 Mon Sep 17 00:00:00 2001
From: PHILO-HE
Date: Mon, 11 Mar 2024 19:49:05 +0800
Subject: [PATCH] [DOC] Update release & configuration doc (#4910)

---
 .github/workflows/dev_cron/issues_link.js | 2 +-
 .github/workflows/dev_cron/title_check.md | 4 +-
 CONTRIBUTING.md | 10 +-
 README.md | 10 +-
 docs/Configuration.md | 129 ++++++++++--------
 docs/_config.yml | 2 +-
 docs/contact-us.md | 2 +-
 docs/developers/HowTo.md | 2 +-
 docs/developers/MicroBenchmarks.md | 6 +-
 docs/developers/NewToGluten.md | 4 +-
 docs/developers/SubstraitModifications.md | 24 ++--
 docs/developers/docker_centos7.md | 2 +-
 docs/developers/docker_centos8.md | 2 +-
 docs/developers/docker_ubuntu22.04.md | 2 +-
 docs/get-started/ClickHouse.md | 8 +-
 docs/get-started/Velox.md | 6 +-
 docs/index.md | 2 +-
 docs/release.md | 9 +-
 docs/velox-backend-limitations.md | 4 +-
 .../extension/ColumnarOverrides.scala | 6 +-
 mkdocs.yml | 2 +-
 pom.xml | 2 +-
 .../scala/io/glutenproject/GlutenConfig.scala | 14 +-
 tools/gluten-it/README.md | 4 +-
 tools/gluten-te/centos/defaults.conf | 2 +-
 tools/gluten-te/ubuntu/README.md | 8 +-
 tools/gluten-te/ubuntu/defaults.conf | 2 +-
 27 files changed, 145 insertions(+), 125 deletions(-)

diff --git a/.github/workflows/dev_cron/issues_link.js b/.github/workflows/dev_cron/issues_link.js
index 0b79b91a7c15..596bad758532 100644
--- a/.github/workflows/dev_cron/issues_link.js
+++ b/.github/workflows/dev_cron/issues_link.js
@@ -48,7 +48,7 @@ async function haveComment(github, context, pullRequestNumber, body) {
 }

 async function commentISSUESURL(github, context, pullRequestNumber, issuesID) {
-  const issuesURL = `https://github.com/oap-project/gluten/issues/${issuesID}`;
+  const issuesURL = `https://github.com/apache/incubator-gluten/issues/${issuesID}`;
   if (await haveComment(github, context, pullRequestNumber, issuesURL)) {
     return;
   }
diff --git a/.github/workflows/dev_cron/title_check.md b/.github/workflows/dev_cron/title_check.md
index 6fb45bf646ef..83d4937ed2a0 100644
--- a/.github/workflows/dev_cron/title_check.md
+++ b/.github/workflows/dev_cron/title_check.md
@@ -21,7 +21,7 @@ Thanks for opening a pull request!

 Could you open an issue for this pull request on Github Issues?

-https://github.com/oap-project/gluten/issues
+https://github.com/apache/incubator-gluten/issues

 Then could you also rename ***commit message*** and ***pull request title*** in the following format?

@@ -29,5 +29,5 @@ Then could you also rename ***commit message*** and ***pull request title*** in

 See also:

-  * [Other pull requests](https://github.com/oap-project/gluten/pulls/)
+  * [Other pull requests](https://github.com/apache/incubator-gluten/pulls/)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index b934cdf78697..9450191dd4cb 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -44,15 +44,15 @@ please add at least one UT to ensure code quality and reduce regression issues f

 Please update document for your proposed code change if necessary.

-If a new config property is being introduced, please update [Configuration.md](https://github.com/oap-project/gluten/blob/main/docs/Configuration.md).
+If a new config property is being introduced, please update [Configuration.md](https://github.com/apache/incubator-gluten/blob/main/docs/Configuration.md).

 ### Code Style

 ##### Java/Scala code style
-Developer can import the code style setting to IDE and format Java/Scala code with spotless maven plugin. See [Java/Scala code style](https://github.com/oap-project/gluten/blob/main/docs/developers/NewToGluten.md#javascala-code-style).
+Developer can import the code style setting to IDE and format Java/Scala code with spotless maven plugin. See [Java/Scala code style](https://github.com/apache/incubator-gluten/blob/main/docs/developers/NewToGluten.md#javascala-code-style).

 ##### C/C++ code style
-There are some code style conventions need to comply. See [CppCodingStyle.md](https://github.com/oap-project/gluten/blob/main/docs/developers/CppCodingStyle.md).
+There are some code style conventions need to comply. See [CppCodingStyle.md](https://github.com/apache/incubator-gluten/blob/main/docs/developers/CppCodingStyle.md).
 For Velox backend, developer can just execute `dev/formatcppcode.sh` to format C/C++ code. It requires `clang-format-12` installed in your development env.

@@ -68,7 +68,7 @@ You can execute a script to fix license header issue, as the following shows.
 ### Gluten CI

 ##### ClickHouse Backend CI
-To check CI failure for CH backend, please log in with the public account/password provided [here](https://github.com/oap-project/gluten/blob/main/docs/get-started/ClickHouse.md#new-ci-system).
+To check CI failure for CH backend, please log in with the public account/password provided [here](https://github.com/apache/incubator-gluten/blob/main/docs/get-started/ClickHouse.md#new-ci-system).
 To re-trigger CH CI, please post the below comment on PR page:

 `Run Gluten Clickhouse CI`
@@ -79,7 +79,7 @@ To check CI failure for Velox backend, please go into the GitHub action page fro
 To see the perf. impact on Velox backend, you can comment `/Benchmark Velox` on PR page to trigger a pretest. The benchmark (currently TPC-H) result will be posted after completed.

-If some new dependency is required to be installed, you may need to do some change for CI docker at [this folder](https://github.com/oap-project/gluten/tree/main/tools/gluten-te).
+If some new dependency is required to be installed, you may need to do some change for CI docker at [this folder](https://github.com/apache/incubator-gluten/tree/main/tools/gluten-te).
### Code Review diff --git a/README.md b/README.md index cc88404d9243..0a18d10c8147 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ -# Gluten: Plugin to Double SparkSQL's Performance +# Apache Gluten (Incubating): A Middle Layer for Offloading JVM-based SQL Engines' Execution to Native Engines + [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8452/badge)](https://www.bestpractices.dev/projects/8452) + *This project is still under active development now, and doesn't have a stable release. Welcome to evaluate it.* # 1 Introduction @@ -30,7 +32,7 @@ The basic rule of Gluten's design is that we would reuse spark's whole control f ## 1.3 Target User Gluten's target user is anyone who wants to accelerate SparkSQL fundamentally. As a plugin to Spark, Gluten doesn't require any change for dataframe API or SQL query, but only requires user to make correct configuration. -See Gluten configuration properties [here](https://github.com/oap-project/gluten/blob/main/docs/Configuration.md). +See Gluten configuration properties [here](https://github.com/apache/incubator-gluten/blob/main/docs/Configuration.md). ## 1.4 References @@ -72,7 +74,7 @@ spark-shell \ --conf spark.memory.offHeap.enabled=true \ --conf spark.memory.offHeap.size=20g \ --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \ - --jars https://github.com/oap-project/gluten/releases/download/v1.0.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04_x86_64-1.0.0.jar + --jars https://github.com/apache/incubator-gluten/releases/download/v1.0.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04_x86_64-1.0.0.jar ``` # 3.2 Custom Build @@ -118,7 +120,7 @@ Please feel free to create Github issue for reporting bug or proposing enhanceme ## 4.3 Documentation -Currently, all gluten documents are held at [docs](https://github.com/oap-project/gluten/tree/main/docs). The documents may not reflect the latest designs. 
Please feel free to contact us for getting design details or sharing your design ideas. +Currently, all gluten documents are held at [docs](https://github.com/apache/incubator-gluten/tree/main/docs). The documents may not reflect the latest designs. Please feel free to contact us for getting design details or sharing your design ideas. # 5 Performance diff --git a/docs/Configuration.md b/docs/Configuration.md index a6e0b7015b60..626000bc4144 100644 --- a/docs/Configuration.md +++ b/docs/Configuration.md @@ -11,75 +11,90 @@ You can add these configurations into spark-defaults.conf to enable or disable t ## Spark parameters -| Parameters | Description | Recommend Setting | -|----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------| -| spark.driver.extraClassPath | To add Gluten Plugin jar file in Spark Driver | /path/to/jar_file | -| spark.executor.extraClassPath | To add Gluten Plugin jar file in Spark Executor | /path/to/jar_file | -| spark.executor.memory | To set up how much memory to be used for Spark Executor. | | -| spark.memory.offHeap.size | To set up how much memory to be used for Java OffHeap.
Please notice Gluten Plugin will leverage this setting to allocate memory space for native usage even offHeap is disabled.
The value is based on your system and it is recommended to set it larger if you are facing Out of Memory issue in Gluten Plugin | 30G | -| spark.sql.sources.useV1SourceList | Choose to use V1 source | avro | -| spark.sql.join.preferSortMergeJoin | To turn off preferSortMergeJoin in Spark | false | -| spark.plugins | To load Gluten's components by Spark's plug-in loader | com.intel.oap.GlutenPlugin | -| spark.shuffle.manager | To turn on Gluten Columnar Shuffle Plugin | org.apache.spark.shuffle.sort.ColumnarShuffleManager | -| spark.gluten.enabled | Enable Gluten, default is true. Just an experimental property. Recommend to enable/disable Gluten through the setting for `spark.plugins`. | true | -| spark.gluten.sql.columnar.maxBatchSize | Number of rows to be processed in each batch. Default value is 4096. | 4096 | -| spark.gluten.memory.isolation | (Experimental) Enable isolated memory mode. If true, Gluten controls the maximum off-heap memory can be used by each task to X, X = executor memory / max task slots. It's recommended to set true if Gluten serves concurrent queries within a single session, since not all memory Gluten allocated is guaranteed to be spillable. In the case, the feature should be enabled to avoid OOM. Note when true, setting spark.memory.storageFraction to a lower value is suggested since storage memory is considered non-usable by Gluten. | false | -| spark.gluten.sql.columnar.scanOnly | When enabled, this config will overwrite all other operators' enabling, and only Scan and Filter pushdown will be offloaded to native. 
| false | -| spark.gluten.sql.columnar.batchscan | Enable or Disable Columnar BatchScan, default is true | true | -| spark.gluten.sql.columnar.hashagg | Enable or Disable Columnar Hash Aggregate, default is true | true | -| spark.gluten.sql.columnar.project | Enable or Disable Columnar Project, default is true | true | -| spark.gluten.sql.columnar.filter | Enable or Disable Columnar Filter, default is true | true | -| spark.gluten.sql.columnar.sort | Enable or Disable Columnar Sort, default is true | true | -| spark.gluten.sql.columnar.window | Enable or Disable Columnar Window, default is true | true | -| spark.gluten.sql.columnar.shuffledHashJoin | Enable or Disable ShuffledHashJoin, default is true | true | -| spark.gluten.sql.columnar.forceShuffledHashJoin | Force to use ShuffledHashJoin over SortMergeJoin, default is true. For queries that can benefit from storaged patitioned join, please set it to false. | true | -| spark.gluten.sql.columnar.sortMergeJoin | Enable or Disable Columnar Sort Merge Join, default is true | true | -| spark.gluten.sql.columnar.union | Enable or Disable Columnar Union, default is true | true | -| spark.gluten.sql.columnar.expand | Enable or Disable Columnar Expand, default is true | true | -| spark.gluten.sql.columnar.generate | Enable or Disable Columnar Generate, default is true | true | -| spark.gluten.sql.columnar.limit | Enable or Disable Columnar Limit, default is true | true | -| spark.gluten.sql.columnar.tableCache | Enable or Disable Columnar Table Cache, default is false | true | -| spark.gluten.sql.columnar.broadcastExchange | Enable or Disable Columnar Broadcast Exchange, default is true | true | -| spark.gluten.sql.columnar.broadcastJoin | Enable or Disable Columnar BroadcastHashJoin, default is true | true | -| spark.gluten.sql.columnar.shuffle.codec | Set up the codec to be used for Columnar Shuffle. If this configuration is not set, will check the value of spark.io.compression.codec. 
By default, Gluten use software compression. Valid options for software compression are lz4, zstd. Valid options for QAT and IAA is gzip. | lz4 | -| spark.gluten.sql.columnar.shuffle.codecBackend | Enable using hardware accelerators for shuffle de/compression. Valid options are QAT and IAA. | | -| spark.gluten.sql.columnar.shuffle.compressionMode | Setting different compression mode in shuffle, Valid options are buffer and rowvector, buffer option compress each buffer of RowVector individually into one pre-allocated large buffer, rowvector option first copies each buffer of RowVector to a large buffer and then compress the entire buffer in one go. | buffer | -| spark.gluten.sql.columnar.shuffle.compression.threshold | If number of rows in a batch falls below this threshold, will copy all buffers into one buffer to compress. | 100 | -| spark.gluten.sql.columnar.shuffle.realloc.threshold | Set the threshold to dynamically adjust the size of shuffle split buffers. The size of each split buffer is recalculated for each incoming batch of data. If the new size deviates from the current partition buffer size by a factor outside the range of [1 - threshold, 1 + threshold], the split buffer will be re-allocated using the newly calculated size | 0.25 | -| spark.gluten.sql.columnar.shuffle.merge.threshold | Set the threshold control the minimum merged size. When a partition buffer is full, and the number of rows is below (`threshold * spark.gluten.sql.columnar.maxBatchSize`), it will be saved for merging. | 0.25 | -| spark.gluten.sql.columnar.numaBinding | Set up NUMABinding, default is false | true | -| spark.gluten.sql.columnar.coreRange | Set up the core range for NUMABinding, only works when numaBinding set to true.
The setting is based on the number of cores in your system. Use 72 cores as an example. | 0-17,36-53 |18-35,54-71 | -| spark.gluten.sql.native.bloomFilter | Enable or Disable native runtime bloom filter. | true | -| spark.gluten.sql.columnar.wholeStage.fallback.threshold | Configure the threshold for whether whole stage will fall back in AQE supported case by counting the number of ColumnarToRow & vanilla leaf node | \>= 1 | -| spark.gluten.sql.columnar.query.fallback.threshold | Configure the threshold for whether query will fall back by counting the number of ColumnarToRow & vanilla leaf node | \>= 1 | -| spark.gluten.sql.columnar.fallback.ignoreRowToColumnar | When true, the fallback policy ignores the RowToColumnar when counting fallback number. | true | -| spark.gluten.sql.columnar.fallback.preferColumnar | When true, the fallback policy prefers to use Gluten plan rather than vanilla Spark plan if the both of them contains ColumnarToRow and the vanilla Spark plan ColumnarToRow number is not smaller than Gluten plan. | true | -| spark.gluten.sql.columnar.maxBatchSize | Set the number of rows for the output batch | 4096 | -| spark.gluten.shuffleWriter.bufferSize | Set the number of buffer rows for the shuffle writer | value of spark.gluten.sql.columnar.maxBatchSize | -| spark.gluten.loadLibFromJar | Controls whether to load dynamic link library from a packed jar for gluten/cpp. Not applicable to static build and clickhouse backend. | false | -| spark.gluten.sql.columnar.force.hashagg | Force to use hash agg to replace sort agg. | true | -| spark.gluten.sql.columnar.vanillaReaders | Enable vanilla spark's vectorized reader. Please note it may bring perf. overhead due to extra data transition. We recommend to disable it if most queries can be fully offloaded to gluten. | false | -| spark.gluten.expression.blacklist | A black list of expression to skip transform, multiple values separated by commas. 
| | +| Parameters | Description | Recommend Setting | +|------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------| +| spark.driver.extraClassPath | To add Gluten Plugin jar file in Spark Driver | /path/to/jar_file | +| spark.executor.extraClassPath | To add Gluten Plugin jar file in Spark Executor | /path/to/jar_file | +| spark.executor.memory | To set up how much memory to be used for Spark Executor. | | +| spark.memory.offHeap.size | To set up how much memory to be used for Java OffHeap.
Please notice Gluten Plugin will leverage this setting to allocate memory space for native usage even offHeap is disabled.
The value is based on your system and it is recommended to set it larger if you are facing Out of Memory issue in Gluten Plugin | 30G | +| spark.sql.sources.useV1SourceList | Choose to use V1 source | avro | +| spark.sql.join.preferSortMergeJoin | To turn off preferSortMergeJoin in Spark | false | +| spark.plugins | To load Gluten's components by Spark's plug-in loader | io.glutenproject.GlutenPlugin | +| spark.shuffle.manager | To turn on Gluten Columnar Shuffle Plugin | org.apache.spark.shuffle.sort.ColumnarShuffleManager | +| spark.gluten.enabled | Enable Gluten, default is true. Just an experimental property. Recommend to enable/disable Gluten through the setting for `spark.plugins`. | true | +| spark.gluten.sql.columnar.maxBatchSize | Number of rows to be processed in each batch. Default value is 4096. | 4096 | +| spark.gluten.memory.isolation | (Experimental) Enable isolated memory mode. If true, Gluten controls the maximum off-heap memory can be used by each task to X, X = executor memory / max task slots. It's recommended to set true if Gluten serves concurrent queries within a single session, since not all memory Gluten allocated is guaranteed to be spillable. In the case, the feature should be enabled to avoid OOM. Note when true, setting spark.memory.storageFraction to a lower value is suggested since storage memory is considered non-usable by Gluten. | false | +| spark.gluten.sql.columnar.scanOnly | When enabled, this config will overwrite all other operators' enabling, and only Scan and Filter pushdown will be offloaded to native. 
| false | +| spark.gluten.sql.columnar.batchscan | Enable or Disable Columnar BatchScan, default is true | true | +| spark.gluten.sql.columnar.hashagg | Enable or Disable Columnar Hash Aggregate, default is true | true | +| spark.gluten.sql.columnar.project | Enable or Disable Columnar Project, default is true | true | +| spark.gluten.sql.columnar.filter | Enable or Disable Columnar Filter, default is true | true | +| spark.gluten.sql.columnar.sort | Enable or Disable Columnar Sort, default is true | true | +| spark.gluten.sql.columnar.window | Enable or Disable Columnar Window, default is true | true | +| spark.gluten.sql.columnar.shuffledHashJoin | Enable or Disable ShuffledHashJoin, default is true | true | +| spark.gluten.sql.columnar.forceShuffledHashJoin | Force to use ShuffledHashJoin over SortMergeJoin, default is true. For queries that can benefit from storaged patitioned join, please set it to false. | true | +| spark.gluten.sql.columnar.sortMergeJoin | Enable or Disable Columnar Sort Merge Join, default is true | true | +| spark.gluten.sql.columnar.union | Enable or Disable Columnar Union, default is true | true | +| spark.gluten.sql.columnar.expand | Enable or Disable Columnar Expand, default is true | true | +| spark.gluten.sql.columnar.generate | Enable or Disable Columnar Generate, default is true | true | +| spark.gluten.sql.columnar.limit | Enable or Disable Columnar Limit, default is true | true | +| spark.gluten.sql.columnar.tableCache | Enable or Disable Columnar Table Cache, default is false | true | +| spark.gluten.sql.columnar.broadcastExchange | Enable or Disable Columnar Broadcast Exchange, default is true | true | +| spark.gluten.sql.columnar.broadcastJoin | Enable or Disable Columnar BroadcastHashJoin, default is true | true | +| spark.gluten.sql.columnar.shuffle.codec | Set up the codec to be used for Columnar Shuffle. If this configuration is not set, will check the value of spark.io.compression.codec. 
By default, Gluten use software compression. Valid options for software compression are lz4, zstd. Valid options for QAT and IAA is gzip. | lz4 | +| spark.gluten.sql.columnar.shuffle.codecBackend | Enable using hardware accelerators for shuffle de/compression. Valid options are QAT and IAA. | | +| spark.gluten.sql.columnar.shuffle.compressionMode | Setting different compression mode in shuffle, Valid options are buffer and rowvector, buffer option compress each buffer of RowVector individually into one pre-allocated large buffer, rowvector option first copies each buffer of RowVector to a large buffer and then compress the entire buffer in one go. | buffer | +| spark.gluten.sql.columnar.shuffle.compression.threshold | If number of rows in a batch falls below this threshold, will copy all buffers into one buffer to compress. | 100 | +| spark.gluten.sql.columnar.shuffle.realloc.threshold | Set the threshold to dynamically adjust the size of shuffle split buffers. The size of each split buffer is recalculated for each incoming batch of data. If the new size deviates from the current partition buffer size by a factor outside the range of [1 - threshold, 1 + threshold], the split buffer will be re-allocated using the newly calculated size | 0.25 | +| spark.gluten.sql.columnar.shuffle.merge.threshold | Set the threshold control the minimum merged size. When a partition buffer is full, and the number of rows is below (`threshold * spark.gluten.sql.columnar.maxBatchSize`), it will be saved for merging. | 0.25 | +| spark.gluten.sql.columnar.numaBinding | Set up NUMABinding, default is false | true | +| spark.gluten.sql.columnar.coreRange | Set up the core range for NUMABinding, only works when numaBinding set to true.
The setting is based on the number of cores in your system. Use 72 cores as an example. | 0-17,36-53 |18-35,54-71 | +| spark.gluten.sql.native.bloomFilter | Enable or Disable native runtime bloom filter. | true | +| spark.gluten.sql.columnar.wholeStage.fallback.threshold | Configure the threshold for whether whole stage will fall back in AQE supported case by counting the number of ColumnarToRow & vanilla leaf node | \>= 1 | +| spark.gluten.sql.columnar.query.fallback.threshold | Configure the threshold for whether query will fall back by counting the number of ColumnarToRow & vanilla leaf node | \>= 1 | +| spark.gluten.sql.columnar.fallback.ignoreRowToColumnar | When true, the fallback policy ignores the RowToColumnar when counting fallback number. | true | +| spark.gluten.sql.columnar.fallback.preferColumnar | When true, the fallback policy prefers to use Gluten plan rather than vanilla Spark plan if the both of them contains ColumnarToRow and the vanilla Spark plan ColumnarToRow number is not smaller than Gluten plan. | true | +| spark.gluten.sql.columnar.maxBatchSize | Set the number of rows for the output batch. | 4096 | +| spark.gluten.shuffleWriter.bufferSize | Set the number of buffer rows for the shuffle writer | value of spark.gluten.sql.columnar.maxBatchSize | +| spark.gluten.loadLibFromJar | Controls whether to load dynamic link library from a packed jar for gluten/cpp. Not applicable to static build and clickhouse backend. | false | +| spark.gluten.sql.columnar.force.hashagg | Force to use hash agg to replace sort agg. | true | +| spark.gluten.sql.columnar.vanillaReaders | Enable vanilla spark's vectorized reader. Please note it may bring perf. overhead due to extra data transition. We recommend to disable it if most queries can be fully offloaded to gluten. | false | +| spark.gluten.expression.blacklist | A black list of expression to skip transform, multiple values separated by commas. 
| | +| spark.gluten.sql.columnar.fallback.expressions.threshold | Fall back filter/project if the height of expression tree reaches this threshold, considering Spark codegen can bring better performance for such case. | 50 | +| spark.gluten.sql.cartesianProductTransformerEnabled | Config to enable CartesianProductExecTransformer. | true | + | spark.gluten.sql.broadcastNestedLoopJoinTransformerEnabled | Config to enable BroadcastNestedLoopJoinExecTransformer. | true | + | spark.gluten.sql.cacheWholeStageTransformerContext | When true, `WholeStageTransformer` will cache the `WholeStageTransformerContext` when executing. It is used to get substrait plan node and native plan string. | false | + | spark.gluten.sql.injectNativePlanStringToExplain | When true, Gluten will inject native plan tree to explain string inside `WholeStageTransformerContext`. | false | ## Velox Parameters The following configurations are related to Velox settings. -| Parameters | Description | Recommend Setting | -|----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------| -| spark.gluten.sql.columnar.backend.velox.bloomFilter.expectedNumItems | The default number of expected items for the velox bloomfilter. | 1000000L | -| spark.gluten.sql.columnar.backend.velox.bloomFilter.numBits | The default number of bits to use for the velox bloom filter. 
| 8388608L | -| spark.gluten.sql.columnar.backend.velox.bloomFilter.maxNumBits | The max number of bits to use for the velox bloom filter. | 4194304L | +| Parameters | Description | Recommend Setting | +|----------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|-------------------| +| spark.gluten.sql.columnar.backend.velox.bloomFilter.expectedNumItems | The default number of expected items for the velox bloomfilter. | 1000000L | +| spark.gluten.sql.columnar.backend.velox.bloomFilter.numBits | The default number of bits to use for the velox bloom filter. | 8388608L | +| spark.gluten.sql.columnar.backend.velox.bloomFilter.maxNumBits | The max number of bits to use for the velox bloom filter. | 4194304L | + | spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled | Disables caching if false. File handle cache should be disabled if files are mutable, i.e. file content may change while file path stays the same. | | + | spark.gluten.sql.columnar.backend.velox.directorySizeGuess | Set the directory size guess for velox file scan. | | + | spark.gluten.sql.columnar.backend.velox.filePreloadThreshold | Set the file preload threshold for velox file scan. | | + | spark.gluten.sql.columnar.backend.velox.prefetchRowGroups | Set the prefetch row groups for velox file scan. | | + | spark.gluten.sql.columnar.backend.velox.loadQuantum | Set the load quantum for velox file scan. | | +| spark.gluten.sql.columnar.backend.velox.maxCoalescedDistanceBytes | Set the max coalesced distance bytes for velox file scan. | | +| spark.gluten.sql.columnar.backend.velox.maxCoalescedBytes | Set the max coalesced bytes for velox file scan. | | +| spark.gluten.sql.columnar.backend.velox.cachePrefetchMinPct | Set prefetch cache min pct for velox file scan. 
| | +| spark.gluten.velox.awsSdkLogLevel | Log granularity of AWS C++ SDK in velox. | FATAL | +| spark.gluten.sql.columnar.backend.velox.orc.scan.enabled | Enable velox orc scan. If disabled, vanilla spark orc scan will be used. | true | +| spark.gluten.sql.complexType.scan.fallback.enabled | Force fallback for complex type scan, including struct, map, array. | true | -Below is an example for spark-default.conf: ``` ##### Columnar Process Configuration spark.plugins io.glutenproject.GlutenPlugin spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager -spark.driver.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar -spark.executor.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar +spark.driver.extraClassPath ${GLUTEN_HOME}/package/target/gluten-XXX.jar +spark.executor.extraClassPath ${GLUTEN_HOME}/package/target/gluten-XXX.jar ###### ``` diff --git a/docs/_config.yml b/docs/_config.yml index c7afd9fcf45d..0d42e06f4fd1 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -16,7 +16,7 @@ remote_theme: pmarsceill/just-the-docs aux_links: "Gluten on Github": - - "//github.com/oap-project/gluten" + - "//github.com/apache/incubator-gluten" plugins: - jekyll-optional-front-matter # GitHub Pages diff --git a/docs/contact-us.md b/docs/contact-us.md index d80a6a18fab9..7c7540401c0a 100644 --- a/docs/contact-us.md +++ b/docs/contact-us.md @@ -32,4 +32,4 @@ If you need any help or have questions on this product, please contact us: ## Issues and Discussions We use github to track bugs, feature requests, and answer questions. File an -[issue](https://github.com/oap-project/gluten/issues) for a bug or feature request. +[issue](https://github.com/apache/incubator-gluten/issues) for a bug or feature request. 
diff --git a/docs/developers/HowTo.md b/docs/developers/HowTo.md
index 27ede7fe0415..587e1b9a28aa 100644
--- a/docs/developers/HowTo.md
+++ b/docs/developers/HowTo.md
@@ -115,7 +115,7 @@ gdb gluten_home/cpp/build/releases/libgluten.so 'core-Executor task l-2000883-16
 Now, both Parquet and DWRF format files are supported, related scripts and files are under the directory of `gluten_home/backends-velox/workload/tpch`.
 The file `README.md` under `gluten_home/backends-velox/workload/tpch` offers some useful help but it's still not enough and exact.

-One way of run TPC-H test is to run velox-be by workflow, you can refer to [velox_be.yml](https://github.com/oap-project/gluten/blob/main/.github/workflows/velox_be.yml#L90)
+One way of run TPC-H test is to run velox-be by workflow, you can refer to [velox_be.yml](https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_be.yml#L90)

 Here will explain how to run TPC-H on Velox backend with the Parquet file format.
 1. First step, prepare the datasets, you have two choices.
diff --git a/docs/developers/MicroBenchmarks.md b/docs/developers/MicroBenchmarks.md
index 1fa6d79488de..5c83c76dbba7 100644
--- a/docs/developers/MicroBenchmarks.md
+++ b/docs/developers/MicroBenchmarks.md
@@ -58,9 +58,9 @@ Run micro benchmark with the generated files as input. You need to specify the *
 ```shell
 cd /path/to/gluten/cpp/build/velox/benchmarks
 ./generic_benchmark \
---plan /home/sparkuser/github/oap-project/gluten/backends-velox/generated-native-benchmark/example.json \
---data /home/sparkuser/github/oap-project/gluten/backends-velox/generated-native-benchmark/example_orders/part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet,\
-/home/sparkuser/github/oap-project/gluten/backends-velox/generated-native-benchmark/example_lineitem/part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet \
+--plan /home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example.json \
+--data /home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_orders/part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet,\
+/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_lineitem/part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet \
 --threads 1 --iterations 1 --noprint-result --benchmark_filter=InputFromBatchStream
 ```

diff --git a/docs/developers/NewToGluten.md b/docs/developers/NewToGluten.md
index 2cf67dcf625c..04074d4e6b9a 100644
--- a/docs/developers/NewToGluten.md
+++ b/docs/developers/NewToGluten.md
@@ -360,7 +360,7 @@ wait to attach....
# Run TPC-H and TPC-DS We supply `/tools/gluten-it` to execute these queries -Refer to [velox_be.yml](https://github.com/oap-project/gluten/blob/main/.github/workflows/velox_be.yml) +Refer to [velox_be.yml](https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_be.yml) # Run gluten+velox on clean machine @@ -371,7 +371,7 @@ spark-shell --name run_gluten \ --conf spark.plugins=io.glutenproject.GlutenPlugin \ --conf spark.memory.offHeap.enabled=true \ --conf spark.memory.offHeap.size=20g \ - --jars https://github.com/oap-project/gluten/releases/download/v1.0.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04_x86_64-1.0.0.jar \ + --jars https://github.com/apache/incubator-gluten/releases/download/v1.0.0/gluten-velox-bundle-spark3.2_2.12-ubuntu_20.04_x86_64-1.0.0.jar \ --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager ``` diff --git a/docs/developers/SubstraitModifications.md b/docs/developers/SubstraitModifications.md index 1d97d58c7b63..a0080aa8a488 100644 --- a/docs/developers/SubstraitModifications.md +++ b/docs/developers/SubstraitModifications.md @@ -17,19 +17,19 @@ alternatives like `AdvancedExtension` could be considered. ## Modifications to algebra.proto -* Added `JsonReadOptions` and `TextReadOptions` in `FileOrFiles`([#1584](https://github.com/oap-project/gluten/pull/1584)). -* Changed join type `JOIN_TYPE_SEMI` to `JOIN_TYPE_LEFT_SEMI` and `JOIN_TYPE_RIGHT_SEMI`([#408](https://github.com/oap-project/gluten/pull/408)). +* Added `JsonReadOptions` and `TextReadOptions` in `FileOrFiles`([#1584](https://github.com/apache/incubator-gluten/pull/1584)). +* Changed join type `JOIN_TYPE_SEMI` to `JOIN_TYPE_LEFT_SEMI` and `JOIN_TYPE_RIGHT_SEMI`([#408](https://github.com/apache/incubator-gluten/pull/408)). 
* Added `WindowRel`, added `column_name` and `window_type` in `WindowFunction`, -changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and `Unbounded_Following`, and added WindowType([#485](https://github.com/oap-project/gluten/pull/485)). -* Added `output_schema` in RelRoot([#1901](https://github.com/oap-project/gluten/pull/1901)). -* Added `ExpandRel`([#1361](https://github.com/oap-project/gluten/pull/1361)). -* Added `GenerateRel`([#574](https://github.com/oap-project/gluten/pull/574)). -* Added `PartitionColumn` in `LocalFiles`([#2405](https://github.com/oap-project/gluten/pull/2405)). -* Added `WriteRel` ([#3690](https://github.com/oap-project/gluten/pull/3690)). +changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and `Unbounded_Following`, and added WindowType([#485](https://github.com/apache/incubator-gluten/pull/485)). +* Added `output_schema` in RelRoot([#1901](https://github.com/apache/incubator-gluten/pull/1901)). +* Added `ExpandRel`([#1361](https://github.com/apache/incubator-gluten/pull/1361)). +* Added `GenerateRel`([#574](https://github.com/apache/incubator-gluten/pull/574)). +* Added `PartitionColumn` in `LocalFiles`([#2405](https://github.com/apache/incubator-gluten/pull/2405)). +* Added `WriteRel` ([#3690](https://github.com/apache/incubator-gluten/pull/3690)). ## Modifications to type.proto -* Added `Nothing` in `Type`([#791](https://github.com/oap-project/gluten/pull/791)). -* Added `names` in `Struct`([#1878](https://github.com/oap-project/gluten/pull/1878)). -* Added `PartitionColumns` in `NamedStruct`([#320](https://github.com/oap-project/gluten/pull/320)). -* Remove `PartitionColumns` and add `column_types` in `NamedStruct`([#2405](https://github.com/oap-project/gluten/pull/2405)). +* Added `Nothing` in `Type`([#791](https://github.com/apache/incubator-gluten/pull/791)). +* Added `names` in `Struct`([#1878](https://github.com/apache/incubator-gluten/pull/1878)). 
+* Added `PartitionColumns` in `NamedStruct`([#320](https://github.com/apache/incubator-gluten/pull/320)). +* Remove `PartitionColumns` and add `column_types` in `NamedStruct`([#2405](https://github.com/apache/incubator-gluten/pull/2405)). diff --git a/docs/developers/docker_centos7.md b/docs/developers/docker_centos7.md index 6ecc38c4cf4d..2594a8d1ff43 100644 --- a/docs/developers/docker_centos7.md +++ b/docs/developers/docker_centos7.md @@ -43,7 +43,7 @@ ln -s /usr/bin/cmake3 /usr/local/bin/cmake export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk export PATH=$JAVA_HOME/bin:$PATH -git clone https://github.com/oap-project/gluten.git +git clone https://github.com/apache/incubator-gluten.git cd gluten # To access HDFS or S3, you need to add the parameters `--enable_hdfs=ON` and `--enable_s3=ON` diff --git a/docs/developers/docker_centos8.md b/docs/developers/docker_centos8.md index 530ef295772c..dd94413bfc91 100755 --- a/docs/developers/docker_centos8.md +++ b/docs/developers/docker_centos8.md @@ -41,7 +41,7 @@ mv apache-maven-3.8.8 /usr/lib/maven export MAVEN_HOME=/usr/lib/maven export PATH=${PATH}:${MAVEN_HOME}/bin -git clone https://github.com/oap-project/gluten.git +git clone https://github.com/apache/incubator-gluten.git cd gluten # To access HDFS or S3, you need to add the parameters `--enable_hdfs=ON` and `--enable_s3=ON` diff --git a/docs/developers/docker_ubuntu22.04.md b/docs/developers/docker_ubuntu22.04.md index e1c03b45f541..478e2f792f27 100644 --- a/docs/developers/docker_ubuntu22.04.md +++ b/docs/developers/docker_ubuntu22.04.md @@ -45,7 +45,7 @@ dpkg --configure -a #export https_proxy=xxxx #clone gluten -git clone https://github.com/oap-project/gluten.git +git clone https://github.com/apache/incubator-gluten.git cd gluten/ #config maven proxy diff --git a/docs/get-started/ClickHouse.md b/docs/get-started/ClickHouse.md index 143cfb2514de..cbf9e44b2337 100644 --- a/docs/get-started/ClickHouse.md +++ b/docs/get-started/ClickHouse.md @@ -43,7 +43,7 @@ You 
need to install the following software manually: Then, get Gluten code: ``` - git clone https://github.com/oap-project/gluten.git + git clone https://github.com/apache/incubator-gluten.git ``` #### Setup ClickHouse backend development environment @@ -105,7 +105,7 @@ Otherwise, do: In case you don't want a develop environment, you can use the following command to compile ClickHouse backend directly: ``` -git clone https://github.com/oap-project/gluten.git +git clone https://github.com/apache/incubator-gluten.git cd gluten bash ./ep/build-clickhouse/src/build_clickhouse.sh ``` @@ -122,7 +122,7 @@ The prerequisites are the same as the one mentioned above. Compile Gluten with C - for Spark 3.2.2 ``` - git clone https://github.com/oap-project/gluten.git + git clone https://github.com/apache/incubator-gluten.git cd gluten/ export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=2g" mvn clean install -Pbackends-clickhouse -Phadoop-2.7.4 -Pspark-3.2 -Dhadoop.version=2.8.5 -DskipTests -Dcheckstyle.skip @@ -132,7 +132,7 @@ The prerequisites are the same as the one mentioned above. 
Compile Gluten with C - for Spark 3.3.1 ``` - git clone https://github.com/oap-project/gluten.git + git clone https://github.com/apache/incubator-gluten.git cd gluten/ export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=2g" mvn clean install -Pbackends-clickhouse -Phadoop-2.7.4 -Pspark-3.3 -Dhadoop.version=2.8.5 -DskipTests -Dcheckstyle.skip diff --git a/docs/get-started/Velox.md b/docs/get-started/Velox.md index be6c00a549e7..79ea501da5e2 100644 --- a/docs/get-started/Velox.md +++ b/docs/get-started/Velox.md @@ -50,7 +50,7 @@ export PATH=$JAVA_HOME/bin:$PATH ## config maven, like proxy in ~/.m2/settings.xml ## fetch gluten code -git clone https://github.com/oap-project/gluten.git +git clone https://github.com/apache/incubator-gluten.git ``` # Build Gluten with Velox Backend @@ -152,7 +152,7 @@ cp /path/to/hdfs-client.xml hdfs-client.xml One typical deployment on Spark/HDFS cluster is to enable [short-circuit reading](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html). Short-circuit reads provide a substantial performance boost to many applications. -By default libhdfs3 does not set the default hdfs domain socket path to support HDFS short-circuit read. If this feature is required in HDFS setup, users may need to setup the domain socket path correctly by patching the libhdfs3 source code or by setting the correct config environment. In Gluten the short-circuit domain socket path is set to "/var/lib/hadoop-hdfs/dn_socket" in [build_velox.sh](https://github.com/oap-project/gluten/blob/main/ep/build-velox/src/build_velox.sh) So we need to make sure the folder existed and user has write access as below script. +By default libhdfs3 does not set the default hdfs domain socket path to support HDFS short-circuit read. If this feature is required in HDFS setup, users may need to setup the domain socket path correctly by patching the libhdfs3 source code or by setting the correct config environment. 
In Gluten the short-circuit domain socket path is set to "/var/lib/hadoop-hdfs/dn_socket" in [build_velox.sh](https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/build_velox.sh) So we need to make sure the folder existed and user has write access as below script. ``` sudo mkdir -p /var/lib/hadoop-hdfs/ @@ -299,7 +299,7 @@ Spark3.3 has 387 functions in total. ~240 are commonly used. Velox's functions h To identify what can be offloaded in a query and detailed fallback reasons, user can follow below steps to retrieve corresponding logs. ``` -1) Enable Gluten by proper [configuration](https://github.com/oap-project/gluten/blob/main/docs/Configuration.md). +1) Enable Gluten by proper [configuration](https://github.com/apache/incubator-gluten/blob/main/docs/Configuration.md). 2) Disable Spark AQE to trigger plan validation in Gluten spark.sql.adaptive.enabled = false diff --git a/docs/index.md b/docs/index.md index 9e5cc243aa6e..fc66717bc0e9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -36,7 +36,7 @@ The basic rule of Gluten's design is that we would reuse spark's whole control f ## 1.3 Target User Gluten's target user is anyone who wants to accelerate SparkSQL fundamentally. As a plugin to Spark, Gluten doesn't require any change for dataframe API or SQL query, but only requires user to make correct configuration. -See Gluten configuration properties [here](https://github.com/oap-project/gluten/blob/main/docs/Configuration.md). +See Gluten configuration properties [here](https://github.com/apache/incubator-gluten/blob/main/docs/Configuration.md). ## 1.4 References diff --git a/docs/release.md b/docs/release.md index f8930ae43a6f..a3f20bde857e 100644 --- a/docs/release.md +++ b/docs/release.md @@ -4,11 +4,12 @@ title: Gluten Release nav_order: 11 --- -[Gluten](https://github.com/oap-project/gluten) is a plugin for Apache Spark to double SparkSQL's performance. 
+[Gluten](https://github.com/apache/incubator-gluten) is a plugin for Apache Spark to double SparkSQL's performance. ## Latest release for velox backend -* [Gluten-1.1.0](https://github.com/oap-project/gluten/releases/tag/v1.1.0) (Nov. 30 2023) +* [Gluten-1.1.1](https://github.com/apache/incubator-gluten/releases/tag/v1.1.1) (Mar. 2 2024) ## Archived releases -* [Gluten-1.0.0](https://github.com/oap-project/gluten/releases/tag/v1.0.0) (Jul. 13 2023) -* [Gluten-0.5.0](https://github.com/oap-project/gluten/releases/tag/0.5.0) (Apr. 7 2023). +* [Gluten-1.1.0](https://github.com/apache/incubator-gluten/releases/tag/v1.1.0) (Nov. 30 2023) +* [Gluten-1.0.0](https://github.com/apache/incubator-gluten/releases/tag/v1.0.0) (Jul. 13 2023) +* [Gluten-0.5.0](https://github.com/apache/incubator-gluten/releases/tag/0.5.0) (Apr. 7 2023) diff --git a/docs/velox-backend-limitations.md b/docs/velox-backend-limitations.md index fb3b0f16f677..7b03f3b2f12c 100644 --- a/docs/velox-backend-limitations.md +++ b/docs/velox-backend-limitations.md @@ -9,9 +9,9 @@ must fall back to vanilla spark, etc. ### Override of Spark classes (For Spark3.2 and Spark3.3) Gluten avoids to modify Spark's existing code and use Spark APIs if possible. However, some APIs aren't exposed in Vanilla spark and we have to copy the Spark file and do the hardcode changes. The list of override classes can be found as ignoreClasses in package/pom.xml . If you use customized Spark, you may check if the files are modified in your spark, otherwise your changes will be overrided. -So you need to ensure preferentially load the Gluten jar to overwrite the jar of vanilla spark. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/oap-project/gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer). +So you need to ensure preferentially load the Gluten jar to overwrite the jar of vanilla spark. 
Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer). -If not officially supported spark3.2/3.3 version is used, NoSuchMethodError can be thrown at runtime. More details see [issue-4514](https://github.com/oap-project/gluten/issues/4514). +If not officially supported spark3.2/3.3 version is used, NoSuchMethodError can be thrown at runtime. More details see [issue-4514](https://github.com/apache/incubator-gluten/issues/4514). ### Fallbacks Except the unsupported operators, functions, file formats, data sources listed in , there are some known cases also fall back to Vanilla Spark. diff --git a/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala b/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala index ff231585b07b..e3478720a4ad 100644 --- a/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala +++ b/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala @@ -135,7 +135,8 @@ object ColumnarOverrideRules { case ColumnarToRowExec(DummyColumnarOutputExec(_)) => false case _ => throw new IllegalStateException( - "This should not happen. Please leave a issue at https://github.com/oap-project/gluten.") + "This should not happen. Please leave a issue at" + + " https://github.com/apache/incubator-gluten.") } def unwrap(plan: SparkPlan): SparkPlan = plan match { @@ -145,7 +146,8 @@ object ColumnarOverrideRules { case ColumnarToRowExec(DummyColumnarOutputExec(child)) => child case _ => throw new IllegalStateException( - "This should not happen. Please leave a issue at https://github.com/oap-project/gluten.") + "This should not happen. 
Please leave a issue at" + + " https://github.com/apache/incubator-gluten.") } } } diff --git a/mkdocs.yml b/mkdocs.yml index bdf856e07df6..1c03a1ce600a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,7 +15,7 @@ site_name: Gluten repo_name: 'Fork on GitHub ' -repo_url: "https://github.com/oap-project/gluten.git" +repo_url: "https://github.com/apache/incubator-gluten.git" edit_uri: "" diff --git a/pom.xml b/pom.xml index 379d5434bba0..be1a3544d9a6 100644 --- a/pom.xml +++ b/pom.xml @@ -20,7 +20,7 @@ pom Gluten Parent Pom - https://github.com/oap-project/gluten.git + https://github.com/apache/incubator-gluten.git diff --git a/shims/common/src/main/scala/io/glutenproject/GlutenConfig.scala b/shims/common/src/main/scala/io/glutenproject/GlutenConfig.scala index bba16aa8bda3..87f5fc4f0b0b 100644 --- a/shims/common/src/main/scala/io/glutenproject/GlutenConfig.scala +++ b/shims/common/src/main/scala/io/glutenproject/GlutenConfig.scala @@ -1602,28 +1602,28 @@ object GlutenConfig { val DIRECTORY_SIZE_GUESS = buildStaticConf("spark.gluten.sql.columnar.backend.velox.directorySizeGuess") .internal() - .doc(" Set the directory size guess for velox file scan") + .doc("Set the directory size guess for velox file scan") .intConf .createOptional val FILE_PRELOAD_THRESHOLD = buildStaticConf("spark.gluten.sql.columnar.backend.velox.filePreloadThreshold") .internal() - .doc(" Set the file preload threshold for velox file scan") + .doc("Set the file preload threshold for velox file scan") .intConf .createOptional val PREFETCH_ROW_GROUPS = buildStaticConf("spark.gluten.sql.columnar.backend.velox.prefetchRowGroups") .internal() - .doc(" Set the prefetch row groups for velox file scan") + .doc("Set the prefetch row groups for velox file scan") .intConf .createOptional val LOAD_QUANTUM = buildStaticConf("spark.gluten.sql.columnar.backend.velox.loadQuantum") .internal() - .doc(" Set the load quantum for velox file scan") + .doc("Set the load quantum for velox file scan") .intConf 
.createOptional @@ -1637,14 +1637,14 @@ object GlutenConfig { val MAX_COALESCED_BYTES = buildStaticConf("spark.gluten.sql.columnar.backend.velox.maxCoalescedBytes") .internal() - .doc(" Set the max coalesced bytes for velox file scan") + .doc("Set the max coalesced bytes for velox file scan") .intConf .createOptional val CACHE_PREFETCH_MINPCT = buildStaticConf("spark.gluten.sql.columnar.backend.velox.cachePrefetchMinPct") .internal() - .doc(" Set prefetch cache min pct for velox file scan") + .doc("Set prefetch cache min pct for velox file scan") .intConf .createOptional @@ -1658,7 +1658,7 @@ object GlutenConfig { val VELOX_ORC_SCAN_ENABLED = buildStaticConf("spark.gluten.sql.columnar.backend.velox.orc.scan.enabled") .internal() - .doc(" Enable velox orc scan. If disabled, vanilla spark orc scan will be used.") + .doc("Enable velox orc scan. If disabled, vanilla spark orc scan will be used.") .booleanConf .createWithDefault(true) diff --git a/tools/gluten-it/README.md b/tools/gluten-it/README.md index 39e0617023aa..59ae55e14f18 100644 --- a/tools/gluten-it/README.md +++ b/tools/gluten-it/README.md @@ -6,13 +6,13 @@ The project makes it easy to test Gluten build locally. Gluten is a native Spark SQL implementation as a standard Spark plug-in. -https://github.com/oap-project/gluten +https://github.com/apache/incubator-gluten ## Getting Started ### 1. Install Gluten in your local machine -See official Gluten build guidance https://github.com/oap-project/gluten#how-to-use-gluten +See official Gluten build guidance https://github.com/apache/incubator-gluten#how-to-use-gluten ### 2. 
Install and run gluten-it with Spark version diff --git a/tools/gluten-te/centos/defaults.conf b/tools/gluten-te/centos/defaults.conf index 19e1b238a4e9..1213ff66d6f7 100755 --- a/tools/gluten-te/centos/defaults.conf +++ b/tools/gluten-te/centos/defaults.conf @@ -11,7 +11,7 @@ DEFAULT_NON_INTERACTIVE=OFF DEFAULT_PRESERVE_CONTAINER=OFF # The codes will be used in build -DEFAULT_GLUTEN_REPO=https://github.com/oap-project/gluten.git +DEFAULT_GLUTEN_REPO=https://github.com/apache/incubator-gluten.git DEFAULT_GLUTEN_BRANCH=main # Create debug build diff --git a/tools/gluten-te/ubuntu/README.md b/tools/gluten-te/ubuntu/README.md index 328b5108e73d..f617d8368675 100644 --- a/tools/gluten-te/ubuntu/README.md +++ b/tools/gluten-te/ubuntu/README.md @@ -1,6 +1,6 @@ # Portable Test Environment of Gluten (gluten-te) -Build and run [gluten](https://github.com/oap-project/gluten) and [gluten-it](https://github.com/oap-project/gluten/tree/main/tools/gluten-it) in a portable docker container, from scratch. +Build and run [gluten](https://github.com/apache/incubator-gluten) and [gluten-it](https://github.com/apache/incubator-gluten/tree/main/tools/gluten-it) in a portable docker container, from scratch. # Prerequisites @@ -9,7 +9,7 @@ Only Linux and MacOS are currently supported. 
Before running the scripts, make s # Getting Started (Build Gluten code, Velox backend) ```sh -git clone -b main https://github.com/oap-project/gluten.git gluten # Gluten main code +git clone -b main https://github.com/apache/incubator-gluten.git gluten # Gluten main code export HTTP_PROXY_HOST=myproxy.example.com # in case you are behind http proxy export HTTP_PROXY_PORT=55555 # in case you are behind http proxy @@ -21,7 +21,7 @@ tools/gluten-te/ubuntu/examples/buildhere-veloxbe/run.sh # Getting Started (TPC, Velox backend) ```sh -git clone -b main https://github.com/oap-project/gluten.git gluten # Gluten main code +git clone -b main https://github.com/apache/incubator-gluten.git gluten # Gluten main code export HTTP_PROXY_HOST=myproxy.example.com # in case you are behind http proxy export HTTP_PROXY_PORT=55555 # in case you are behind http proxy @@ -32,7 +32,7 @@ cd gluten/gluten-te # Configurations -See the [config file](https://github.com/oap-project/gluten/blob/main/tools/gluten-te/ubuntu/defaults.conf). You can modify the file to configure gluten-te, or pass env variables during running the scripts. +See the [config file](https://github.com/apache/incubator-gluten/blob/main/tools/gluten-te/ubuntu/defaults.conf). You can modify the file to configure gluten-te, or pass env variables during running the scripts. # Example Usages diff --git a/tools/gluten-te/ubuntu/defaults.conf b/tools/gluten-te/ubuntu/defaults.conf index 4f4904ad6e5d..2656b1cfa065 100644 --- a/tools/gluten-te/ubuntu/defaults.conf +++ b/tools/gluten-te/ubuntu/defaults.conf @@ -11,7 +11,7 @@ DEFAULT_NON_INTERACTIVE=OFF DEFAULT_PRESERVE_CONTAINER=OFF # The codes will be used in build -DEFAULT_GLUTEN_REPO=https://github.com/oap-project/gluten.git +DEFAULT_GLUTEN_REPO=https://github.com/apache/incubator-gluten.git DEFAULT_GLUTEN_BRANCH=main # Create debug build