From 8bebab789aae65a24c7b1f6773a987a93c7a160c Mon Sep 17 00:00:00 2001
From: BInwei Yang
Date: Mon, 27 Nov 2023 18:58:40 -0800
Subject: [PATCH] [VL] Update velox-backend-limitations.md (#3639)

---
 docs/velox-backend-limitations.md      | 133 +++++++++++--------------
 docs/velox-backend-support-progress.md |   4 +-
 2 files changed, 61 insertions(+), 76 deletions(-)

diff --git a/docs/velox-backend-limitations.md b/docs/velox-backend-limitations.md
index 9a8b1348cbb2..a17075e7d009 100644
--- a/docs/velox-backend-limitations.md
+++ b/docs/velox-backend-limitations.md
@@ -6,6 +6,12 @@ nav_order: 5
 ---
 This document describes the limitations of the velox backend by listing some known cases where an exception will be thrown, gluten behaves incompatibly with spark, or a certain plan's execution must fall back to vanilla spark, etc.
 
+### Override of Spark classes
+Gluten avoids modifying Spark's existing code and uses Spark APIs where possible. However, some APIs aren't exposed in vanilla Spark, so we have to copy the relevant Spark files and hardcode our changes into them. The list of overridden classes can be found as `ignoreClasses` in package/pom.xml. If you use a customized Spark, check whether these files are modified in your distribution; otherwise your changes will be overridden.
+
+You therefore need to ensure the Gluten jar is loaded with priority so that it overrides the classes in the vanilla Spark jar. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/oap-project/gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer).
+
 ### Runtime BloomFilter
 Velox BloomFilter's implementation is different from Spark's. So if `might_contain` falls back, but `bloom_filter_agg` is offloaded to velox, an exception will be thrown.
@@ -33,79 +39,76 @@ java.io.IOException: Unexpected Bloom filter version number (512)
 Set the gluten config `spark.gluten.sql.native.bloomFilter=false` to fall back to the vanilla bloom filter; you can also disable the runtime filter by setting the spark config `spark.sql.optimizer.runtime.bloomFilter.enabled=false`.
 
-### ANSI (fallback behavior)
+### Fallbacks
+In addition to the unsupported operators, functions, file formats, and data sources, the known cases listed below also fall back to vanilla Spark.
 
+#### ANSI
 Gluten currently doesn't support ANSI mode. If ANSI is enabled, the Spark plan's execution will always fall back to vanilla Spark.
 
-### Case Sensitive mode (incompatible behavior)
-
+#### Case Sensitive mode
 Gluten only supports Spark's default case-insensitive mode. If case-sensitive mode is enabled, the user may get incorrect results.
 
-### Spark's columnar reading (fatal error)
-
-If the user enables Spark's columnar reading, error can occur due to Spark's columnar vector is not compatible with
-Gluten's.
+#### Lookaround pattern for regexp functions
+In velox, the lookaround (lookahead/lookbehind) pattern is not supported in RE2-based implementations of Spark functions
+such as `rlike`, `regexp_extract`, etc.
 
-### JSON functions (incompatible behavior)
+#### FileSource format
+Currently, Gluten fully supports only the parquet file format and partially supports ORC. If another format is used, the scan operator falls back to vanilla Spark.
 
-Velox only supports double quotes surrounded strings, not single quotes, in JSON data. If single quotes are used, gluten will produce incorrect result.
+#### Partitioned Table Scan
+Gluten only supports a partitioned table scan when the file path contains the partition info; otherwise it falls back to vanilla Spark.
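+
+As a minimal sketch of the supported layout (path and column names here are hypothetical), `partitionBy` encodes the partition value into the file path, e.g. `/tmp/events/dt=2023-11-27/part-00000.parquet`, so such a scan can stay native:
+```scala
+// Writing with partitionBy puts `dt` into the directory name, so the path carries the partition info.
+spark.range(10).selectExpr("id", "'2023-11-27' as dt")
+  .write.mode("overwrite").partitionBy("dt").parquet("/tmp/events")
+// The partition value is recovered from the file path during the scan.
+spark.read.parquet("/tmp/events").where("dt = '2023-11-27'").show()
+```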
-### Lookaround pattern for regexp functions (fallback behavior)
-
-In velox, lookaround (lookahead/lookbehind) pattern is not supported in RE2-based implementations for Spark functions,
-such as `rlike`, `regexp_extract`, etc.
-
-### FileSource format (fallback behavior)
-Currently, Gluten only fully supports parquet file format. If other format is used, scan operator will fall back to vanilla spark.
+### Incompatible behavior
+In certain cases, Gluten's result may differ from vanilla Spark's.
 
-### Parquet read conf (incompatible behavior)
+#### JSON functions
+Velox only supports double-quoted strings in JSON data, not single-quoted ones. If single quotes are used, Gluten will produce incorrect results.
 
+#### Parquet read conf
 Gluten supports `spark.files.ignoreCorruptFiles` and `spark.files.ignoreMissingFiles` with the default value false; if set to true, the behavior is the same as false.
 Gluten ignores `spark.sql.parquet.datetimeRebaseModeInRead`: it only returns what is written in the parquet file and does not consider the difference between the legacy
 hybrid (Julian Gregorian) calendar and the Proleptic Gregorian calendar. The result may differ from vanilla Spark.
 
-### Parquet write conf (incompatible behavior)
-
+#### Parquet write conf
 Spark has the `spark.sql.parquet.datetimeRebaseModeInWrite` config to decide whether the legacy hybrid (Julian + Gregorian) calendar or the Proleptic Gregorian
 calendar should be used during parquet writing for dates/timestamps. If the parquet file to read was written by Spark with this config as true, Velox's
 TableScan will output a different result when reading it back.
 
-### Partitioned Table Scan (fallback behavior)
-Gluten only support the partitioned table scan when the file path contain the partition info, otherwise will fall back to vanilla spark.
-
-### NaN support (incompatible behavior)
-Velox does NOT support NaN. So unexpected result can be obtained for a few cases, e.g., comparing a number with NaN.
-
-### File compression codec (exception)
-
-Some compression codecs are not supported in Velox on certain file format.
-Exception occurs when Velox TableScan is used to read files with unsupported compression codec.
-
-| File Format | none | zlib | zstd | snappy | lzo | lz4 | gzip |
-|-------------|------|------|------|--------|-----|-----|------|
-| Parquet     | Y    | N    | Y    | Y      | N   | N   | Y    |
-| DWRF        | Y    | Y    | Y    | Y      | Y   | Y   | N    |
-
-### Native Write
-
-#### Offload native write to velox
-
-We implemented write support by overriding the following vanilla Spark classes. And you need to ensure preferentially load the Gluten jar to overwrite the jar of vanilla spark. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/oap-project/gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer). It should be noted that if the user also modifies the following overriding classes, the user's changes may be overwritten.
+#### Partition write
+
+Gluten only supports static partition writes and does not support dynamic partition writes.
+```scala
+spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
+spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d=2) SELECT 3 as e")
+```
+Gluten does not support dynamic partition writes or bucketed writes; an exception may be raised if you use them, e.g.,
+
+```scala
+// `f` is assumed to be a java.io.File pointing at the output directory.
+spark.range(100).selectExpr("id as c1", "id % 7 as p")
+  .write
+  .format("parquet")
+  .partitionBy("p")
+  .save(f.getCanonicalPath)
+```
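+
+A bucketed write hits the same limitation. As an illustrative sketch (the table name is hypothetical; Spark's `bucketBy` requires `saveAsTable`):
+```scala
+// Bucketed writes are not offloaded either and may likewise raise an exception.
+spark.range(100).selectExpr("id as c1", "id % 7 as b")
+  .write
+  .format("parquet")
+  .bucketBy(4, "b")
+  .saveAsTable("velox_bucketed")
+```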
+
+#### CTAS write
+
+Velox does not support create table as select. It may raise an exception, e.g.,
+```scala
+spark.range(100).toDF("id")
+  .write
+  .format("parquet")
+  .saveAsTable("velox_ctas")
+```
-```
-./shims/spark32/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
-./shims/spark32/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
-```
+#### NaN support
+Velox does NOT support NaN. So unexpected results can be obtained in a few cases, e.g., comparing a number with NaN.
 
-### Velox Parquet Write
-#### Configuration (incompatible behavior)
+
+#### Configuration
 
 Parquet write only supports three configs; others will not take effect.
@@ -119,45 +122,29 @@ Parquet write only support three configs, other will not take effect.
 - sql conf: `spark.gluten.sql.native.parquet.write.blockRows`
 - option: `parquet.block.rows`
 
-#### Static partition write
-Velox exclusively supports static partition writes and does not support dynamic partition writes.
-```scala
-spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
-spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d=2) SELECT 3 as e")
-```
-
-#### Write a dynamic partitioned or bucketed table (exception)
-
-Velox does not support dynamic partition write and bucket write, e.g.,
-
-```scala
-spark.range(100).selectExpr("id as c1", "id % 7 as p")
-  .write
-  .format("parquet")
-  .partitionBy("p")
-  .save(f.getCanonicalPath)
-```
+### Fatal error caused by Spark's columnar reading
+If the user enables Spark's columnar reading, errors can occur because Spark's columnar vector is not compatible with
+Gluten's.
 
-#### CTAS (exception)
-
-Velox does not create table as select, e.g.,
+### Exception caused by file compression codec
+Some compression codecs are not supported in Velox on certain file formats.
+An exception occurs when Velox TableScan is used to read files with an unsupported compression codec.
 
-```scala
-spark.range(100).toDF("id")
-  .write
-  .format("parquet")
-  .saveAsTable("velox_ctas")
-```
+| File Format | none | zlib | zstd | snappy | lzo | lz4 | gzip |
+|-------------|------|------|------|--------|-----|-----|------|
+| Parquet     | Y    | N    | Y    | Y      | N   | N   | Y    |
+| DWRF        | Y    | Y    | Y    | Y      | Y   | Y   | N    |
 
 ### Spill
 
 `OutOfMemoryException` may still be triggered within the current implementation of the spill-to-disk feature when the shuffle partition number is set large. When this happens, please try to reduce the partition number to get rid of the OOM.
 
-### TableScan on data types
+### Unsupported data types in ParquetScan
 
-- Byte type (fallback behavior)
+- Byte type causes fallback to vanilla Spark
 - Timestamp type
 
   Only reading with INT96 and dictionary encoding is supported. When reading INT64-represented millisecond/microsecond timestamps, or INT96-represented timestamps with other encodings, exceptions can occur.
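+
+  As a minimal illustration of a safe round trip (the path here is hypothetical), Spark's `spark.sql.parquet.outputTimestampType` setting can force INT96 timestamps on write, which should keep the files readable by Velox's scan:
+```scala
+// Write timestamps as INT96, the representation Velox's ParquetScan can read.
+spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
+spark.sql("SELECT current_timestamp() AS ts").write.mode("overwrite").parquet("/tmp/ts_int96")
+// Reading the INT96-encoded timestamps back can then stay on the native path.
+spark.read.parquet("/tmp/ts_int96").show()
+```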
diff --git a/docs/velox-backend-support-progress.md b/docs/velox-backend-support-progress.md
index a05df74a4024..7fabddba788a 100644
--- a/docs/velox-backend-support-progress.md
+++ b/docs/velox-backend-support-progress.md
@@ -9,9 +9,7 @@ Gluten is still in active development. Here is a list of supported operators and
 Since the same function may have different semantics between Presto and Spark, Velox implements the functions in the Presto category; if we note a different semantics from Spark, then the function is implemented in the Spark category. So Gluten will first use Velox's Spark category; if a function isn't implemented there, it then refers to the Presto category.
 
-The total supported functions' number for [Spark3.3 is 387](https://spark.apache.org/docs/latest/api/sql/) and for [Velox is 204](https://facebookincubator.github.io/velox/functions/coverage.html).
-Gluten supported frequently used 164, shown as below picture.
-![support](./image/support.png)
+The total number of supported functions for [Spark 3.3 is 387](https://spark.apache.org/docs/latest/api/sql/); Gluten now supports 189 of them.
 
 | Value        | Description                                                                                 |
 |--------------|---------------------------------------------------------------------------------------------|