[VL] Update velox-backend-limitations.md (apache#3639)
FelixYBW authored Nov 28, 2023
1 parent 824ee53 commit 8bebab7
Showing 2 changed files with 61 additions and 76 deletions.
133 changes: 60 additions & 73 deletions docs/velox-backend-limitations.md
This document describes the limitations of the Velox backend by listing known cases where an exception is thrown, Gluten behaves incompatibly with Spark, or a plan's execution must fall back to vanilla Spark.

### Override of Spark classes
Gluten avoids modifying Spark's existing code and uses Spark APIs where possible. However, some APIs aren't exposed in vanilla Spark, so we have to copy the Spark source files and make hardcoded changes. The list of overridden classes can be found as `ignoreClasses` in `package/pom.xml`. If you use a customized Spark, check whether those files are modified in your Spark; otherwise your changes will be overridden.

So you need to ensure the Gluten jar is loaded preferentially, so that it overrides the jar of vanilla Spark. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/oap-project/gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer).
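As an illustration, a minimal sketch of the relevant settings; the jar path is hypothetical and, in practice, these are usually passed at launch time rather than in code:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: place the Gluten bundle jar ahead of vanilla Spark's classes on both
// the driver and executor classpaths. The jar path below is hypothetical, and
// extraClassPath settings must be in place before the corresponding JVM starts
// (e.g. via spark-defaults.conf or spark-submit --conf).
val spark = SparkSession.builder()
  .appName("gluten-app")
  .config("spark.driver.extraClassPath", "/opt/gluten/gluten-velox-bundle.jar")
  .config("spark.executor.extraClassPath", "/opt/gluten/gluten-velox-bundle.jar")
  .getOrCreate()
```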


### Runtime BloomFilter

Velox's BloomFilter implementation is different from Spark's, so if `might_contain` falls back while `bloom_filter_agg` is offloaded to Velox, an exception like `java.io.IOException: Unexpected Bloom filter version number (512)` will be thrown.

Set the Gluten config `spark.gluten.sql.native.bloomFilter=false` to fall back to vanilla Spark's bloom filter; you can also disable the runtime bloom filter altogether by setting the Spark config `spark.sql.optimizer.runtime.bloomFilter.enabled=false`.
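For illustration, a sketch of the two configs at session level (in practice they may need to be supplied at startup via `--conf`):

```scala
// Fall back to vanilla Spark's bloom filter implementation.
spark.conf.set("spark.gluten.sql.native.bloomFilter", "false")

// Or disable the runtime bloom filter optimization entirely.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "false")
```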

### Fallbacks
Besides the unsupported operators, functions, file formats, and data sources, the following known cases also fall back to vanilla Spark.

#### ANSI
Gluten currently doesn't support ANSI mode. If ANSI is enabled, Spark plan's execution will always fall back to vanilla Spark.
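For example, a minimal sketch: once ANSI mode is enabled, even a simple query is executed by vanilla Spark rather than offloaded.

```scala
// Any plan compiled while ANSI mode is enabled falls back to vanilla Spark.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.range(10).selectExpr("id + 1").show()
```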

#### Case Sensitive mode
Gluten only supports Spark's default case-insensitive mode. If case-sensitive mode is enabled, users may get incorrect results.
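A sketch of the unsupported setting (the query is illustrative):

```scala
// Gluten assumes Spark's default case-insensitive resolution; enabling
// case sensitivity may silently produce incorrect results.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.sql("SELECT 1 AS a, 2 AS A").show()
```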

#### Lookaround pattern for regexp functions
In Velox, lookaround (lookahead/lookbehind) patterns are not supported by the RE2-based implementations of Spark functions such as `rlike`, `regexp_extract`, etc.
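For example, a hypothetical query whose lookahead pattern causes the expression to fall back:

```scala
import spark.implicits._

// The lookahead "(?=bar)" is not supported by Velox's RE2-based regex
// functions, so this filter is executed by vanilla Spark instead.
val df = Seq("foobar", "foobaz").toDF("s")
df.filter("s rlike 'foo(?=bar)'").show()
```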

#### FileSource format
Currently, Gluten only fully supports the Parquet file format and partially supports ORC. If another format is used, the scan operator falls back to vanilla Spark.

#### Partitioned Table Scan
Gluten only supports partitioned table scans when the file path contains the partition info; otherwise the scan falls back to vanilla Spark.
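A sketch assuming a Hive-style directory layout (hypothetical path), where the partition column `dt` is encoded in the file path and the scan can stay native:

```scala
// Files laid out as /data/events/dt=2023-11-28/part-*.parquet carry the
// partition value in the path, which is the layout Gluten can scan natively.
spark.read.parquet("/data/events")
  .where("dt = '2023-11-28'")
  .show()
```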

### Incompatible behavior
In certain cases, Gluten's results may differ from vanilla Spark's.

#### JSON functions
Velox only supports strings surrounded by double quotes, not single quotes, in JSON data. If single quotes are used, Gluten will produce incorrect results.
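For example (literals are illustrative): the double-quoted document is handled correctly, while the single-quoted one may yield a wrong result such as `null` under Gluten.

```scala
// Double-quoted JSON strings: handled correctly by Velox.
spark.sql("""SELECT get_json_object('{"a": "hello"}', '$.a')""").show()

// Single-quoted JSON strings: vanilla Spark tolerates them, but Gluten may
// return an incorrect result.
spark.sql("""SELECT get_json_object("{'a': 'hello'}", '$.a')""").show()
```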

#### Parquet read conf
Gluten supports `spark.files.ignoreCorruptFiles` and `spark.files.ignoreMissingFiles` only with their default value `false`; if set to `true`, the behavior is the same as with `false`. Gluten ignores `spark.sql.parquet.datetimeRebaseModeInRead` and only returns what is written in the parquet file, without considering the difference between the legacy
hybrid (Julian + Gregorian) calendar and the Proleptic Gregorian calendar. The result may therefore differ from vanilla Spark's.

#### Parquet write conf
Spark's `spark.sql.parquet.datetimeRebaseModeInWrite` config decides whether the legacy hybrid (Julian + Gregorian) calendar
or the Proleptic Gregorian calendar should be used when writing dates/timestamps to parquet. If the parquet file to be read was
written by Spark with the legacy rebase mode, Velox's TableScan will output different results when reading it back.
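A sketch of the round-trip (output path is hypothetical): a pre-Gregorian date written under the legacy rebase mode may read back differently through Velox's TableScan.

```scala
import spark.implicits._

// Write an old date using the legacy (Julian + Gregorian) rebase mode.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
Seq(java.sql.Date.valueOf("1500-01-01")).toDF("d")
  .write.mode("overwrite").parquet("/tmp/legacy_dates")

// Velox's TableScan does not apply the rebase when reading back, so the value
// may differ from what vanilla Spark returns.
spark.read.parquet("/tmp/legacy_dates").show()
```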

#### Partition write

Gluten only supports static partition writes and does not support dynamic partition writes.

```scala
spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d=2) SELECT 3 as e")
```
Gluten does not support dynamic partition writes or bucketed writes; an exception may be raised if you use them, e.g.,

```scala
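// Note: "f" below is assumed to be an output directory (e.g. a temporary
// directory) provided by the surrounding code.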
spark.range(100).selectExpr("id as c1", "id % 7 as p")
.write
.format("parquet")
.partitionBy("p")
.save(f.getCanonicalPath)
```

#### CTAS write

Velox does not support create table as select. An exception may be raised, e.g.,

```scala
spark.range(100).toDF("id")
.write
.format("parquet")
.saveAsTable("velox_ctas")
```
#### NaN support
Velox does NOT support NaN, so unexpected results can be obtained in a few cases, e.g., when comparing a number with NaN.
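For example (a sketch): vanilla Spark treats NaN as equal to itself and larger than any other double, so the queries below return `true` on vanilla Spark but may behave differently under Gluten.

```scala
// NaN semantics that may diverge from vanilla Spark when offloaded to Velox.
spark.sql("SELECT cast('NaN' AS double) = cast('NaN' AS double)").show()
spark.sql("SELECT cast('NaN' AS double) > 1.0").show()
```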

### Velox Parquet Write

#### Configuration

Parquet write only supports three configs; others will not take effect.

- sql conf: `spark.gluten.sql.native.parquet.write.blockRows`
- option: `parquet.block.rows`
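For illustration, a sketch of the two ways to pass the row-group row limit listed above (the value and output path are hypothetical):

```scala
val df = spark.range(1000000).toDF("id")

// As a session-level conf.
spark.conf.set("spark.gluten.sql.native.parquet.write.blockRows", "100000")

// Or as a per-write option.
df.write.mode("overwrite")
  .option("parquet.block.rows", "100000")
  .parquet("/tmp/native_write_out")
```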

### Fatal error caused by Spark's columnar reading
If the user enables Spark's columnar reading, an error can occur because Spark's columnar vector is not compatible with
Gluten's.

### Exception caused by file compression codec
Some compression codecs are not supported by Velox for certain file formats.
An exception occurs when Velox's TableScan is used to read files with an unsupported compression codec.

| File Format | none | zlib | zstd | snappy | lzo | lz4 | gzip |
|-------------|------|------|------|--------|-----|-----|------|
| Parquet | Y | N | Y | Y | N | N | Y |
| DWRF | Y | Y | Y | Y | Y | Y | N |

### Spill

`OutOfMemoryException` may still be triggered within the current implementation of the spill-to-disk feature when the number of shuffle partitions is set to a large value. When this happens, try reducing the partition number to get rid of the OOM.
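For example, a sketch of lowering the shuffle partition count for a session (the value is illustrative):

```scala
// Reducing the number of shuffle partitions can avoid the OOM described above.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```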

### Unsupported data types in ParquetScan

- Byte type causes a fallback to vanilla Spark
- Timestamp type

Only reading timestamps stored as INT96 with dictionary encoding is supported. When reading INT64-represented millisecond/microsecond timestamps, or INT96 timestamps with other encodings, exceptions can occur.
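For example, a sketch (hypothetical path) of writing timestamps in the supported INT96 physical type rather than INT64 millis/micros:

```scala
import spark.implicits._

// INT96 timestamps (with parquet's default dictionary encoding) stay readable
// by Velox's scan; INT64 millisecond/microsecond timestamps may raise
// exceptions instead.
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
Seq(java.sql.Timestamp.valueOf("2023-11-28 00:00:00")).toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_int96")
```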
4 changes: 1 addition & 3 deletions docs/velox-backend-support-progress.md
Gluten is still in active development. Here is a list of supported operators and functions.

Since the same function may have different semantics between Presto and Spark, Velox implements functions in a Presto category by default; if a function's semantics differ from Spark's, it is implemented in a Spark category. So Gluten will first use Velox's Spark category, and if a function isn't implemented there, it refers to the Presto category.

The total number of functions for [Spark 3.3 is 387](https://spark.apache.org/docs/latest/api/sql/); Gluten now supports 189 of them.

| Value | Description |
|--------------|-------------------------------------------------------------------------------------------|
