
ORC-1743: Upgrade Spark to 4.0.0-preview1 #1909

Closed
cxzl25 wants to merge 6 commits into main from cxzl25:support_spark_4

Conversation

@cxzl25 (Contributor) commented Apr 24, 2024

### What changes were proposed in this pull request?

This PR aims to upgrade the benchmark module to use Spark 4.0.0-preview1.

### Why are the changes needed?

### How was this patch tested?

GA

### Was this patch authored or co-authored using generative AI tooling?

No

@cxzl25 cxzl25 marked this pull request as draft April 24, 2024 08:42
```xml
<exclude>META-INF/DUMMY.DSA</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
```
@cxzl25 (Contributor, PR author) commented:
```
[WARNING] eclipse-collections-11.1.0.jar, eclipse-collections-api-11.1.0.jar define 4 overlapping resources:
[WARNING]   - LICENSE-EDL-1.0.txt
[WARNING]   - LICENSE-EPL-1.0.txt
[WARNING]   - META-INF/ECLIPSE_.RSA
[WARNING]   - META-INF/ECLIPSE_.SF
```
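Not part of this PR, but as a quick local sanity check one could scan the shaded benchmark jar for any signature entries that the excludes above are meant to drop. A minimal, hypothetical helper (the class name is made up; the default jar path is the one used by the benchmark command quoted later in this thread):

```java
import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ShadedJarSignatureCheck {
  public static void main(String[] args) throws IOException {
    // Hypothetical default path, taken from the benchmark invocation in this thread; adjust as needed.
    String jarPath = args.length > 0 ? args[0]
        : "spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar";
    try (JarFile jar = new JarFile(jarPath)) {
      jar.stream()
          .map(JarEntry::getName)
          .filter(name -> name.startsWith("META-INF/")
              && (name.endsWith(".SF") || name.endsWith(".DSA") || name.endsWith(".RSA")))
          .forEach(name -> System.out.println("Leftover signature entry: " + name));
    }
  }
}
```

If this prints nothing, the shade filter stripped the signed-jar metadata that would otherwise trigger signature-verification errors when running the uber-jar.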

@cxzl25 (Contributor, PR author) commented:

Spark has already fixed this problem by upgrading its arrow-vector dependency, so no change is needed here:

[SPARK-47981][BUILD] Upgrade Arrow to 16.0.0

```diff
@@ -74,7 +74,7 @@
 @BenchmarkMode(Mode.AverageTime)
 @OutputTimeUnit(TimeUnit.MICROSECONDS)
 @AutoService(OrcBenchmark.class)
-@Fork(jvmArgsAppend = "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED")
+@Fork(jvmArgsAppend = {"--add-opens=java.base/sun.nio.ch=ALL-UNNAMED", "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED"})
```
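For context, a minimal JMH sketch (hypothetical class name and placeholder workload, not code from this PR) showing how the widened @Fork flags sit on a benchmark class; the real SparkBenchmark additionally carries @AutoService(OrcBenchmark.class) and drives Spark reads:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(jvmArgsAppend = {
    // Pre-existing flag, carried over from the original annotation.
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
    // Flag added by this PR to fix the sun.util.calendar IllegalAccessException shown below.
    "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED"})
public class SparkReadBenchmarkSketch {

  @Benchmark
  public void partialRead(Blackhole blackhole) {
    // Placeholder workload; the real benchmark reads ORC/Parquet data through Spark.
    blackhole.consume(System.nanoTime());
  }
}
```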
@cxzl25 (Contributor, PR author) commented:

```
Caused by: java.lang.IllegalAccessException: symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo, from interface org.apache.spark.sql.catalyst.util.SparkDateTimeUtils (unnamed module @2b71fc7e)
	at java.base/java.lang.invoke.MemberName.makeAccessException(MemberName.java:955)
	at java.base/java.lang.invoke.MethodHandles$Lookup.checkSymbolicClass(MethodHandles.java:3686)
	at java.base/java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:3646)
	at java.base/java.lang.invoke.MethodHandles$Lookup.findVirtual(MethodHandles.java:2680)
	at org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.$init$(SparkDateTimeUtils.scala:206)
	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.<clinit>(DateTimeUtils.scala:41)
```
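The stack trace shows a MethodHandles lookup in Spark's SparkDateTimeUtils being rejected because sun.util.calendar is not opened to the unnamed module on JDK 17. A hypothetical stand-alone reproducer of the same kind of failure (the getOffsetsByWall signature comes from the JDK's internal ZoneInfo class; treating it as the member Spark resolves is an assumption):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.TimeZone;

public class ZoneInfoAccessRepro {
  public static void main(String[] args) throws Throwable {
    // Fails on JDK 17 with "symbolic reference class is not accessible" unless the JVM is
    // started with --add-opens=java.base/sun.util.calendar=ALL-UNNAMED (the flag added above).
    Class<?> zoneInfoClass = Class.forName("sun.util.calendar.ZoneInfo");
    MethodHandle getOffsetsByWall = MethodHandles.lookup().findVirtual(
        zoneInfoClass, "getOffsetsByWall",
        MethodType.methodType(int.class, long.class, int[].class));

    // With the package opened, the handle can be invoked on a regular TimeZone instance,
    // which is a sun.util.calendar.ZoneInfo for standard zone IDs.
    int[] offsets = new int[2];
    int offsetMillis = (int) getOffsetsByWall.invoke(
        TimeZone.getTimeZone("America/Los_Angeles"), System.currentTimeMillis(), offsets);
    System.out.println("Offset from UTC (ms): " + offsetMillis);
  }
}
```

Running it with and without the flag mirrors the benchmark's behavior before and after this change.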

@dongjoon-hyun (Member) left a comment:

Thank you for testing this. 😄

I'd recommend creating a JIRA for the migration to Scala 2.13 of Apache Spark 3.5.1 first. :)

dongjoon-hyun pushed a commit that referenced this pull request Apr 25, 2024
…mark

### What changes were proposed in this pull request?
This PR aims to migrate to Scala 2.13 of Apache Spark 3.5.1 at SparkBenchmark.

### Why are the changes needed?
#1909 (review)

### How was this patch tested?
local test

```bash
java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -format=parquet  -compress zstd -data taxi
```

```
Benchmark                                  (compression)  (dataset)  (format)  Mode  Cnt          Score       Error  Units
SparkBenchmark.partialRead                          zstd       taxi   parquet  avgt    5      17211.731 ± 11836.315  us/op
SparkBenchmark.partialRead:bytesPerRecord           zstd       taxi   parquet  avgt    5          0.002                  #
SparkBenchmark.partialRead:ops                      zstd       taxi   parquet  avgt    5         10.000                  #
SparkBenchmark.partialRead:perRecord                zstd       taxi   parquet  avgt    5          0.001 ±     0.001  us/op
SparkBenchmark.partialRead:records                  zstd       taxi   parquet  avgt    5  113791180.000                  #
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1912 from cxzl25/ORC-1704.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 25, 2024
…mark

(Cherry-pick of the commit above; same commit message.)

(cherry picked from commit dc634cb)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member) commented:
Hi, @cxzl25. Sorry for asking this, but could you rebase this PR once more?

@cxzl25 cxzl25 force-pushed the support_spark_4 branch from 8afc779 to a988573 Compare May 1, 2024 02:12
@dongjoon-hyun (Member) commented:
Thank you!

dongjoon-hyun pushed a commit that referenced this pull request May 1, 2024
…nchmark runs on JDK17

### What changes were proposed in this pull request?
This PR aims to fix `sun.util.calendar` IllegalAccessException when SparkBenchmark runs on JDK17.

### Why are the changes needed?
#1909 (comment)

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1919 from cxzl25/ORC-1707.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request May 1, 2024
…nchmark runs on JDK17

(Cherry-pick of the commit above; same commit message.)

(cherry picked from commit 5bb2346)
Signed-off-by: Dongjoon Hyun <[email protected]>
@cxzl25 cxzl25 force-pushed the support_spark_4 branch from eff9a57 to 590c8f3 Compare May 1, 2024 16:08
@cxzl25 cxzl25 force-pushed the support_spark_4 branch from d1d9f38 to 7b3cdb0 Compare June 3, 2024 13:04
@dongjoon-hyun dongjoon-hyun changed the title from Test Spark 4.0.0-SNAPSHOT to Test Spark 4.0.0-preview1 Jul 11, 2024
@dongjoon-hyun (Member) commented Jul 11, 2024

If there is no further issue, shall we finalize this PR and merge with 4.0.0-preview1, @cxzl25? Please let me know if this is still a draft because of some blocker.

@cxzl25 cxzl25 changed the title from Test Spark 4.0.0-preview1 to ORC-1743: Upgrade Spark to 4.0.0-preview1 Jul 12, 2024
@cxzl25 cxzl25 marked this pull request as ready for review July 12, 2024 03:09
@cxzl25 cxzl25 force-pushed the support_spark_4 branch from 7b3cdb0 to dd74bf5 Compare July 12, 2024 03:17
@cxzl25 (Contributor, PR author) commented Jul 12, 2024

> If there is no further issue, shall we finalize this PR and merge with 4.0.0-preview1

I did some validation testing locally and I think the PR is ready to be merged.

@dongjoon-hyun (Member) commented:
Thank you!

@dongjoon-hyun dongjoon-hyun added this to the 2.0.2 milestone Jul 12, 2024
@dongjoon-hyun (Member) left a comment:

+1, LGTM.
Merged to main/2.0.

dongjoon-hyun pushed a commit that referenced this pull request Jul 12, 2024
### What changes were proposed in this pull request?
This PR aims to upgrade the benchmark module to use Spark 4.0.0-preview1.

### Why are the changes needed?

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1909 from cxzl25/support_spark_4.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bcb25fa)
Signed-off-by: Dongjoon Hyun <[email protected]>