
Upgrade to modern spark (3.4) #166

Merged: 16 commits into target:master on Apr 10, 2024

Conversation

@holdenk (Contributor) commented Aug 29, 2023

Hey y'all, I poked around in the project. I don't know if this is still being actively developed, but in case folks are interested I upgraded it to Spark 3.4. I've only done the tests with the built-in test suite, and it's my first time poking around in the project, so there could be some mistakes.

@holdenk requested review from c-horn and a team as code owners (August 29, 2023 07:06)
@holdenk requested review from colindean and abs428 (August 29, 2023 07:06)
@colindean (Collaborator) commented

I really appreciate the contribution — first outside contribution in some time! For our purposes, DV must continue working with Spark 2.3 and JDK8, so we can't bump versions as you have proposed.

Can you think of a way to test/qualify this against both Spark 2.3 and 3.4? As a potential starting point, I started on cross-building last year but haven't had the bandwidth to complete the work.

@holdenk (Contributor, Author) commented Aug 29, 2023

So one lazy option would be to make a Spark 3 branch and a Spark 2 branch and publish a "-3" version of the artifact, rather than a "proper" cross-build.
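For illustration, a minimal sketch of what that branch approach might look like in sbt; the base name "data-validator" is assumed from the repo name and the Spark patch release is arbitrary:

```scala
// On the hypothetical Spark 3 branch: publish under a "-3" suffix so the
// Spark 2 and Spark 3 artifacts can coexist in the same repository.
name := "data-validator-3"

// Pin the branch to a Spark 3.x release; 3.4.1 is illustrative.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.1" % Provided
```

The Spark 2 branch would keep the unsuffixed name, with both branches sharing a version number so releases stay in lockstep.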

@c-horn (Collaborator) commented Aug 29, 2023

I could potentially take a look at finishing the cross-build work, @colindean. Are there still users of this internally at Target who can't switch to Spark 3.x + Docker?

Also, many thanks @holdenk for keeping this project and others in the Spark ecosystem in mind.

@holdenk (Contributor, Author) commented Aug 30, 2023

So I was thinking the main "annoying" thing in a cross-build for this is probably the tests and maybe the quoted identifiers. We could probably make it cross-build easily, although my default is the lazy option (branches).
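As a rough illustration of the non-lazy route, one common way to handle version-specific code in a single build is a per-version source directory. Everything here (the sparkVersion key, directory names) is a sketch, not the project's actual build:

```scala
// Hypothetical: a setting for the Spark version, supplied from outside sbt.
val sparkVersion = settingKey[String]("Spark version to build against")
sparkVersion := sys.props.getOrElse("sparkVersion", "3.4.1")

// Compile Spark-2-only or Spark-3-only sources (e.g. differing
// quoted-identifier handling) from a directory picked at build time.
Compile / unmanagedSourceDirectories += {
  val base = (Compile / sourceDirectory).value
  if (sparkVersion.value.startsWith("3")) base / "scala-spark3"
  else base / "scala-spark2"
}
```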

@colindean (Collaborator) commented

> So one lazy option would be to make a Spark 3 branch and a Spark 2 branch and publish a "-3" version of the artifact, rather than a "proper" cross-build.

That's a good idea, but it introduces some operational complexity to be recalled at release time. I prefer the technical complexity of an sbt cross-build, especially if…

> I could potentially take a look at finishing the cross-build work, @colindean.

> So I was thinking the main "annoying" thing in a cross-build for this is probably the tests and maybe the quoted identifiers. We could probably make it cross-build easily, although my default is the lazy option (branches).

❤️ Yes, that would be awesome. I don't remember what was going awry when I left off with it. I welcome the contribution, including pairing if either of you would like (or the two of you could pair with each other).

> Are there still users of this internally at Target who can't switch to Spark 3.x + Docker?

I've not seen pressure from DV users to support Spark 3, but most DV users who have identified themselves are internal to Target and stuck on our Spark 2.3 cluster while we await the production launch of our internal platform enabling use of Spark 3 + Docker.

> Also, many thanks @holdenk for keeping this project and others in the Spark ecosystem in mind.

Yes, indeed! I'm very happy to see this and your continued work in the space, and to see another former IBMer around!

@holdenk (Contributor, Author) commented Aug 31, 2023

My other project where I do more extensive cross-building is https://github.com/holdenk/spark-testing-base/blob/main/build.sbt. It keeps the actual Spark versions outside of sbt (passed in as a parameter) and does a "classic" sbt cross-build for the Scala versions. What do folks think of something like that?
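A hedged sketch of that pattern, loosely modeled on the spark-testing-base build linked above (version numbers and key names are illustrative): the Spark version arrives as a JVM property, and only the Scala axis uses sbt's built-in cross-build.

```scala
// Spark version comes from outside sbt, e.g. sbt -DsparkVersion=2.3.4 +test
val sparkVersion = settingKey[String]("Spark version to build against")
sparkVersion := sys.props.getOrElse("sparkVersion", "3.4.1")

// Scala axis: Spark 2.3 builds only on 2.11; Spark 3.4 on 2.12/2.13.
crossScalaVersions := {
  if (sparkVersion.value.startsWith("3")) Seq("2.12.18", "2.13.12")
  else Seq("2.11.12")
}
scalaVersion := crossScalaVersions.value.head

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion.value % Provided
```

CI would then run the matrix as, e.g., `sbt -DsparkVersion=2.3.4 +test` and `sbt -DsparkVersion=3.4.1 +test`.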

Also thanks for being so friendly y'all :)

@holdenk (Contributor, Author) commented Aug 31, 2023

Also, slightly off topic (in the context of having a large pre-Spark 3 codebase): some of my coworkers and I have been working on tooling at https://github.com/holdenk/spark-upgrade that might be useful once y'all have Spark 3 ready for use :)

@holdenk (Contributor, Author) commented Sep 3, 2023

I've done a bit of work to see if folks agree with the direction; if so, I'll update the tests and such :)

@holdenk (Contributor, Author) commented Sep 11, 2023

Hey, just following up: @c-horn, do you have some bandwidth to review this?

@c-horn (Collaborator) commented Sep 12, 2023

Hi, sorry, my free-time plans got ahead of me; I'll be out of the country for a few weeks starting Wednesday. I'll try to take a look at this tonight or tomorrow.

@colindean (Collaborator) commented

It's been a while, but I'm hoping to revisit this mid-February. I shook out some backed-up update PRs and will take a close look at this when I return from FOSDEM. @holdenk, I noted that you're speaking there, and I hope to attend your talk and meet you!

@holdenk (Contributor, Author) commented Jan 30, 2024

Sounds awesome, maybe we can grab some coffee :)

@colindean (Collaborator) commented

Good news! The requirement to support Spark 2 is now gone as my org is moving to Spark 3 by the end of May. I'm going to return to this in the coming days/weeks as I've gotten a few requests for it.

@colindean (Collaborator) commented

We're going into Spark 3 at v3.5.1 and still supporting Scala 2.12 and 2.13.

@colindean (Collaborator) commented

#186 sets a new baseline. I want to get this merged (probably with the suggestions to move to Spark 3.5.1 applied) and then continue from there.

@jaspal1404 commented Apr 10, 2024

@colindean @holdenk - Here is the change required in build.sbt to make DV work with Spark 3.x. I tested it and it's working:

current circe-yaml version: 0.14.2
new circe-yaml version: 1.15.0

Change line 35 in build.sbt from 0.14.2 to 1.15.0.

Another thing I changed was line 6, as below (it may not be needed; I wasn't sure how we're getting the Spark version from the environment):

sparkVersion := "3.5.1"
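Taken together, a hedged sketch of the two edits (the real build.sbt lines may be shaped differently; this assumes circe-yaml is declared as an ordinary libraryDependencies entry):

```scala
// Line 6: pin the Spark version instead of reading it from the environment.
// sparkVersion is assumed to be the project's existing setting key.
sparkVersion := "3.5.1"

// Line 35: bump circe-yaml from 0.14.2 to 1.15.0 (the library's versioning
// scheme jumped from 0.14.x straight to 1.x).
libraryDependencies += "io.circe" %% "circe-yaml" % "1.15.0"
```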

Commit: "There was no 0.14.6 for circe-yaml"
@colindean enabled auto-merge April 10, 2024 21:40
@colindean merged commit b65ec2a into target:master on Apr 10, 2024 (1 check passed)
@colindean (Collaborator) commented

Thank you so much, @holdenk, for submitting this. I'm sorry it took a while for priorities to bubble up. I'm glad that we were able to merge it eventually.

@jaspal1404 left a comment:

looks good!
