
New versions? Spark 3.3 and 3.4 - BETA? #265

Open
lukasz-kastelik opened this issue Sep 4, 2024 · 9 comments

@lukasz-kastelik

I represent one of Microsoft's large clients. We run a lot of processes on Azure Databricks and Azure SQL Database. The project I work on uses Spark 3.5 and Azure SQL; we export data from Databricks to Azure SQL. To make that work, we are forced to use an old version of Databricks (and Spark) that works with the GA driver.

The newest GA driver supports Spark 3.1. There are also BETA drivers for Spark 3.3 and 3.4. Are there any plans whatsoever to:

  1. Release a SQL driver that is compatible with Spark 3.5+?
  2. Upgrade the Spark 3.3 and 3.4 drivers to GA? Do you expect us to run beta drivers in production?
  3. Continue with the project's development in general? Nothing has happened here for months.

I am aware that this project is open-source. However, it is a critical component for Microsoft services like Azure SQL and Azure Databricks (probably Synapse/Fabric too). We expect Microsoft to take it seriously and push this project forward.

@boblannon

In fact, I don't even see the 3.4-compatible driver (1.4.0) on Maven?

[screenshot]

Databricks can't find 1.4.0 or 1.4.0-BETA either
[screenshot]

@dbeavon

dbeavon commented Oct 7, 2024

Microsoft is freeloading on OSS communities in the big data space. They don't contribute back to these projects in a significant way, but they are happy to re-brand massive amounts of OSS for the sake of their "Fabric" platform (e.g. Python notebooks, Apache Airflow, Apache Spark, and lots more!)

I don't think they care if you are representing one of their large clients. They just want you to move your stuff to "Fabric", which is where they are eager to make their new investments. This is almost certainly about profit margins. It is counterproductive if they help you to become more successful on Azure Databricks (since it will delay your migration to "Fabric").

On a more productive note, did you try just using the vanilla JDBC connector? See the end of this issue:
#191

... I believe there have been some improvements in JDBC that provide features you would otherwise get through the custom Spark connector. E.g., I believe the JDBC path now performs bulk loads, based on the observed performance. I haven't done any xevent traces to confirm yet, but will take a look this week.
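For reference, a plain-JDBC write of the kind suggested above might be sketched as follows. This is a hypothetical example: the server, database, and table names are placeholders, and only the option keys (`url`, `dbtable`, `driver`, `batchsize`) come from the Spark JDBC data-source documentation.

```python
# Hypothetical sketch of the "vanilla JDBC" approach to SQL Server.
def sqlserver_jdbc_options(server: str, database: str, table: str,
                           batchsize: int = 10000) -> dict:
    """Build the option map for a plain-JDBC write to SQL Server."""
    return {
        "url": f"jdbc:sqlserver://{server};databaseName={database}",
        "dbtable": table,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "batchsize": str(batchsize),  # rows sent per round trip
    }

# In a Spark job this would be applied roughly as:
#   (df.write.format("jdbc")
#      .options(**sqlserver_jdbc_options("myserver.database.windows.net",
#                                        "mydb", "dbo.Events"))
#      .mode("append")
#      .save())
```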

@Rafnel

Rafnel commented Oct 7, 2024

Hey @dbeavon, I happen to find myself in this thread around the same time as you, as I've spent all day investigating bulk inserts using Spark's JDBC driver. I don't think the base Spark JDBC driver supports bulk operations against SQL Server yet, so I'm not sure what this comment is based on: #191 (comment). I believe it's still doing single row-by-row inserts, although if you find out something different please do let me know. I unfortunately find myself using Spark for .NET, which has similarly been abandoned by MSFT; I was looking for a way to configure it to do bulk inserts into SQL Server and found this package, which also seems potentially abandoned, heh.

@dbeavon

dbeavon commented Oct 7, 2024

Hi @Rafnel ,

Yes, I just finished some investigation into xevents using Azure Data Studio, and so far I'm not seeing bulk inserts either.

In retrospect, I suspect the comment wasn't referring to the actual TDS bulk-insert protocol. The commenter was probably just impressed by the row-by-row inserts running on multiple executors. The performance of those inserts is admittedly pretty impressive; they can probably handle one million rows in 30 seconds, assuming adequate network infrastructure and no I/O throttling on data or logs.

.NET for Spark
Hope you don't give up on .NET for Spark. It is an uphill battle, but I'm convinced the project will be resurrected in some form. I think "Spark Connect" is going to be a well-accepted way to run a driver using .NET code, and it is only a matter of time before some other large team realizes how important it is for UDFs to be written in C# as well. It was the ability to write closures that really sold me on using C# for my UDFs. AOT performance in .NET 8 would be amazing, too.

... I'm still using .NET for Spark on HDI 5.1 running Spark 3.3. It was a bit tricky, but well worth the effort considering how much .NET code we would otherwise have to downgrade to Python, Scala, or Java. I'm also going to try to get .NET 8 working with AOT (still on .NET Core 3.1 at the moment).

@dbeavon

dbeavon commented Oct 7, 2024

@Rafnel Here is what I discovered about the base JDBC driver for Spark.

Looking at Wireshark, the network traffic is fairly efficient; the writes seem to be batched up.

[screenshot]

My guess is that the network traffic can be fine-tuned using the "batchsize" option, which is well documented and probably works for any JDBC database:
https://downloads.apache.org/spark/docs/3.3.1/sql-data-sources-jdbc.html

On the SQL Server side (receiving the data), things work in a less desirable way. It appears that the records are inserted one at a time after arriving over the network in a batch:

[screenshot]

I suspect the biggest problem with this approach is CPU usage on the SQL Server side. Ideally, inserting large amounts of data wouldn't bottleneck on CPU; the work would be pipelined so that most of the time is spent waiting on data and log I/O. With this approach I'm guessing the bottleneck will be CPU, and it will force a higher service tier.

Having said all this, the performance is very good, and I'm guessing it would be acceptable for 95% of scenarios, assuming you have money to spend on more CPU and you aren't inserting hundreds of millions of rows at a time.

In my case I also lock the whole table, and I think that is done once per executor session (SPID). I'm guessing that decreases the CPU impact of this approach to a certain degree; I haven't actually measured the difference yet.

```csharp
// Locking and FK enforcement.
LocalJdbcOptions.Add("tableLock", "true");
LocalJdbcOptions.Add("checkConstraints", p_CheckForeignConstraints ? "true" : "false");

// Size of batches sent per round trip.
LocalJdbcOptions.Add("batchsize", "10000");
```
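For readers not on .NET for Spark, the same options could be passed from PySpark roughly as in the sketch below. This is an assumption-laden example: the URL and table name are placeholders, the option keys mirror the C# snippet above (`tableLock` and `checkConstraints` belong to the custom connector rather than plain JDBC), and the `com.microsoft.sqlserver.jdbc.spark` format string is the data-source name this repo documents.

```python
# Hypothetical PySpark equivalent of the option map above.
connector_options = {
    "url": "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb",
    "dbtable": "dbo.Events",              # placeholder target table
    "tableLock": "true",                  # take a table lock per executor session
    "checkConstraints": "true",           # enforce constraints during the insert
    "batchsize": "10000",                 # size of batches sent per round trip
}

# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#   .options(**connector_options).mode("append").save()
```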

Hopefully this is clear. I plan to use plain JDBC by default and will always specify a very large batch size. As a last resort, if the bulk inserts get too large or use too much CPU, I'm hoping I can switch over to the 1.3.0 BETA in Maven.

Here is that 1.3.0:

https://mvnrepository.com/artifact/com.microsoft.azure/spark-mssql-connector_2.12
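As a sketch, attaching that coordinate to a Spark job would look something like the following. The exact version tag is an assumption based on this thread's discussion of the BETA; verify it against the Maven page above before using it.

```shell
# Hypothetical: pull the connector from Maven when submitting a job.
# Check the exact BETA version tag against mvnrepository before relying on it.
spark-submit \
  --packages com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA \
  my_job.py
```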

@dbeavon

dbeavon commented Oct 7, 2024

Oddly, the readme file claims there are Maven coordinates for 1.3.0 and 1.4.0, but we can't find those in Maven yet (only the BETA):

[screenshot]

... I'm guessing those just refer to the betas that are available here in this community (listed under Releases):

https://github.com/microsoft/sql-spark-connector/releases

At least this project is accepting PRs. That is a lot better than nothing.

Hopefully this project will get some TLC when Microsoft is preparing to release the next version of HDInsight.

@Rafnel

Rafnel commented Oct 8, 2024

Thanks for the detailed response, @dbeavon! Glad to have full confirmation on RBAR vs. bulk inserts with the standard JDBC connector.

We originally chose Spark for .NET because 90% of our stack is .NET, and we figured it would be best to follow that convention to lower the barrier for other devs using our project. For now we will stick with it, simply because rewriting the project isn't a high priority, so maybe Microsoft will revive it before we have time to seriously consider a rewrite. It is sad that they seem to have quietly abandoned some of these open-source projects.

@dbeavon

dbeavon commented Oct 9, 2024

@Rafnel Where do you host your .Net for Spark?

You might get tripped up when trying to push beyond Spark 3.1. I've spent about two days adding just a few lines of code to support 3.3 on the Scala and .NET sides of things. That's how these things go sometimes ;-) But I believe I'm now past the worst of it.

I really love .NET for Spark. Given that C#/.NET is being used for Blazor WASM (in a web browser), it can certainly be used on a Spark executor as well. I think someone just needs to work with the OSS Spark community and discover the real reason they object to a full-parity .NET interface.

When it comes down to the executor interop for Java/Python/R, there really isn't a ton of code there, and there isn't a good reason to leave C#/.NET out in the cold:

https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/api/java

[screenshot]

... I think it is just politics, and it might require us to simply keep nagging that community until they change their minds. Recently the Spark core was rewritten by Databricks (a.k.a. "Photon"). Since that happened, I suspect that even the JVM might require an external interop layer, i.e. it is in the exact same boat as C#/.NET and Python.

@RevngeSevnFold

I see a PR was opened in September to add Spark 3.5 support. :( But it seems like it's not being reviewed by anyone.
