
New versions? Spark 3.3 and 3.4 - BETA? #265

Open
lukasz-kastelik opened this issue Sep 4, 2024 · 9 comments

@lukasz-kastelik

I represent one of Microsoft's large clients. We run a lot of processes on Azure Databricks and Azure SQL Database. The project I work on uses Spark 3.5 and Azure SQL; we export data from Databricks to Azure SQL. To make that work, we are forced to use an old version of Databricks (and Spark) that works with the GA driver.

The newest GA driver supports Spark 3.1. There are also BETA drivers for Spark 3.3 and 3.4. Are there any plans whatsoever to:

  1. Release a SQL driver that is compatible with Spark 3.5+?
  2. Upgrade the Spark 3.3 and 3.4 drivers to GA? Do you expect us to run beta drivers in production?
  3. Continue with the project's development in general? Nothing has happened here for months.

I am aware that this project is open-source. However, it is a critical component for Microsoft services like Azure SQL and Azure Databricks (probably Synapse/Fabric too). We expect Microsoft to take it seriously and push this project forward.

@boblannon

In fact, I don't even see the 3.4-compatible driver (1.4.0) on Maven?

[screenshot]

Databricks can't find 1.4.0 or 1.4.0-BETA either
[screenshot]

@dbeavon

dbeavon commented Oct 7, 2024

Microsoft is freeloading on OSS communities in the big data space. They don't contribute back to these projects in a significant way, but they are happy to re-brand massive amounts of OSS for the sake of their "Fabric" platform (e.g. Python notebooks, Apache Airflow, Apache Spark, and lots more!)

I don't think they care if you are representing one of their large clients. They just want you to move your stuff to "Fabric", which is where they are eager to make their new investments. This is almost certainly about profit margins. It is counterproductive if they help you to become more successful on Azure Databricks (since it will delay your migration to "Fabric").

On a more productive note, did you try just using the vanilla JDBC connector? See the end of this issue:
#191

... I believe there have been some improvements in JDBC that provide features you would otherwise get through the custom Spark connector. E.g., I believe the JDBC path now performs bulk loads, based on the observed performance. I haven't done any xevent traces to confirm yet, but will take a look this week.
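For reference, a plain-JDBC write of the kind suggested above might be sketched as follows. This is a hypothetical example: the server, database, and table names are placeholders, and only the option keys (`url`, `dbtable`, `driver`, `batchsize`) come from the Spark JDBC data-source documentation.

```python
# Hypothetical sketch of the "vanilla JDBC" approach to SQL Server.
def sqlserver_jdbc_options(server: str, database: str, table: str,
                           batchsize: int = 10000) -> dict:
    """Build the option map for a plain-JDBC write to SQL Server."""
    return {
        "url": f"jdbc:sqlserver://{server};databaseName={database}",
        "dbtable": table,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "batchsize": str(batchsize),  # rows sent per round trip
    }

# In a Spark job this would be applied roughly as:
#   (df.write.format("jdbc")
#      .options(**sqlserver_jdbc_options("myserver.database.windows.net",
#                                        "mydb", "dbo.Events"))
#      .mode("append")
#      .save())
```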

@Rafnel

Rafnel commented Oct 7, 2024

Hey @dbeavon, I happen to find myself in this thread around the same time as you, as I've spent all day investigating bulk inserts using Spark's JDBC driver. I don't think the base Spark JDBC driver supports bulk operations against SQL Server yet, so I'm not sure what this comment is based on: #191 (comment). I believe it's still doing single row-by-row inserts, although if you find out something different please do let me know. I unfortunately find myself using Spark for .NET, which has similarly been abandoned by MSFT; I was looking for a way to configure it to do bulk inserts into SQL Server and found this package, which also seems potentially abandoned, heh.

@dbeavon

dbeavon commented Oct 7, 2024

Hi @Rafnel ,

Yes, I just finished some investigation into xevents using Azure Data Studio, and so far I'm not seeing bulk inserts either.

In retrospect, I suspect the comment wasn't referring to the actual TDS bulk-insert protocol. The commenter was probably just impressed by the row-by-row inserts running on multiple executors. The performance of those inserts is admittedly pretty impressive; they can probably handle one million rows in 30 seconds, assuming adequate network infrastructure and no I/O throttling on data or logs.

.NET for Spark
Hope you don't give up on .NET for Spark. It is an uphill battle, but I'm convinced the project will be resurrected in some form. I think "Spark Connect" is going to be a well-accepted way to run a driver using .NET code, and it is only a matter of time before some other large team realizes how important it is for UDFs to be written in C# as well. It was the ability to write closures that really sold me on using C# for my UDFs. AOT performance in .NET 8 would be amazing, too.

... I'm still using .NET for Spark on HDI 5.1 running Spark 3.3. It was a bit tricky, but well worth the effort considering how much .NET code we would otherwise have to downgrade to Python, Scala, or Java. I'm also going to try to get .NET 8 working with AOT (still on .NET Core 3.1 at the moment).

@dbeavon

dbeavon commented Oct 7, 2024

@Rafnel Here is what I discovered about the base JDBC driver for Spark.

Looking at Wireshark, the network traffic is fairly efficient; the writes seem to be batched up.

[screenshot]

My guess is that the network traffic can be fine-tuned using the "batchsize" option, which is well documented and probably works for any JDBC database:
https://downloads.apache.org/spark/docs/3.3.1/sql-data-sources-jdbc.html

On the SQL Server side (receiving the data), things work in a less desirable way. It appears that the records are inserted one at a time after arriving over the network in a batch:

[screenshot]

I suspect the biggest problem with this approach is CPU usage on the SQL Server side. Ideally, inserting large amounts of data wouldn't bottleneck on CPU; the work would be pipelined so that most of the time is spent waiting on data and log I/O. With this approach I'm guessing the bottleneck will be CPU, and it will force a higher service tier.

Having said all this, the performance is very good, and I'm guessing it would be acceptable for 95% of scenarios, assuming you have money to spend on more CPU and you aren't inserting hundreds of millions of rows at a time.

In my case I also lock the whole table, and I think that is done once per executor session (SPID). I'm guessing that decreases the CPU impact of this approach to a certain degree; I haven't actually measured the difference yet.

```csharp
// Locking and FK enforcement.
LocalJdbcOptions.Add("tableLock", "true");
LocalJdbcOptions.Add("checkConstraints", p_CheckForeignConstraints ? "true" : "false");

// Size of batches sent per round trip.
LocalJdbcOptions.Add("batchsize", "10000");
```
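For readers not on .NET for Spark, the same options could be passed from PySpark roughly as in the sketch below. This is an assumption-laden example: the URL and table name are placeholders, the option keys mirror the C# snippet above (`tableLock` and `checkConstraints` belong to the custom connector rather than plain JDBC), and the `com.microsoft.sqlserver.jdbc.spark` format string is the data-source name this repo documents.

```python
# Hypothetical PySpark equivalent of the option map above.
connector_options = {
    "url": "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb",
    "dbtable": "dbo.Events",              # placeholder target table
    "tableLock": "true",                  # take a table lock per executor session
    "checkConstraints": "true",           # enforce constraints during the insert
    "batchsize": "10000",                 # size of batches sent per round trip
}

# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#   .options(**connector_options).mode("append").save()
```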

Hopefully this is clear. I plan to use plain JDBC by default and will always specify a very large batch size. As a last resort, if the bulk inserts get too large or use too much CPU, I'm hoping I can switch over to the 1.3.0 BETA in Maven.

Here is that 1.3.0:

https://mvnrepository.com/artifact/com.microsoft.azure/spark-mssql-connector_2.12
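As a sketch, attaching that coordinate to a Spark job would look something like the following. The exact version tag is an assumption based on this thread's discussion of the BETA; verify it against the Maven page above before using it.

```shell
# Hypothetical: pull the connector from Maven when submitting a job.
# Check the exact BETA version tag against mvnrepository before relying on it.
spark-submit \
  --packages com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA \
  my_job.py
```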

@dbeavon

dbeavon commented Oct 7, 2024

Oddly, the readme file claims there are Maven coordinates for 1.3.0 and 1.4.0, but we can't find those in Maven yet (only the BETA):

[screenshot]

... I'm guessing those just refer to the betas that are available here in this community (listed under Releases):

https://github.com/microsoft/sql-spark-connector/releases

At least this project is accepting PRs. That is a lot better than nothing.

Hopefully this project will get some TLC when Microsoft is preparing to release the next version of HDInsight.

@Rafnel

Rafnel commented Oct 8, 2024

Thanks for the detailed response, @dbeavon! Glad to have full confirmation on RBAR vs. bulk inserts with the standard JDBC connector.

We originally chose Spark for .NET because 90% of our stack is .NET, and we figured it would be best to follow that convention to lower the barrier for other devs using our project. For now we will stick with it, simply because rewriting the project isn't a high priority, so maybe Microsoft will revive it before we have time to seriously consider a rewrite. It is sad that they seem to have quietly abandoned some of these open-source projects.

@dbeavon

dbeavon commented Oct 9, 2024

@Rafnel Where do you host your .Net for Spark?

You might get tripped up when trying to push beyond Spark 3.1. I've spent about two days adding just a few lines of code to support 3.3 on the Scala and .NET sides of things. That's how these things go sometimes ;-) But I believe I'm now past the worst of it.

I really love .NET for Spark. Given that C#/.NET is being used for Blazor WASM (in a web browser), it can certainly be used on a Spark executor as well. I think someone just needs to work with the OSS Spark community and discover the real reason they object to a full-parity .NET interface.

When it comes down to the executor interop for Java/Python/R, there really isn't a ton of code there, and there isn't a good reason to leave C#/.NET out in the cold:

https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/api/java

[screenshot]

... I think it is just politics, and it might require us to simply keep nagging that community until they change their minds. Recently the Spark core was rewritten by Databricks (a.k.a. "Photon"). Since that happened, I suspect that even the JVM might require an external interop layer, i.e. it is in the exact same boat as C#/.NET and Python.

@RevngeSevnFold

I see a PR was opened in September to add Spark 3.5 support. :( But it seems like it's not being reviewed by anyone.
