
SNOW-1563484 Remove Advanced Query Pushdown Feature #572

Merged: 9 commits merged into master from snow-1563484 on Jul 30, 2024

Conversation

sfc-gh-bli
Contributor

Remove Advanced Query Pushdown Feature

@sfc-gh-bli sfc-gh-bli requested review from sfc-gh-yuwang and removed request for sfc-gh-ema July 29, 2024 23:13
@sfc-gh-bli sfc-gh-bli merged commit 8350590 into master Jul 30, 2024
26 checks passed
@sfc-gh-bli sfc-gh-bli deleted the snow-1563484 branch July 30, 2024 21:54
@jalpan-randeri

@sfc-gh-bli , @sfc-gh-yuwang

Can you share some insight into why advanced pushdown was removed? It is creating conflicts with PR #535, which has been open and pending review since Nov. 2023, despite multiple requests for review.

How will this change improve performance?

@satyanash

satyanash commented Aug 2, 2024

Removed the Advanced Query Pushdown feature

@sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?

A conversion tool that can convert DataFrames between Spark and Snowpark will be introduced in an upcoming Spark connector release soon. It will be an alternative to the Advanced Query Pushdown feature.

It is unclear whether this will work from any Spark cluster using the spark-snowflake connector, or only with Snowpark.

It would be great to have clarification on the timeline and on how "soon" this will be available.

@parthshyara

@sfc-gh-bli @sfc-gh-yuwang
We rely heavily on the spark-snowflake connector, hence the push for Spark 3.5 support, but the removal of Advanced Query Pushdown is quite a surprise. Can you please share the details of why this feature was removed and what the plan is for an alternative? The update on the releases page is quite ambiguous and does not provide clarity.

@sfc-gh-bli
Contributor Author

@sfc-gh-bli , @sfc-gh-yuwang

Can you share some insight into why advanced pushdown was removed? It is creating conflicts with PR #535, which has been open and pending review since Nov. 2023, despite multiple requests for review.

How will this change improve performance?

We decided to remove the Advanced Query Pushdown feature because:

  1. The coverage of the Advanced Query Pushdown feature has declined over the past years.
    Not all Spark logical plans can be converted to Snowflake SQL queries. The Advanced Query Pushdown feature worked well when it was introduced about 7 years ago; at that time, it could convert most Spark DataFrame operations to Snowflake SQL. However, Spark has introduced many internal changes since then, and the coverage of Advanced Query Pushdown has declined. This issue is even more serious in Spark 3.5.
  2. The behavior is unpredictable.
    The behaviors of Spark and Snowflake are not exactly the same. With Advanced Query Pushdown, users can't predict whether an operation will be processed in Spark, in Snowflake, or in a mix of both. A small change to a DataFrame operation may mean the query can no longer be pushed down to Snowflake (see the sketch after this list).
  3. One connector artifact may be compatible with only one Spark release.
    To support multiple Spark releases, we have to generate multiple connector artifacts. We only support the latest three major releases of Spark, but the situation keeps getting more complicated: for example, the artifact compiled with Spark 3.4.0 doesn't work with Spark 3.4.1, because Spark changed the logical plan design even in a patch release.
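
To illustrate point 2, here is a hypothetical sketch (the column name, options, and UDF are invented for this illustration): a plain comparison has a Snowflake SQL equivalent, but a Scala UDF does not, so a nearly identical query silently falls back to Spark-side processing.

import org.apache.spark.sql.functions.{col, udf}

val df = spark.read.format("snowflake").options(...).load()

// A plain comparison can be translated to Snowflake SQL, so this filter
// may be pushed down and executed in Snowflake.
df.filter(col("amount") > 100).collect()

// A Scala UDF has no Snowflake SQL equivalent, so this nearly identical
// query is processed in the Spark cluster instead.
val doubled = udf((x: Long) => x * 2)
df.filter(doubled(col("amount")) > 100).collect()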

The improvements from removing the Advanced Query Pushdown feature:

  1. One artifact can be compatible with multiple Spark releases. When Spark releases a new version, for example 3.6.0, users don't have to wait for support. And if a user still runs a very old version of Spark, for example 2.1.0, they can continue to upgrade the connector in most cases.
  2. Users can now choose where DataFrame operators are processed: in Spark or in Snowflake. Some operators behave differently in Spark and Snowflake, and as for performance, some operators are faster in Snowflake while others are faster in Spark. With the new Snowpark conversion tool, or by creating a DataFrame from a query, users can decide which operators should be processed in Snowflake and which in Spark (see the sketch below).
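
As a minimal sketch of the query-based approach (the SQL text and option values are placeholders), everything inside the query string is executed in Snowflake, and everything applied to the resulting DataFrame runs in the Spark cluster:

// The join and aggregation in the query text are processed in Snowflake.
val df = spark.read
  .format("snowflake")
  .options(...) // connection options: account, user, role, etc.
  .option("query", "SELECT c.id, SUM(o.total) AS total FROM orders o JOIN customers c ON o.customer_id = c.id GROUP BY c.id")
  .load()

// Everything from here on is processed in the Spark cluster.
df.filter(...).collect()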

@sfc-gh-bli
Contributor Author

Removed the Advanced Query Pushdown feature

@sfc-gh-bli @sfc-gh-yuwang This is quite unexpected! What is the justification for removing this feature?

A conversion tool that can convert DataFrames between Spark and Snowpark will be introduced in an upcoming Spark connector release soon. It will be an alternative to the Advanced Query Pushdown feature.

It is unclear whether this will work from any Spark cluster using the spark-snowflake connector, or only with Snowpark.

It would be great to have clarification on the timeline and on how "soon" this will be available.

The conversion tool should work with any Spark cluster where the Spark connector works now. It is pretty similar to the Advanced Query Pushdown feature; for example, loading data from Snowflake to Spark.

With Advanced Query Pushdown:

val df = spark.read.format("snowflake").options(...).load()
df.select(...).filter(...).union(...).join(...).collect() // the connector will try to push these operators down, but pushdown is not guaranteed

With the conversion tool:

val snowparkDataFrame = snowpark.table(...).select(...).filter(...).union(...).join(...) // all of these operators will be processed in Snowflake
val sparkDataFrame = toSpark(snowparkDataFrame, sparkSession)
// all operations on sparkDataFrame will be processed in the Spark cluster

Unlike Advanced Query Pushdown, the new conversion tool also supports Spark-to-Snowpark conversion, for example:

val sparkDataFrame = ...
val snowparkDataFrame = toSnowpark(sparkDataFrame, snowparkSession) // all operators on snowparkDataFrame will be processed in Snowflake

how "soon" this will be available.

We are working on it now. It will be available in September, in connector 3.1.0.

@sfc-gh-bli
Contributor Author

sfc-gh-bli commented Aug 5, 2024

@sfc-gh-bli @sfc-gh-yuwang We rely heavily on the spark-snowflake connector, hence the push for Spark 3.5 support, but the removal of Advanced Query Pushdown is quite a surprise. Can you please share the details of why this feature was removed and what the plan is for an alternative? The update on the releases page is quite ambiguous and does not provide clarity.

During the development of Spark 3.5 support, we saw many internal changes to the Spark logical plan and internal row system, which significantly reduced the coverage of Advanced Query Pushdown. We also saw some wrong results due to the changes in the internal row system.
Removing Advanced Query Pushdown was a long discussion. To speed up support for Spark 3.5 and future Spark releases, we finally decided to remove this feature.
We will continue to support connector 2.x.x, which still has the Advanced Query Pushdown feature, for up to two years. The 2.x.x branch is https://github.com/snowflakedb/spark-snowflake/tree/v2_master
However, it is only compatible with Spark 3.2, 3.3, and 3.4.0 (not 3.4.1).

There are two alternatives to Advanced Query Pushdown:
1. Instead of loading a DataFrame directly from dbtable, load it from a query. Those SQL queries will be processed in Snowflake. So if you already use query more than dbtable in your workload, Advanced Query Pushdown may have been a useless feature in your case.
2. Use the Snowpark/Spark conversion tool, which will be introduced in connector 3.1.0. You can build a Snowpark DataFrame first and then convert it to a Spark DataFrame. The operations of a Snowpark DataFrame are always processed on the Snowflake side.

@parthshyara

@sfc-gh-bli We are still waiting on the new release of the connector with Snowpark integration to evaluate whether we can use it. Can you please point us to an issue / PR where the progress is being tracked? The initial estimate for this was September.
