[Enh]: Add Support for Snowpark #1419

Open
rwhitten577 opened this issue Nov 21, 2024 · 1 comment
Labels
blocked enhancement New feature or request

Comments


rwhitten577 commented Nov 21, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

I'm building a platform for feature engineering across several contexts (batch, stream, on-demand). Today I use pandas for on-demand and often Snowpark for batch, but I need to run large backfills of the on-demand feature logic. I do the heavy lifting (joins and aggregations) from data sources in Snowpark before pulling into pandas, but I'm still limited by RAM and have to switch between two APIs within the platform. Supporting Snowpark in Narwhals would allow using the same expressions for both on-demand and backfills, and would keep the platform's internal code much simpler by removing all the conditional checks between Snowpark and pandas.

Please describe the purpose of the new feature or describe the problem to solve.

I've been following the progress in #333 and would love to see Snowpark support added, building upon the PySpark implementation. Snowpark follows the PySpark API, so hopefully it can leverage all the great work done adding PySpark.

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

Ibis promises to do this, but it has high overhead when converting from a native pyarrow table or pandas df to an Ibis table, especially when you have 1000+ columns. It may take 10s just to create one Ibis table, when I need the feature calculations completed in <50ms. Narwhals seems like it may fit, since it uses the supplied native df directly with minimal overhead.

I also briefly tried DuckDB's Spark API, but it too had high overhead when mapping from a large pandas df to a DuckDB dataframe.

Additional information that may help us understand your needs.

No response

@MarcoGorelli
Member

thanks @rwhitten577 for the request

sure, if it follows the pyspark api then it shouldn't be a major lift once we get pyspark in

@MarcoGorelli MarcoGorelli added enhancement New feature or request blocked labels Nov 22, 2024