
SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function should be avoided #4636

Draft
wants to merge 2 commits into master from rule/add-RSPEC-S7193

Conversation

github-actions[bot]
Contributor

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

@guillaume-dequenne-sonarsource guillaume-dequenne-sonarsource changed the title Create rule S7193 SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function should be avoided Jan 30, 2025
@guillaume-dequenne-sonarsource guillaume-dequenne-sonarsource force-pushed the rule/add-RSPEC-S7193 branch 3 times, most recently from c4cfd72 to 6f752eb Compare January 30, 2025 11:17
Contributor

@joke1196 joke1196 left a comment


This looks really good to me! I have just left some small comments. For the implementation, I think it would be wise to only raise when we know the DataFrame is loaded from a file (Parquet, CSV, etc.) and to still keep the exceptions in place (filtering, usage in visualization tools) to remove false positives.


For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.

If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
Contributor


This sentence starts a bit weirdly. Maybe If the conversion... would work better?

----
# Converting a PySpark DataFrame to a Pandas DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("my_data.csv")
pandas_df = df.toPandas()  # Noncompliant: may cause memory issues with large datasets
Contributor


It would be nice to have both examples do the same thing. Adding the filtering on the Pandas DataFrame here would, I think, make sense.


Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube


Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

2 participants