SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function should be avoided #4636
base: master
This looks really good to me! I have just left some small comments. For the implementation, I think it would be wise to raise only when we know the DataFrame is loaded from a file (`parquet`, `csv`, etc.) and to keep the exceptions in place (filtering, usages in visualization tools) to reduce false positives.
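To make the suggested heuristic concrete, here is a minimal sketch in Python using the standard `ast` module. It is not how the analyzer implements the rule; the reader names in `FILE_READERS` and the function names are assumptions made for illustration, and the check is purely syntactic and flow-insensitive.

```python
import ast

# Assumption: reader methods treated as "loaded from a file" for this sketch.
FILE_READERS = {"csv", "parquet", "json", "orc", "text"}


def _is_file_read(node):
    """Return True for calls shaped like `<expr>.read.<reader>(...)`."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in FILE_READERS
        and isinstance(node.func.value, ast.Attribute)
        and node.func.value.attr == "read"
    )


def find_to_pandas_on_file_reads(source):
    """Return line numbers where `toPandas()` is called directly on a
    variable that was assigned from a file read."""
    tree = ast.parse(source)

    # Pass 1: collect names assigned from a file read, e.g. `df = spark.read.csv(...)`.
    file_backed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and _is_file_read(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    file_backed.add(target.id)

    # Pass 2: report `name.toPandas()` only when `name` is file-backed.
    # A chained call such as `df.filter(...).toPandas()` is not reported,
    # because the receiver is a call expression rather than a plain name.
    lines = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "toPandas"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in file_backed
        ):
            lines.append(node.lineno)
    return sorted(lines)
```

Because only a bare file-backed name is reported, a DataFrame that is filtered before conversion falls outside the check, which matches the filtering exception mentioned above. The visualization exception is not modelled here.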
rules/S7193/python/rule.adoc
Outdated
For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.

If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
This sentence starts a bit weirdly. Maybe `If the conversion...` would work better?
----
# Converting a PySpark DataFrame to a Pandas DataFrame
df = spark.read.csv("my_data.csv")
pandas_df = df.toPandas()  # Noncompliant: May cause memory issues with large datasets
It would be nice to have both examples do the same thing. I think adding the filtering on the Pandas DataFrame here would make sense.
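A sketch of what a matched pair of examples could look like, with both functions producing the same filtered result. The function names, the `age` column, the threshold, and the `header`/`inferSchema` options are assumptions for illustration and are not taken from the rule description.

```python
def load_adults_noncompliant(spark, path):
    """Convert the whole dataset first, then filter in Pandas."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    pandas_df = df.toPandas()  # Noncompliant: every row is collected on the driver
    return pandas_df[pandas_df["age"] > 18]


def load_adults_compliant(spark, path):
    """Filter in Spark first, then convert only the reduced result."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    filtered = df.filter(df["age"] > 18)  # hypothetical column and threshold
    return filtered.toPandas()  # only the filtered rows reach the driver
```

Both functions expect an existing `SparkSession` as `spark`. The difference is only where the filter runs: on the driver after a full conversion, or in Spark before the conversion.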
Quality Gate passed for 'rspec-tools'
Quality Gate passed for 'rspec-frontend'
You can preview this rule here (updated a few minutes after each push).
Review
A dedicated reviewer checked the rule description successfully for: