
SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function should be avoided #4636

Draft
wants to merge 2 commits into master from rule/add-RSPEC-S7193

Conversation

github-actions[bot]
Contributor

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

@guillaume-dequenne-sonarsource guillaume-dequenne-sonarsource changed the title Create rule S7193 SONARPY-2496 Create rule S7193 PySpark DataFrame toPandas function should be avoided Jan 30, 2025
@guillaume-dequenne-sonarsource guillaume-dequenne-sonarsource force-pushed the rule/add-RSPEC-S7193 branch 3 times, most recently from c4cfd72 to 6f752eb Compare January 30, 2025 11:17
Contributor

@joke1196 joke1196 left a comment


This looks really good to me! I have just left some small comments. For the implementation, I think it would be wise to only raise when we know the DataFrame is loaded from a file (Parquet, CSV, etc.) and to still keep the exceptions in place (filtering, usage in visualization tools) to remove false positives.


For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.

If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
Contributor


This sentence starts a bit weirdly. Maybe If the conversion... would work better?

----
# Converting a PySpark DataFrame to a Pandas DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("my_data.csv")
pandas_df = df.toPandas()  # Noncompliant: may cause memory issues with large datasets
Contributor


It would be nice to have both examples do the same thing. Adding the filtering on the Pandas DataFrame here would, I think, make sense.


Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube


Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

2 participants