# Spark Development Plan [OUTDATED]
To find out how to contribute, please join the Slack and drop Edwin Chan a message, or join the #feat-spark-profiling channel!
- Slack
- Working branch
- Overview of design (To view, comment)
- See GitHub Projects for open issues! To get an issue assigned, please speak to Edwin on the Slack channel!
- Oct 2021 - build MVP features: config injection, describe Categorical/TimeStamp
- Oct/Nov 2021 - testing and optimisation
- Early Dec 2021 - release!
- Beta release ASAP (Q1 2022)
Enable pandas-profiling to use a Spark backend in order to profile Spark DataFrames.
- Use native Spark ops - As much as possible, use spark.sql and native Spark DataFrame functions. Do not bring data to the driver: if we could, we could just call `.toPandas()` and apply the usual profile report (see the sketch after this list).
- Do not overload the user's Spark server - Assume that we can only perform read operations on the Spark DataFrame, and no modify/write operations, so as not to inadvertently crash the user's Spark server.
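For illustration, a minimal sketch of the native-ops principle in PySpark - the session setup and the `city` column are made up for this example, not code from the branch:

```python
# Compute value counts with native Spark functions instead of collecting
# the whole dataset to the driver with .toPandas().
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), (None,)], ["city"])

# Anti-pattern: pulls the entire dataset onto the driver.
# counts = df.toPandas()["city"].value_counts()

# Preferred: the aggregation runs on the cluster; only the small
# aggregated result is collected.
counts = df.groupBy("city").count().orderBy(F.desc("count")).collect()
```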
We will try our best to replace every function that operates on a pandas DataFrame with a Spark function operating on a Spark DataFrame, without interfering with the rest of the result flow. This ensures that we retain as much of the config builder, report builder, and visualisation code as possible.
- Build Features
- Configurations
- Column Level Profiling
- Table level
- Optimisations and Testing
- Release, bug fixes, reprioritize tasks
- Pull the spark-branch: https://github.com/pandas-profiling/pandas-profiling/tree/spark-branch.
- There is an `example.py` file in `tests/backends/spark_backend/` that should run.
- If you hit an `import pandas_profiling.utils.str` error, you will need to update to the latest version of spark-branch: https://github.com/pandas-profiling/pandas-profiling/commit/4366508a6d197e88a90f35bb7749447cc44a2bd9
- You will also need to change the sparkmon parameter
  `weburl = "http://<YOUR.IP.ADDRESS>:4040"`
  to your own IP address (e.g. `http://125.64.23.10:4040`) to use the sparkmonitoring tools.
Refer to the design diagram.
## Configurations (See Configurations)
We need to set proper Spark default configurations and inject them at profile time (a sketch follows the table below).
| Description Functions | Progress |
|---|---|
| Config injection | In progress |
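As a sketch, Spark defaults could be injected when the session is built. The specific options below are illustrative choices, not the branch's actual configuration API:

```python
# Set Spark defaults before profiling begins; these option values are
# assumptions chosen for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pandas-profiling-spark")
    # Keep shuffle partitions small so the many small profiling
    # aggregations stay cheap.
    .config("spark.sql.shuffle.partitions", "8")
    # Speed up conversion of the small collected sample via Arrow.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
```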
## Column Level Profiling (See Column Level Profiling)
An explanation on types: the pandas-profiling library uses the visions typing system, which maps Spark types to more generic type objects (e.g. Spark TimestampType -> visions DateTime). The full list of visions types can be found here. This enables us to perform generic operations using visions types, without worrying about how they are implemented. The table below describes the status of profiling for native Spark types and their respective visions mapped type (a sketch of the numeric case follows the table).
| Spark type | Visions type | Description Functions | Progress |
|---|---|---|---|
| All types | Unsupported | describe_counts | done |
| All types | Unsupported | describe_generic | done |
| All types | Unsupported | describe_supported | done |
| Spark numeric types | Numeric | describe_numeric_1d | done |
| TimestampType | DateTime | describe_date_1d | In progress |
| DateType | Nil (not in visions) | Nil - not until supported by visions | Nil |
| String types (StringType, VarcharType, CharType) | Categorical | describe_categorical_1d | In progress |
| Nested types (ArrayType, MapType) | Categorical | describe_categorical_1d | In progress |
| BinaryType | Categorical | describe_categorical_1d | In progress |
| BooleanType | Boolean | nil | done |
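As an illustration of the column-level functions above, a `describe_numeric_1d`-style summary can be computed entirely with native Spark operations. This sketch is an assumption of the general shape, not the branch's implementation:

```python
import pyspark.sql.functions as F

def numeric_summary(df, col):
    """Numeric stats computed on the cluster, never collecting the data."""
    row = df.select(
        F.count(col).alias("count"),   # counts non-null values only
        F.mean(col).alias("mean"),
        F.stddev(col).alias("std"),
        F.min(col).alias("min"),
        F.max(col).alias("max"),
    ).first()
    # Quantiles are approximate, avoiding a full distributed sort.
    q5, q50, q95 = df.approxQuantile(col, [0.05, 0.5, 0.95], 0.01)
    return {**row.asDict(), "5%": q5, "50%": q50, "95%": q95}
```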
## Table Level Profiling (See Table Level Profiling)
| Description Functions | Progress |
|---|---|
| Correlations - Spearman | done |
| Correlations - Pearson | done |
| Correlations - Phi-K | help wanted! |
| Correlations - Cramer's V | help wanted! |
| Correlations - Kendall's | help wanted! |
| Scatterplot | help wanted! |
| Table stats | done |
| Missing - bar | done |
| Missing - dendrogram | help wanted! |
| Missing - heatmap | help wanted! |
| Missing - matrix | help wanted! |
| Sample | done |
| Duplicates | done |
| Alerts | done |
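For reference, the Pearson and Spearman correlations marked done above can be computed natively via `pyspark.ml.stat.Correlation`. The data and column names in this sketch are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 9.0, 8.0)], ["a", "b", "c"]
)

# Pack the numeric columns into a single vector column, skipping rows
# with nulls so the assembler does not error out.
vec = VectorAssembler(
    inputCols=["a", "b", "c"], outputCol="features", handleInvalid="skip"
).transform(df).select("features")

# Correlation.corr returns a one-row DataFrame holding the matrix.
pearson = Correlation.corr(vec, "features", "pearson").first()[0]
spearman = Correlation.corr(vec, "features", "spearman").first()[0]
```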
The multimethod library is the main dispatcher that lets a function decide whether it should use the pandas or the Spark implementation. This allows us to abstract the implementation and underlying engine of a function (Spark functions vs pandas functions) from how it is called (see get_series_descriptions and its respective implementations in pandas and spark).
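A small sketch of how this multimethod-style dispatch works - the function name and bodies below are illustrative, not the branch's actual `get_series_descriptions`:

```python
# The same call dispatches to a pandas or Spark implementation based on
# the argument's type annotation.
import pandas as pd
import pyspark.sql
from multimethod import multimethod

@multimethod
def describe_counts(series: pd.Series) -> dict:
    # pandas engine: everything runs locally.
    return {"n": len(series), "n_missing": int(series.isna().sum())}

@multimethod
def describe_counts(df: pyspark.sql.DataFrame) -> dict:
    # Spark engine: counts are computed on the cluster.
    col = df.columns[0]
    return {
        "n": df.count(),
        "n_missing": df.where(df[col].isNull()).count(),
    }
```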