# Spark Development Plan [OUTDATED]
To find out how to contribute, please join the Slack and drop Edwin Chan a message, or join the #feat-spark-profiling channel!
- Slack
- Working branch
- Overview of design (To view, comment)
- See GitHub Projects for open issues! To get an issue assigned, please speak to Edwin on the Slack channel!
- Oct 2021 - build MVP features: config injection, describe Categorical/TimeStamp
- Oct/Nov 2021 - testing and optimisation
- Early Dec 2021 - release!
- Beta release ASAP (Q1 2022)
Enable pandas-profiling to use a Spark backend in order to profile Spark DataFrames.
- Use native Spark ops - As much as possible, use spark.sql and native Spark DataFrame functions. Do not bring data to the driver: if we could, we could just call `.toPandas()` and apply the usual profile report (see the sketch after this list).
- Do not overload the user's Spark server - Assume that we can only perform read operations on the Spark DataFrame, and no modify/write operations, so as not to inadvertently crash the user's Spark server.
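For illustration, a minimal sketch of the native-ops principle in PySpark - the session setup and the `city` column are made up for this example, not code from the branch:

```python
# Compute value counts with native Spark functions instead of collecting
# the whole dataset to the driver with .toPandas().
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), (None,)], ["city"])

# Anti-pattern: pulls the entire dataset onto the driver.
# counts = df.toPandas()["city"].value_counts()

# Preferred: the aggregation runs on the cluster; only the small
# aggregated result is collected.
counts = df.groupBy("city").count().orderBy(F.desc("count")).collect()
```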
We will try our best to replace every function that operates on a pandas DataFrame with a Spark function operating on a Spark DataFrame, without interfering with the rest of the result flow. This ensures that we retain as much of the config builder, report builder, and visualisation code as possible.
- Build Features
- Configurations
- Column Level Profiling
- Table level
- Optimisations and Testing
- Release, bug fixes, reprioritize tasks
- Pull the spark-branch: https://github.com/pandas-profiling/pandas-profiling/tree/spark-branch.
- There is an `example.py` file in `tests/backends/spark_backend/` that should run.
- If you hit an `import pandas_profiling.utils.str` error, you will need to update to the latest version of spark-branch: https://github.com/pandas-profiling/pandas-profiling/commit/4366508a6d197e88a90f35bb7749447cc44a2bd9
- You will also need to change the sparkmon parameter
  `weburl = "http://<YOUR.IP.ADDRESS>:4040"`
  to your own IP address (e.g. `http://125.64.23.10:4040`) to use the sparkmonitoring tools.
Refer to the design diagram.
## Configurations (See Configurations)
We need to set proper Spark default configurations and inject them at profile time (a sketch follows the table below).
| Description Functions | Progress |
|---|---|
| Config injection | In progress |
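As a sketch, Spark defaults could be injected when the session is built. The specific options below are illustrative choices, not the branch's actual configuration API:

```python
# Set Spark defaults before profiling begins; these option values are
# assumptions chosen for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pandas-profiling-spark")
    # Keep shuffle partitions small so the many small profiling
    # aggregations stay cheap.
    .config("spark.sql.shuffle.partitions", "8")
    # Speed up conversion of the small collected sample via Arrow.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
```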
## Column Level Profiling (See Column Level Profiling)
An explanation on types: the pandas-profiling library uses the visions typing system, which maps Spark types to more generic type objects (e.g. Spark TimestampType -> visions DateTime). The full list of visions types can be found here. This enables us to perform generic operations using visions types, without worrying about how they are implemented. The table below describes the status of profiling for native Spark types and their respective visions mapped type (a sketch of the numeric case follows the table).
| Spark type | Visions type | Description Functions | Progress |
|---|---|---|---|
| All types | Unsupported | describe_counts | done |
| All types | Unsupported | describe_generic | done |
| All types | Unsupported | describe_supported | done |
| Spark numeric types | Numeric | describe_numeric_1d | done |
| TimestampType | DateTime | describe_date_1d | In progress |
| DateType | Nil (not in visions) | Nil - not until supported by visions | Nil |
| String types (StringType, VarcharType, CharType) | Categorical | describe_categorical_1d | In progress |
| Nested types (ArrayType, MapType) | Categorical | describe_categorical_1d | In progress |
| BinaryType | Categorical | describe_categorical_1d | In progress |
| BooleanType | Boolean | nil | done |
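As an illustration of the column-level functions above, a `describe_numeric_1d`-style summary can be computed entirely with native Spark operations. This sketch is an assumption of the general shape, not the branch's implementation:

```python
import pyspark.sql.functions as F

def numeric_summary(df, col):
    """Numeric stats computed on the cluster, never collecting the data."""
    row = df.select(
        F.count(col).alias("count"),   # counts non-null values only
        F.mean(col).alias("mean"),
        F.stddev(col).alias("std"),
        F.min(col).alias("min"),
        F.max(col).alias("max"),
    ).first()
    # Quantiles are approximate, avoiding a full distributed sort.
    q5, q50, q95 = df.approxQuantile(col, [0.05, 0.5, 0.95], 0.01)
    return {**row.asDict(), "5%": q5, "50%": q50, "95%": q95}
```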
## Table Level Profiling (See Table Level Profiling)
| Description Functions | Progress |
|---|---|
| Correlations - Spearman | done |
| Correlations - Pearson | done |
| Correlations - Phi-K | help wanted! |
| Correlations - Cramer's V | help wanted! |
| Correlations - Kendall's | help wanted! |
| Scatterplot | help wanted! |
| Table stats | done |
| Missing - bar | done |
| Missing - dendrogram | help wanted! |
| Missing - heatmap | help wanted! |
| Missing - matrix | help wanted! |
| Sample | done |
| Duplicates | done |
| Alerts | done |
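For reference, the Pearson and Spearman correlations marked done above can be computed natively via `pyspark.ml.stat.Correlation`. The data and column names in this sketch are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 9.0, 8.0)], ["a", "b", "c"]
)

# Pack the numeric columns into a single vector column, skipping rows
# with nulls so the assembler does not error out.
vec = VectorAssembler(
    inputCols=["a", "b", "c"], outputCol="features", handleInvalid="skip"
).transform(df).select("features")

# Correlation.corr returns a one-row DataFrame holding the matrix.
pearson = Correlation.corr(vec, "features", "pearson").first()[0]
spearman = Correlation.corr(vec, "features", "spearman").first()[0]
```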
The multimethod library is the main dispatcher that lets a function decide whether it should use the pandas or the Spark implementation. This allows us to abstract the implementation and underlying engine of a function (Spark functions vs pandas functions) from how it is called (see get_series_descriptions and its respective implementations in pandas and spark).
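A small sketch of how this multimethod-style dispatch works - the function name and bodies below are illustrative, not the branch's actual `get_series_descriptions`:

```python
# The same call dispatches to a pandas or Spark implementation based on
# the argument's type annotation.
import pandas as pd
import pyspark.sql
from multimethod import multimethod

@multimethod
def describe_counts(series: pd.Series) -> dict:
    # pandas engine: everything runs locally.
    return {"n": len(series), "n_missing": int(series.isna().sum())}

@multimethod
def describe_counts(df: pyspark.sql.DataFrame) -> dict:
    # Spark engine: counts are computed on the cluster.
    col = df.columns[0]
    return {
        "n": df.count(),
        "n_missing": df.where(df[col].isNull()).count(),
    }
```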