Merge pull request #568 from pandas-profiling/develop

Release v2.9.0
ydataai · Sep 3, 2020 · 7eaea00
2 parents 4cc48f1 + c226d4a
Showing 41 changed files with 929 additions and 56,039 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -28,12 +28,14 @@ env:
   - TEST=issue PANDAS="<1"
   - TEST=console PANDAS="<1"
   - TEST=examples PANDAS="<1"
-  - TEST=unit PANDAS=">=1"
-  - TEST=issue PANDAS=">=1"
-  - TEST=console PANDAS=">=1"
-  - TEST=examples PANDAS=">=1"
-  - TEST=lint PANDAS=">=1"
-  - TEST=typing PANDAS=">=1"
+  - TEST=unit PANDAS="==1.0.5"
+  - TEST=issue PANDAS="==1.0.5"
+  - TEST=unit PANDAS=">=1.1"
+  - TEST=issue PANDAS=">=1.1"
+  - TEST=console PANDAS=">=1.1"
+  - TEST=examples PANDAS=">=1.1"
+  - TEST=lint PANDAS=">=1.1"
+  - TEST=typing PANDAS=">=1.1"
 
 before_install:
   - pip install --upgrade pip setuptools wheel

diff --git a/README.md b/README.md
@@ -27,51 +27,34 @@ For each column the following statistics - if relevant for the column type - are
 
 ## Announcements
 
-### Version v2.8.0 released
+### Version v2.9.0 released
 
-News for users working with image datasets: ``pandas-profiling`` now has build-in supports for Files and Images.
-Moreover, the text analysis features have also been reworked, providing more informative statistics.
+The release candidate for v2.9.0 has been out for a while, and v2.9.0 is now finally released. See the changelog below for what has changed.
 
-For a better feel, have a look at the [examples](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/examples.html#showcasing-specific-features) section in the docs or read the changelog for a complete view of the changes.
+### Spark backend in progress
 
-### Version v2.7.0 released
+We are happy to announce that we're working on a Spark backend for generating profile reports.
+Stay tuned.
 
-#### Performance
-
-There were several performance regressions pointed out to me recently when comparing 1.4.1 to 2.6.0.
-To that end, we benchmarked the code and found several minor features introducing disproportionate computational complexity.
-Version 2.7.0 optimizes these, giving significant performance improvements!
-Moreover, the default configuration is tweaked for towards the needs of the average user.
-
-#### Phased builds and lazy loading
-
-A report is built in phases, which allows for new exciting features such as caching, only re-rendering partial reports and lazily computing the report.
-Moreover, the progress bar provides more information on the building phase and step.
-
-#### Documentation
-
-This version introduces [more elaborate documentation](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/index.html) powered by Sphinx. The previously used pdoc3 has been adequate initially, however misses functionality and extensibility. Several recurring topics are now documented, for instance the configuration parameters are documented and there are pages on big datasets, sensitive data, integrations and resources.
-
-#### Support `pandas-profiling`
+### Support `pandas-profiling`
 
 The development of ``pandas-profiling`` relies completely on contributions.
 If you find value in the package, we welcome you to support the project through [GitHub Sponsors](https://github.com/sponsors/sbrugman)!
 It's extra exciting that GitHub **matches your contribution** for the first year.
 
 Find more information here:
 
- - [Changelog v2.7.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-7-0)
- - [Changelog v2.8.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-8-0)
+ - [Changelog v2.9.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-9-0)
  - [Sponsor the project on GitHub](https://github.com/sponsors/sbrugman)
 
- *May 7, 2020 💘*
+ *September 2, 2020 💘*
 
 ---
 
 _Contents:_ **[Examples](#examples)** |
 **[Installation](#installation)** | **[Documentation](#documentation)** |
 **[Large datasets](#large-datasets)** | **[Command line usage](#command-line-usage)** |
-**[Advanced usage](#advanced-usage)** |
+**[Advanced usage](#advanced-usage)** | **[Support](#supporting-open-source)** |
 **[Types](#types)** | **[How to contribute](#contributing)** |
 **[Editor Integration](#editor-integration)** | **[Dependencies](#dependencies)**
 
@@ -97,7 +80,7 @@ Specific features:
 * [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)
 
 Tutorials:
-* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
+* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/tutorials/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Ftutorials%2Fmodify_report_structure.ipynb)
 
 
 ## Installation
@@ -237,19 +220,36 @@ profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram':
 profile.to_file("output.html")
 ```
 
+## Supporting open source
+
+Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.
+
+<table>
+<tr>
+<td>
+
+<img alt="Lambda Labs" src="https://pandas-profiling.github.io/pandas-profiling/docs/assets/lambda-labs.png" width="500" />
+
+</td>
+<td>
+
+[Lambda workstations](https://lambdalabs.com/), servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. [Lambda Cloud](https://lambdalabs.com/service/gpu-cloud) offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.
+
+</td>
+</tr>
+</table>
+
+We would like to thank our generous GitHub Sponsors supporters who make pandas-profiling possible:
+
+    Martin Sotir, Joseph Yuen, Brian Lee, Stephanie Rivera, nscsekhar, abdulAziz
+
+More info if you would like to appear here: [GitHub Sponsors page](https://github.com/sponsors/sbrugman)
+
+
 ## Types
 
 Types are a powerful abstraction for effective data analysis, that goes beyond the logical data types (integer, float etc.).
-`pandas-profiling` currently recognizes the following types:
-
-- Boolean
-- Numerical
-- Date
-- Categorical
-- URL
-- Path
-- File
-- Image
+`pandas-profiling` currently recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_.
 
 We have developed a type system for Python, tailored for data analysis: [visions](https://github.com/dylan-profiler/visions).
 Selecting the right typeset drastically reduces the complexity of the code for your analysis.

diff --git a/docsrc/assets/lambda-labs.png b/docsrc/assets/lambda-labs.png
diff --git a/docsrc/source/_static/streamlit-integration.gif b/docsrc/source/_static/streamlit-integration.gif
diff --git a/docsrc/source/pages/changelog.rst b/docsrc/source/pages/changelog.rst
@@ -2,6 +2,8 @@
 Changelog
 =========
 
+.. include:: changelog/v2_9_0.rst
+
 .. include:: changelog/v2_9_0rc1.rst
 
 .. include:: changelog/v2_8_0.rst

diff --git a/docsrc/source/pages/changelog/v2_9_0.rst b/docsrc/source/pages/changelog/v2_9_0.rst
@@ -0,0 +1,24 @@
+Changelog v2.9.0
+----------------
+
+🎉 Features
+^^^^^^^^^^^
+- Description per variable now possible (see the metadata page or the Census example).
+
+🐛 Bug fixes
+^^^^^^^^^^^^
+- Fixed bug for small DataFrames with unused categories.
+- Fixed bug where parallelization would have side effects.
+- Removed warning where colormap was modified in place.
+- Distinguish between unique and distinct correctly.
+
+📖 Documentation
+^^^^^^^^^^^^^^^^
+- Extended documentation for frequent issues.
+- Extended documentation for Streamlit and Panel.
+- Provide visibility to our supporters.
+
+⬆️ Dependencies
+^^^^^^^^^^^^^^^^^^
+- Pandas 1.1.0 contains bugs that make it incompatible. Please up- or downgrade.
+- Upgraded visions to 0.5.0.
diff --git a/docsrc/source/pages/installation.rst b/docsrc/source/pages/installation.rst
@@ -57,7 +57,7 @@ This creates a new conda environment containing the module.
 
 .. hint::
 
-        Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the main channel.
+        Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the main channel. See `frequent issues <Support.rst#frequent-issues>`_.
 
 Jupyter notebook/lab
 --------------------

diff --git a/docsrc/source/pages/integrations.rst b/docsrc/source/pages/integrations.rst
@@ -101,11 +101,37 @@ Ensure to install ``pyqt5``. Via pip use the extras ``app``:
 
   pip install pandas-profiling[app]
 
+Streamlit
+~~~~~~~~~
 
-Streamlit / Panel
-~~~~~~~~~~~~~~~~~
+`Streamlit <https://www.streamlit.io>`_ is an open-source Python library made to build web-apps for machine learning and data science.
 
-For more information of how to use ``pandas-profiling`` with Streamlit or Panel, see the https://github.com/streamlit/streamlit/issues/693 and https://github.com/pandas-profiling/pandas-profiling/issues/491.
+.. image:: ../_static/streamlit-integration.gif
+
+.. code-block:: python
+
+  import pandas as pd
+  import pandas_profiling
+  import streamlit as st
+  from streamlit_pandas_profiling import st_profile_report
+
+  df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
+  pr = df.profile_report()
+
+  st.title("Pandas Profiling in Streamlit")
+  st.write(df)
+  st_profile_report(pr)
+
+You can install this `Pandas Profiling component <https://github.com/Ghasel/streamlit-pandas-profiling>`_ for Streamlit with pip:
+
+.. code-block:: console
+
+  pip install streamlit-pandas-profiling
+
+Panel
+~~~~~
+
+For more information on how to use ``pandas-profiling`` in Panel, see https://github.com/pandas-profiling/pandas-profiling/issues/491 and the Pandas Profiling example at https://awesome-panel.org.
 
 Cloud Integrations
 ------------------
@@ -133,12 +159,14 @@ Kaggle
 
 Pipeline Integrations
 ---------------------
-With the Python, Command-line and Jupyter interfaces, `pandas-profiling` integrates seamlessly with DAG execution tools as Airflow, dagser, Kedro, prefect and any other you can think of.
 
-Integration with `dagser <https://github.com/dagster-io/dagster>`_ or `prefect <https://github.com/prefecthq/prefect>`_ can be achieved in a similar way as Airflow.
+With Python, command-line and Jupyter interfaces, `pandas-profiling` integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro and Prefect.
+
+Integration with `Dagster <https://github.com/dagster-io/dagster>`_ or `Prefect <https://github.com/prefecthq/prefect>`_ can be achieved in a similar way as with Airflow.
 
 Airflow
 ~~~~~~~
+
 Integration with Airflow can be easily achieved through the `BashOperator <https://airflow.apache.org/docs/stable/_api/airflow/operators/bash_operator/index.html>`_ or the `PythonOperator <https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html#airflow.operators.python_operator.PythonOperator>`_.
 
 .. code-block:: python

diff --git a/docsrc/source/pages/metadata.rst b/docsrc/source/pages/metadata.rst
@@ -2,6 +2,8 @@
 Metadata
 ========
 
+Dataset metadata
+----------------
 When sharing reports with coworkers or publishing online, you might want to include metadata of the dataset, such as author, copyright holder or a description. The supported properties are inspired by `https://schema.org/Dataset <https://schema.org/Dataset>`_. Currently supported are: "description", "creator", "author", "url", "copyright_year", "copyright_holder".
 
 The following example generates a report with a "description", "copyright_holder" and "copyright_year", "creator" and "url".
@@ -19,3 +21,59 @@ You can find these properties in the "Overview" section under the "About" tab.
         ),
     )
     report.to_file(Path("stata_auto_report.html"))
+
+Descriptions per variable
+-------------------------
+In addition to providing dataset details, users often want to include column-specific descriptions when sharing reports with team members and stakeholders. This section provides two code examples of how to do this in pandas-profiling.
+
+.. code-block:: python
+        :caption: Generate a report with descriptions per variable
+
+        profile = df.profile_report(
+                variables={
+                        'descriptions':
+                        {
+                              'files': 'Files in the filesystem',
+                              'datec': 'Creation date',
+                              'datem': 'Modification date',
+                        }
+                }
+        )
+
+        profile.to_file("report.html")
+
+
+This alternative example demonstrates how you could load the definitions from a JSON file.
+By default, the descriptions are presented in the overview tab and next to each variable.
+
+.. code-block:: json
+     :caption: dataset_column_definition.json
+
+        {
+            "column name 1": "column 1 definition",
+            "column name 2": "column 2 definition"
+        }
+
+.. code-block:: python
+        :caption: Generate a report with descriptions per variable from a definitions file
+
+        import json
+        import pandas as pd
+        import pandas_profiling
+
+        definition_file = 'dataset_column_definition.json'
+
+        # Read the variable descriptions
+        with open(definition_file, 'r') as f:
+            definitions = json.load(f)
+
+        # By default, the descriptions are presented in the overview tab and next to each variable
+        report = df.profile_report(variables=dict(descriptions=definitions))
+
+        # We can disable showing the descriptions next to each variable
+        report = df.profile_report(
+                variables=dict(descriptions=definitions),
+                show_variable_description=False
+        )
+
+        report.to_file('report.html')
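Note: a definitions file only helps if its keys match the DataFrame's column names. Below is a minimal sketch of a sanity check that could run before building the report. `check_definitions` is a hypothetical helper for illustration, not part of pandas-profiling, and the `size` and `typo_column` names are made up. It works on plain column names, so it runs without pandas.

```python
from typing import Dict, Iterable, List, Tuple


def check_definitions(
    columns: Iterable[str], definitions: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (columns without a description, descriptions without a column)."""
    column_set = set(columns)
    undocumented = sorted(column_set - set(definitions))
    unknown = sorted(set(definitions) - column_set)
    return undocumented, unknown


# "files", "datec" and "datem" come from the first example above;
# "size" and "typo_column" are invented to trigger both kinds of mismatch.
columns = ["files", "datec", "datem", "size"]
definitions = {
    "files": "Files in the filesystem",
    "datec": "Creation date",
    "datem": "Modification date",
    "typo_column": "Not present in the data",
}

undocumented, unknown = check_definitions(columns, definitions)
print(undocumented)  # ['size']
print(unknown)       # ['typo_column']
```

With a real DataFrame, `df.columns` could be passed as `columns`.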
diff --git a/docsrc/source/pages/support.rst b/docsrc/source/pages/support.rst
@@ -10,9 +10,20 @@ First, we need to know whether a problem is actually a bug in the code, or that
 Frequent issues
 ~~~~~~~~~~~~~~~
 
-- This thread discusses `conda installing older versions <https://github.com/conda-forge/pandas-profiling-feedstock/issues/22>`_ of the package.
+Conda install defaults to v1.4.1
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-- When in a Jupyter environment, you see some text, such as ``IntSlider(value=0)`` or interactive ``(children=(IntSlider(value=0, description='x', max=1), Output()), _dom_classes=('widget-interact',))``, then the Jupyter Widgets are not activated. The :doc:`installation` page contains instructions on how to resolve this problem.
+Some users experience that ``conda install -c conda-forge pandas-profiling`` defaults to 1.4.1.
+
+More details can be found `here <https://github.com/conda-forge/pandas-profiling-feedstock/issues/22>`__, `here <https://github.com/pandas-profiling/pandas-profiling/issues/448>`__ and `here <https://github.com/pandas-profiling/pandas-profiling/issues/563>`__.
+
+Jupyter "IntSlider(value=0)"
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If, in a Jupyter environment, you see text such as ``IntSlider(value=0)`` or ``interactive(children=(IntSlider(value=0, description='x', max=1), Output()), _dom_classes=('widget-interact',))``, then the Jupyter Widgets are not activated. The :doc:`installation` page contains instructions on how to resolve this problem.
+
+
+Help on Stack Overflow
+----------------------
 
 Users with a request for help on how to use `pandas-profiling` should consider asking their question on Stack Overflow. There is a specific tag for `pandas-profiling`:
 

diff --git a/examples/census/census.py b/examples/census/census.py
@@ -1,3 +1,4 @@
+import json
 from pathlib import Path
 
 import numpy as np
@@ -38,5 +39,24 @@
     # Prepare missing values
     df = df.replace("\\?", np.nan, regex=True)
 
+    # Initialize the report
     profile = ProfileReport(df, title="Census Dataset", explorative=True)
+
+    # Show the column definitions
+    definitions = json.load(open("census_column_definition.json"))
+    profile.set_variable(
+        "dataset",
+        {
+            "description": 'Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)). Prediction task is to determine whether a person makes over 50K a year.',
+            "copyright_year": "1996",
+            "author": "Ronny Kohavi and Barry Becker",
+            "creator": "Barry Becker",
+            "url": "https://archive.ics.uci.edu/ml/datasets/adult",
+        },
+    )
+    profile.set_variable("variables.descriptions", definitions)
+
+    # Only show the descriptions in the overview
+    profile.set_variable("show_variable_description", False)
+
     profile.to_file(Path("./census_report.html"))
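Note: `set_variable` above addresses a nested setting with a dotted key (`"variables.descriptions"`). As a rough illustration of that idea only, the sketch below walks a plain nested dict. `set_nested` is a hypothetical helper and is not how pandas-profiling implements its configuration.

```python
from typing import Any, Dict


def set_nested(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b", creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Descend into the existing level, or create an empty one.
        node = node.setdefault(key, {})
    node[leaf] = value


config: Dict[str, Any] = {}
set_nested(config, "variables.descriptions", {"age": "definition 0"})
set_nested(config, "show_variable_description", False)
print(config)
```

The two calls mirror the last two `set_variable` calls in `census.py`: one writes a nested key, the other a top-level key.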
diff --git a/examples/census/census_column_definition.json b/examples/census/census_column_definition.json
@@ -0,0 +1,16 @@
+{
+    "age": "definition 0",
+    "workclass": "definition 1",
+    "fnlwgt": "definition 2",
+    "education": "definition 3",
+    "education-num": "definition 4",
+    "marital-status": "definition 5",
+    "occupation": "definition 6",
+    "relationship": "definition 7",
+    "race": "definition 8",
+    "sex": "definition 9",
+    "capital-gain": "definition 10",
+    "capital-loss": "definition 11",
+    "hours-per-week": "definition 12",
+    "native-country": "definition 13"
+}
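Note: the definitions in this file are placeholders numbered in column order. The sketch below shows one way such a placeholder file could be generated from a list of column names. This is an assumption about intent, since the commit does not show how the file was produced.

```python
import json

# Column names copied from census_column_definition.json above.
columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
]

# Number each placeholder by the position of its column.
definitions = {name: f"definition {index}" for index, name in enumerate(columns)}

# Serialize with the same 4-space indentation as the committed file.
text = json.dumps(definitions, indent=4)
print(text)
```

Replacing each placeholder with a real description makes the report's overview tab more useful to readers.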