Merge pull request #568 from pandas-profiling/develop
Release v2.9.0
sbrugman authored Sep 3, 2020
2 parents 4cc48f1 + c226d4a commit 7eaea00
Showing 41 changed files with 929 additions and 56,039 deletions.
14 changes: 8 additions & 6 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -28,12 +28,14 @@ env:
- TEST=issue PANDAS="<1"
- TEST=console PANDAS="<1"
- TEST=examples PANDAS="<1"
- TEST=unit PANDAS=">=1"
- TEST=issue PANDAS=">=1"
- TEST=console PANDAS=">=1"
- TEST=examples PANDAS=">=1"
- TEST=lint PANDAS=">=1"
- TEST=typing PANDAS=">=1"
- TEST=unit PANDAS="==1.0.5"
- TEST=issue PANDAS="==1.0.5"
- TEST=unit PANDAS=">=1.1"
- TEST=issue PANDAS=">=1.1"
- TEST=console PANDAS=">=1.1"
- TEST=examples PANDAS=">=1.1"
- TEST=lint PANDAS=">=1.1"
- TEST=typing PANDAS=">=1.1"

before_install:
- pip install --upgrade pip setuptools wheel
74 changes: 37 additions & 37 deletions README.md
@@ -27,51 +27,34 @@ For each column the following statistics - if relevant for the column type - are

## Announcements

### Version v2.8.0 released
### Version v2.9.0 released

News for users working with image datasets: ``pandas-profiling`` now has build-in supports for Files and Images.
Moreover, the text analysis features have also been reworked, providing more informative statistics.
The release candidate for v2.9.0 has been out for a while; v2.9.0 is now finally released. See the changelog below for what has changed.

For a better feel, have a look at the [examples](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/examples.html#showcasing-specific-features) section in the docs or read the changelog for a complete view of the changes.
### Spark backend in progress

### Version v2.7.0 released
We are happy to announce that we're working on a Spark backend for generating profile reports.
Stay tuned.

#### Performance

There were several performance regressions pointed out to me recently when comparing 1.4.1 to 2.6.0.
To that end, we benchmarked the code and found several minor features introducing disproportionate computational complexity.
Version 2.7.0 optimizes these, giving significant performance improvements!
Moreover, the default configuration is tweaked for towards the needs of the average user.

#### Phased builds and lazy loading

A report is built in phases, which allows for new exciting features such as caching, only re-rendering partial reports and lazily computing the report.
Moreover, the progress bar provides more information on the building phase and step.

#### Documentation

This version introduces [more elaborate documentation](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/index.html) powered by Sphinx. The previously used pdoc3 has been adequate initially, however misses functionality and extensibility. Several recurring topics are now documented, for instance the configuration parameters are documented and there are pages on big datasets, sensitive data, integrations and resources.

#### Support `pandas-profiling`
### Support `pandas-profiling`

The development of ``pandas-profiling`` relies completely on contributions.
If you find value in the package, we welcome you to support the project through [GitHub Sponsors](https://github.com/sponsors/sbrugman)!
It's extra exciting that GitHub **matches your contribution** for the first year.

Find more information here:

- [Changelog v2.7.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-7-0)
- [Changelog v2.8.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-8-0)
- [Changelog v2.9.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-9-0)
- [Sponsor the project on GitHub](https://github.com/sponsors/sbrugman)

*May 7, 2020 💘*
*September 2, 2020 💘*

---

_Contents:_ **[Examples](#examples)** |
**[Installation](#installation)** | **[Documentation](#documentation)** |
**[Large datasets](#large-datasets)** | **[Command line usage](#command-line-usage)** |
**[Advanced usage](#advanced-usage)** |
**[Advanced usage](#advanced-usage)** | **[Support](#supporting-open-source)** |
**[Types](#types)** | **[How to contribute](#contributing)** |
**[Editor Integration](#editor-integration)** | **[Dependencies](#dependencies)**

@@ -97,7 +80,7 @@ Specific features:
* [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)

Tutorials:
* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/tutorials/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Ftutorials%2Fmodify_report_structure.ipynb)


## Installation
@@ -237,19 +220,36 @@ profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram':
profile.to_file("output.html")
```

## Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.

<table>
<tr>
<td>

<img alt="Lambda Labs" src="https://pandas-profiling.github.io/pandas-profiling/docs/assets/lambda-labs.png" width="500" />

</td>
<td>

[Lambda workstations](https://lambdalabs.com/), servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. [Lambda Cloud](https://lambdalabs.com/service/gpu-cloud) offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

</td>
</tr>
</table>

We would like to thank our generous GitHub Sponsors supporters who make pandas-profiling possible:

Martin Sotir, Joseph Yuen, Brian Lee, Stephanie Rivera, nscsekhar, abdulAziz

More info if you would like to appear here: [GitHub Sponsors page](https://github.com/sponsors/sbrugman)


## Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.).
`pandas-profiling` currently recognizes the following types:

- Boolean
- Numerical
- Date
- Categorical
- URL
- Path
- File
- Image
`pandas-profiling` currently recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_.

We have developed a type system for Python, tailored for data analysis: [visions](https://github.com/dylan-profiler/visions).
Selecting the right typeset drastically reduces the complexity of the code of your analysis.
Binary file added docsrc/assets/lambda-labs.png
Binary file added docsrc/source/_static/streamlit-integration.gif
2 changes: 2 additions & 0 deletions docsrc/source/pages/changelog.rst
@@ -2,6 +2,8 @@
Changelog
=========

.. include:: changelog/v2_9_0.rst

.. include:: changelog/v2_9_0rc1.rst

.. include:: changelog/v2_8_0.rst
24 changes: 24 additions & 0 deletions docsrc/source/pages/changelog/v2_9_0.rst
@@ -0,0 +1,24 @@
Changelog v2.9.0
----------------

🎉 Features
^^^^^^^^^^^
- Descriptions per variable are now possible (see the metadata page or the Census example).

🐛 Bug fixes
^^^^^^^^^^^^
- Fixed bug for small DataFrames with unused categories.
- Fixed bug where parallelization would have side effects.
- Removed warning where colormap was modified in place.
- Distinguish between unique and distinct correctly.

📖 Documentation
^^^^^^^^^^^^^^^^
- Extended documentation for frequent issues.
- Extended documentation for Streamlit and Panel.
- Provided visibility to our supporters.

⬆️ Dependencies
^^^^^^^^^^^^^^^^^^
- Pandas 1.1.0 contains bugs that make it incompatible. Please up- or downgrade.
- Upgraded visions to 0.5.0.
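The pandas note above can be turned into a small guard that runs before profiling. The following is a minimal sketch and not part of pandas-profiling: `parse_release` and `is_supported_pandas` are hypothetical helpers, and the only fact they encode is the changelog's statement that pandas 1.1.0 is incompatible.

```python
def parse_release(version):
    """Turn a version string such as "1.0.5" into a tuple of three ints.

    Only the leading digits of each of the first three segments are used,
    so "1.1.0rc0" is read as (1, 1, 0). Missing segments count as 0.
    This is a simplification for illustration, not a full version parser.
    """
    numbers = []
    for segment in version.split(".")[:3]:
        digits = ""
        for char in segment:
            if not char.isdigit():
                break
            digits += char
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_supported_pandas(version):
    """Hypothetical check: reject only pandas 1.1.0, as the changelog advises."""
    return parse_release(version) != (1, 1, 0)


if __name__ == "__main__":
    for candidate in ("1.0.5", "1.1.0", "1.1.1"):
        status = "ok" if is_supported_pandas(candidate) else "up- or downgrade"
        print(f"pandas {candidate}: {status}")
```

In practice the string passed in would be `pandas.__version__`; the sketch takes a plain string so that it runs without pandas installed.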
2 changes: 1 addition & 1 deletion docsrc/source/pages/installation.rst
@@ -57,7 +57,7 @@ This creates a new conda environment containing the module.

.. hint::

Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the main channel.
Don't forget to specify the ``conda-forge`` channel. Omitting it won't result in an error, as an outdated package lives on the main channel. See `frequent issues <Support.rst#frequent-issues>`_.

Jupyter notebook/lab
--------------------
38 changes: 33 additions & 5 deletions docsrc/source/pages/integrations.rst
@@ -101,11 +101,37 @@ Ensure to install ``pyqt5``. Via pip use the extras ``app``:
pip install pandas-profiling[app]
Streamlit
~~~~~~~~~

Streamlit / Panel
~~~~~~~~~~~~~~~~~
`Streamlit <https://www.streamlit.io>`_ is an open-source Python library made to build web-apps for machine learning and data science.

For more information of how to use ``pandas-profiling`` with Streamlit or Panel, see the https://github.com/streamlit/streamlit/issues/693 and https://github.com/pandas-profiling/pandas-profiling/issues/491.
.. image:: ../_static/streamlit-integration.gif

.. code-block:: python

    import pandas as pd
    # Importing pandas_profiling registers ``profile_report`` on DataFrames
    import pandas_profiling
    import streamlit as st
    from streamlit_pandas_profiling import st_profile_report

    df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
    pr = df.profile_report()

    st.title("Pandas Profiling in Streamlit")
    st.write(df)
    st_profile_report(pr)

You can install this `Pandas Profiling component <https://github.com/Ghasel/streamlit-pandas-profiling>`_ for Streamlit with pip:

.. code-block:: console

    pip install streamlit-pandas-profiling

Panel
~~~~~

For more information on how to use ``pandas-profiling`` in Panel, see https://github.com/pandas-profiling/pandas-profiling/issues/491 and the Pandas Profiling example at https://awesome-panel.org.

Cloud Integrations
------------------
@@ -133,12 +159,14 @@ Kaggle

Pipeline Integrations
---------------------
With the Python, Command-line and Jupyter interfaces, `pandas-profiling` integrates seamlessly with DAG execution tools as Airflow, dagser, Kedro, prefect and any other you can think of.

Integration with `dagser <https://github.com/dagster-io/dagster>`_ or `prefect <https://github.com/prefecthq/prefect>`_ can be achieved in a similar way as Airflow.
With Python, command-line and Jupyter interfaces, `pandas-profiling` integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro and Prefect.

Integration with `Dagster <https://github.com/dagster-io/dagster>`_ or `Prefect <https://github.com/prefecthq/prefect>`_ can be achieved in a similar way as with Airflow.

Airflow
~~~~~~~

Integration with Airflow can be easily achieved through the `BashOperator <https://airflow.apache.org/docs/stable/_api/airflow/operators/bash_operator/index.html>`_ or the `PythonOperator <https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html#airflow.operators.python_operator.PythonOperator>`_.

.. code-block:: python
58 changes: 58 additions & 0 deletions docsrc/source/pages/metadata.rst
@@ -2,6 +2,8 @@
Metadata
========

Dataset metadata
----------------
When sharing reports with coworkers or publishing online, you might want to include metadata of the dataset, such as author, copyright holder or a description. The supported properties are inspired by `https://schema.org/Dataset <https://schema.org/Dataset>`_. Currently supported are: "description", "creator", "author", "url", "copyright_year", "copyright_holder".

The following example generates a report with a "description", "copyright_holder" and "copyright_year", "creator" and "url".
@@ -19,3 +21,59 @@ You can find these properties in the "Overview" section under the "About" tab.
),
)
report.to_file(Path("stata_auto_report.html"))
Descriptions per variable
-------------------------
In addition to providing dataset details, users often would like to include column-specific descriptions when sharing reports with team members and stakeholders. This section provides two code examples of how to do this in pandas-profiling.

.. code-block:: python
    :caption: Generate a report with descriptions per variable

    # df is assumed to be an existing pandas DataFrame
    profile = df.profile_report(
        variables={
            'descriptions': {
                'files': 'Files in the filesystem',
                'datec': 'Creation date',
                'datem': 'Modification date',
            }
        }
    )
    profile.to_file("report.html")
This alternative example demonstrates how you could load the definitions from a JSON file.
By default, the descriptions are presented in the overview tab and next to each variable.

.. code-block:: json
    :caption: dataset_column_definition.json

    {
        "column name 1": "column 1 definition",
        "column name 2": "column 2 definition"
    }

.. code-block:: python
    :caption: Generate a report with descriptions per variable from a definitions file

    import json
    import pandas as pd
    import pandas_profiling

    definition_file = 'dataset_column_definition.json'

    # Read the variable descriptions
    with open(definition_file, 'r') as f:
        definitions = json.load(f)

    # By default, the descriptions are presented in the overview tab and next to each variable
    # (df is assumed to be an existing pandas DataFrame)
    report = df.profile_report(variables=dict(descriptions=definitions))

    # We can disable showing the descriptions next to each variable
    report = df.profile_report(
        variables=dict(descriptions=definitions),
        show_variable_description=False
    )

    report.to_file('report.html')
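Because the descriptions are keyed by column name, a typo in the definitions file is easy to miss. A minimal sketch of a pre-flight check follows. `compare_definitions` is a hypothetical helper, not a pandas-profiling function, and it works on plain lists and dicts so that it can be applied to `list(df.columns)` and the loaded JSON alike.

```python
import json


def compare_definitions(columns, definitions):
    """Return (undescribed, unknown) column names.

    undescribed: columns of the dataset that have no description.
    unknown: described names that do not occur in the dataset.
    Both lists keep the order in which the names were given.
    """
    column_set = set(columns)
    undescribed = [name for name in columns if name not in definitions]
    unknown = [name for name in definitions if name not in column_set]
    return undescribed, unknown


if __name__ == "__main__":
    # Stand-ins for list(df.columns) and the content of the definitions file
    columns = ["files", "datec", "datem"]
    definitions = json.loads(
        '{"files": "Files in the filesystem", "date_c": "Creation date"}'
    )

    undescribed, unknown = compare_definitions(columns, definitions)
    print("Columns without a description:", undescribed)
    print("Descriptions without a column:", unknown)
```

Whether an undescribed column should be treated as an error or only reported is a choice left to the caller; the sketch only lists the differences.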
15 changes: 13 additions & 2 deletions docsrc/source/pages/support.rst
@@ -10,9 +10,20 @@ First, we need to know whether a problem is actually a bug in the code, or that
Frequent issues
~~~~~~~~~~~~~~~

- This thread discusses `conda installing older versions <https://github.com/conda-forge/pandas-profiling-feedstock/issues/22>`_ of the package.
Conda install defaults to v1.4.1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- When in a Jupyter environment, you see some text, such as ``IntSlider(value=0)`` or interactive ``(children=(IntSlider(value=0, description='x', max=1), Output()), _dom_classes=('widget-interact',))``, then the Jupyter Widgets are not activated. The :doc:`installation` page contains instructions on how to resolve this problem.
Some users experience that ``conda install -c conda-forge pandas-profiling`` defaults to 1.4.1.

More details can be found `here <https://github.com/conda-forge/pandas-profiling-feedstock/issues/22>`_, `here <https://github.com/pandas-profiling/pandas-profiling/issues/448>`__ and `here <https://github.com/pandas-profiling/pandas-profiling/issues/563>`__.

Jupyter "IntSlider(value=0)"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If, in a Jupyter environment, you see text such as ``IntSlider(value=0)`` or ``interactive(children=(IntSlider(value=0, description='x', max=1), Output()), _dom_classes=('widget-interact',))``, then the Jupyter Widgets are not activated. The :doc:`installation` page contains instructions on how to resolve this problem.


Help on Stackoverflow
---------------------

Users with a request for help on how to use `pandas-profiling` should consider asking their question on stackoverflow. There is a specific tag for `pandas-profiling`:

20 changes: 20 additions & 0 deletions examples/census/census.py
@@ -1,3 +1,4 @@
import json
from pathlib import Path

import numpy as np
@@ -38,5 +39,24 @@
# Prepare missing values
df = df.replace("\\?", np.nan, regex=True)

# Initialize the report
profile = ProfileReport(df, title="Census Dataset", explorative=True)

# Load the column definitions (the context manager closes the file afterwards)
with open("census_column_definition.json") as f:
    definitions = json.load(f)
profile.set_variable(
    "dataset",
    {
        "description": 'Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)). Prediction task is to determine whether a person makes over 50K a year.',
        "copyright_year": "1996",
        "author": "Ronny Kohavi and Barry Becker",
        "creator": "Barry Becker",
        "url": "https://archive.ics.uci.edu/ml/datasets/adult",
    },
)
profile.set_variable("variables.descriptions", definitions)

# Only show the descriptions in the overview
profile.set_variable("show_variable_description", False)

profile.to_file(Path("./census_report.html"))
16 changes: 16 additions & 0 deletions examples/census/census_column_definition.json
@@ -0,0 +1,16 @@
{
"age": "definition 0",
"workclass": "definition 1",
"fnlwgt": "definition 2",
"education": "definition 3",
"education-num": "definition 4",
"marital-status": "definition 5",
"occupation": "definition 6",
"relationship": "definition 7",
"race": "definition 8",
"sex": "definition 9",
"capital-gain": "definition 10",
"capital-loss": "definition 11",
"hours-per-week": "definition 12",
"native-country": "definition 13"
}
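The definitions file above maps every column name to a placeholder text. A file of this shape could be generated rather than typed by hand. The sketch below assumes a hypothetical helper, `placeholder_definitions`, and uses only the standard library; the column list in the usage part is an excerpt of the names shown above.

```python
import json


def placeholder_definitions(columns):
    """Map each column name to "definition <position>", in column order."""
    return {name: f"definition {index}" for index, name in enumerate(columns)}


if __name__ == "__main__":
    columns = ["age", "workclass", "fnlwgt"]
    # indent=2 is a formatting choice for readability
    print(json.dumps(placeholder_definitions(columns), indent=2))
```

The placeholders would then be replaced by real descriptions before the file is passed to the report.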