docs: add information about pii classification feature (#1517)
* Update duplicates_pandas.py (#1427)

Fixing Bug Report #1384
Dataset with categorical features causes memory error even on tiny dataset.

* chore(actions): update sonarsource/sonarqube-scan-action action to v2.0.1

* chore(actions): update actions/checkout action to v4

* docs: setup new docs with mkdocs (#1418)

* chore(actions): update actions/checkout action to v4

* fix: remove the duplicated cardinality threshold under categorical and text settings

* fix: fixate matplotlib upper version

* docs: change from `zap` to `sparkles` (#1447)

Co-authored-by: Fabiana <[email protected]>

* fix: template {{ file_name }} error in HTML wrapper (#1380)

* Update javascript.html

* Update style.html

* feat: add density histogram (#1458)

* feat: add histogram density option

* test: add unit test

* fix: discard weights if exceed max_bins

* docs: update README.html (#1461)

Update url of use cases, main integrations, and common issues.

* fix: bug when creating a new report (#1440)

* fix: gen wordcloud only for non-empty cols (#1459)

* fix: table template ignoring text format (#1462)

* fix: table template ignoring text format

* fix: timeseries unit test

* fix(linting): code formatting

---------

Co-authored-by: Azory YData Bot <[email protected]>

* fix: to_category misshandling pd.NA (#1464)

* docs: add 📊 for Key features (#1451)

See also #1445 (comment)

* docs: fix hyperlink - related to package name change (#1457)

Co-authored-by: Martin Mokry <[email protected]>

* chore(deps): increase numpy upper limit (#1467)

* chore(deps): increase numpy upper limit

* chore(deps): fixate numpy version for spark

* chore(deps): fix numba package version, and filter warns (#1468)

* chore: fix numba package version, and filter warns

* fix: skip isort linter on init

* chore(deps): update dependency typeguard to v4 (#1324)

* chore(deps): update dependency typeguard to v4

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Maciej Bukczynski <[email protected]>

* docs: update docs with advent of code

* docs: update links for fabric

* chore(actions): update actions/setup-python action to v5

* docs: add information about PII classification & management.

---------

Co-authored-by: boris-kogan <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Vasco Ramos <[email protected]>
Co-authored-by: ricardodcpereira <[email protected]>
Co-authored-by: Anselm Hahn <[email protected]>
Co-authored-by: Joge <[email protected]>
Co-authored-by: Alex Barros <[email protected]>
Co-authored-by: Miriam Seoane Santos <[email protected]>
Co-authored-by: Chris Mahoney <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
Co-authored-by: martin-kokos <[email protected]>
Co-authored-by: Martin Mokry <[email protected]>
Co-authored-by: Maciej Bukczynski <[email protected]>
Co-authored-by: Fabiana Clemente <[email protected]>
15 people authored Dec 7, 2023
1 parent 8d4d347 commit a9c4114
Showing 5 changed files with 85 additions and 16 deletions.
13 changes: 8 additions & 5 deletions docs/features/collaborative_data_profiling.md
@@ -1,16 +1,19 @@
# Data Catalog - A collaborative experience to profile datasets & relational databases
# Data Catalog **
A collaborative experience to profile datasets & relational databases

!!! note "Data Catalog with data quality profiling"
!!! info "** YData's Enterprise feature"

This feature is only available for users of [YData Fabric](https://ydata.ai).

[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **data catalog**
and **collaborative** experience for datasets and database profiling at scale!
[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **Data catalog**

[YData Fabric](https://ydata.ai/products/fabric) is a Data-Centric AI
development platform. YData Fabric provides all capabilities of
ydata-profiling in a hosted environment combined with a guided UI
experience.

[Fabric's Data Catalog](https://ydata.ai/products/data_catalog)
[Fabric's Data Catalog](https://ydata.ai/products/data_catalog),
a scalable and interactive version of ydata-profiling,
provides a comprehensive and powerful tool designed to enable data
professionals, including data scientists and data engineers, to manage
and understand data within an organization. The Data Catalog act as a
59 changes: 59 additions & 0 deletions docs/features/pii_identification_management.md
@@ -0,0 +1,59 @@
# Personally identifiable information (PII) identification & management **

!!! info "** YData's Enterprise feature"

This feature is only available for users of [YData Fabric](https://ydata.ai).

[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
start your journey into **data management** with automated PII identification.

Personally identifiable information **(PII)** refers to any information that can be used to identify an individual.
This includes, but is not limited to, names, addresses, phone numbers, social security numbers, email addresses,
and financial information. PII is crucial in today's digital age, where data is extensively collected, stored,
and processed.

[YData Fabric Data Catalog](https://ydata.ai/products/data_catalog), a scalable and interactive version of ydata-profiling,
integrates into the data profiling experience an advanced machine learning solution based on a Named Entity Recognition (NER) model
combined with traditional rule-based pattern identification, allowing PII to be detected efficiently.
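
To make the rule-based half of that approach concrete, here is a minimal sketch of pattern-based PII detection over a column of values. It is not Fabric's implementation: the `PII_PATTERNS` table, the `detect_pii` helper, and the 50% match threshold are all assumptions made for illustration, and the NER component is not shown.

```python
import re

# Hypothetical, deliberately simplified patterns (assumptions, not Fabric's rules).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def detect_pii(values, threshold=0.5):
    """Return the PII labels whose pattern fully matches at least
    `threshold` of the non-empty values in a column."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


if __name__ == "__main__":
    print(detect_pii(["a@b.com", "c@d.org"]))
    print(detect_pii(["hello", "world"]))
```

Patterns like these are cheap and predictable but only catch well-formatted values, which is presumably why the text pairs them with an NER model for free-form fields such as names and addresses.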

:fontawesome-brands-youtube:{ .youtube }
<a href="https://www.youtube.com/clip/UgkxBntXvAvCQ6I39Cp2KZRD4Ug9-NPzG1o1"><u>See Fabric's Data Catalog PII identification in action</u></a>.

## Why Fabric Catalog automated PII identification?

The relevance of automating the identification of PII lies in the need to protect individuals' privacy and comply
with various data protection regulations. Mishandling or unauthorized access to PII can lead to severe consequences
such as identity theft, financial fraud, and breaches of privacy. With the increasing volume of data generated, manual
identification of PII becomes impractical and error-prone.

Additionally, having a robust PII management solution is essential for organizations to establish and maintain
a secure approach to handling sensitive information, fostering trust and adhering to legal requirements.

## Why Fabric to manage dataset PII identification

Besides automated PII identification, *Fabric Catalog* offers several key benefits in the context of data governance,
privacy compliance and overall data management, through automated data profiling and metadata management:

### Compliance with Privacy Regulations:
Many countries and regions have stringent data protection regulations (such as GDPR, CCPA, or HIPAA)
that require organizations to handle PII responsibly. A dedicated platform ensures that PII is correctly classified,
helping organizations comply with legal requirements and avoid potential penalties.

### Data Profiling for Accuracy:

Data profiling involves analyzing and understanding the structure and content of data. By incorporating data profiling
capabilities into the platform, organizations can ensure accurate identification and classification of PII.
This helps in maintaining the integrity of data and reduces the risk of misclassifications.

### Efficient Management of PII:
As the volume of data continues to grow, manually managing and editing PII classifications becomes impractical.
A platform streamlines this process, making it more efficient and reducing the likelihood of errors.
It allows organizations to keep track of PII across various datasets and systems.

### Facilitating Data Governance:

Data governance involves establishing policies and processes to ensure high data quality, security, and compliance.
A PII management solution enhances data governance efforts by providing a centralized hub for overseeing PII classifications,
metadata, and related policies.


4 changes: 4 additions & 0 deletions docs/features/sensitive_data.md
@@ -56,3 +56,7 @@ pd.read_csv("filename.csv", dtype={"phone": str})
Note that the type detection is hard. That is why
[visions](https://github.com/dylan-profiler/visions), a type system to
help developers solve these cases, was developed.

## Automated PII classification & management

You can find more details about this feature [here](pii_identification_management.md).
24 changes: 13 additions & 11 deletions docs/index.md
@@ -9,14 +9,15 @@ understanding and preparing data for analysis in a single line of code! If you'r

!!! tip "Advent of Code - Get featured on ydata-profiling"

*“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to get more involved with open-source software, but no one’s given you an entry point?
*“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to
get more involved with open-source software, but no one’s given you an entry point?

That's why we joined [The Advent of code this year](https://zilliz.com/advent-of-code). Contribute to ydata-profiling and win some 🐼🐼 swag!

How can you be part of it?

- Give us some love with a Github ⭐
- Write an article or create a tutorial like other [members the communit already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
- Write an article or create a tutorial like other [members the community already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
- Feeling adventurous? Contribute with a PR. We have a list of [great issues to get you started.](https://github.com/ydataai/ydata-profiling/issues?q=label%3A%22getting+started+%E2%98%9D%22+)

![ydata-profiling report](_static/img/ydata-profiling.gif)
@@ -55,15 +56,16 @@ YData-profiling can be used to deliver a variety of different applications. The

Check out the [free Community Version](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community).

| Features & functionalities | Description |
|------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple version of the same dataset |
| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](features/custom_report_appearance.md ) | Changing the appearance of the report's page and of the contained visualizations |
| [Profiling Databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows to consume data from different types of storages such as RDBMs (Azure SQL, PostGreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
| Features & functionalities | Description |
|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple version of the same dataset |
| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](features/custom_report_appearance.md ) | Changing the appearance of the report's page and of the contained visualizations |
| [Profiling Relational databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows to consume data from different types of storages such as RDBMs (Azure SQL, PostGreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
| [PII classification & management **](features/pii_identification_management.md ) | Automated PII classification and management through a UI experience |

### Tutorials

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -15,6 +15,7 @@ nav:
- Dataset metadata: 'features/metadata.md'
- Datasets catalog **: 'features/collaborative_data_profiling.md'
- Sensitive data: 'features/sensitive_data.md'
- Automated PII classification & management **: 'features/pii_identification_management.md'
- Time-series: 'features/time_series_datasets.md'
- Comparing datasets: 'features/comparing_datasets.md'
- Big data: 'features/big_data.md'