Feature Request: Enhance Sampling Mechanism in presidio-structured to Exclude Null Values #1291

ebotiab · 2024-02-09T11:55:04Z

ebotiab
Feb 9, 2024

The current implementation of presidio-structured samples a fixed number of rows at random to limit computation. However, this approach does not account for null values within the sampled rows. This can lead to a scenario where the sampled data is not representative due to a high volume of null values, thereby reducing the effectiveness of sensitive data identification.

I propose an enhancement to the sampling mechanism where the system iterates through each column individually to perform the sampling. This iteration would ensure that the sampled rows for each column are devoid of null values, thus maintaining the representativeness and integrity of the sample. Such a method would improve the accuracy of sensitive data detection by ensuring that the analysis is performed on meaningful data rather than null or empty values.
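A minimal sketch of the per-column idea, written in library-agnostic Python rather than against presidio-structured itself. It assumes a table represented as a dict mapping column names to lists, with `None` standing in for null; the function name `sample_non_null_per_column` is hypothetical and not part of any existing API.

```python
import random
from typing import Any, Dict, List, Optional


def sample_non_null_per_column(
    table: Dict[str, List[Any]],
    n: int,
    seed: Optional[int] = None,
) -> Dict[str, List[Any]]:
    """Sample up to n non-null values from each column independently."""
    rng = random.Random(seed)
    samples: Dict[str, List[Any]] = {}
    for name, values in table.items():
        # Drop nulls first so they can never occupy a sample slot.
        non_null = [v for v in values if v is not None]
        # A column may hold fewer than n non-null values (or none at all).
        k = min(n, len(non_null))
        samples[name] = rng.sample(non_null, k)
    return samples


table = {
    "email": [None, "a@example.com", None, "b@example.com", None],
    "notes": [None, None, None, None, None],
    "name": ["Ann", "Bob", None, "Cy", "Dee"],
}
sampled = sample_non_null_per_column(table, n=2, seed=0)
```

Because each column is sampled on its own, the sampled values no longer line up as rows. That is fine for column-level detection but would not suit any analysis that needs cross-column context. An all-null column such as `notes` simply yields an empty sample.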

As an alternative, a pre-sampling data cleaning step could be introduced, where rows with a high volume of null values are filtered out before the sampling process begins. However, this might lead to the exclusion of potentially relevant data and increase preprocessing overhead. Another approach could involve a more complex sampling algorithm that weighs the presence of non-null values across different columns, but this could significantly increase computational complexity and processing time.
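The pre-sampling cleaning alternative could look roughly like the sketch below. It assumes rows are plain dicts with `None` as null; the `max_null_fraction` threshold and both function names are illustrative choices, not existing presidio-structured parameters.

```python
import random
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def null_fraction(row: Row) -> float:
    """Return the share of fields in a row that are null."""
    if not row:
        return 0.0
    return sum(v is None for v in row.values()) / len(row)


def filter_then_sample(
    rows: List[Row],
    n: int,
    max_null_fraction: float = 0.5,
    seed: Optional[int] = None,
) -> List[Row]:
    """Drop rows that are mostly null, then sample up to n of the rest."""
    rng = random.Random(seed)
    kept = [r for r in rows if null_fraction(r) <= max_null_fraction]
    return rng.sample(kept, min(n, len(kept)))


rows = [
    {"name": "Ann", "email": "a@example.com", "phone": None},     # 1/3 null
    {"name": None, "email": None, "phone": None},                 # all null
    {"name": "Bob", "email": None, "phone": None},                # 2/3 null
    {"name": "Cy", "email": "c@example.com", "phone": "555-0100"},  # no nulls
]
picked = filter_then_sample(rows, n=3, max_null_fraction=0.5, seed=1)
```

The row for "Bob" is discarded even though its `name` field is populated, which illustrates the drawback mentioned above: a row-level filter can throw away relevant values in otherwise sparse rows.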


omri374 · 2024-02-09T12:11:46Z

omri374
Feb 9, 2024
Maintainer

Hi @ebotiab,
Since the user is the one passing the DataFrame, can't this be done a priori? Moreover, one can sample the DataFrame using the logic of their choice, analyze which columns are sensitive based on that sample, and then apply the result to the full data.

Having said that, if you have a specific idea of a pre-sampling step which could be extended to support arbitrary logic, that would be a great addition to the tool.
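The "sample first, then apply to the full data" workflow can be sketched as follows. This is not Presidio code: the regex-based `detect_sensitive_columns` is a toy stand-in for the real analysis step, and `mask_columns`, the dict-of-lists table layout, and the `<REDACTED>` placeholder are all assumptions made for illustration.

```python
import random
import re
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[str]]]

# Deliberately simple pattern; a real detector would be far more thorough.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def detect_sensitive_columns(sample: Table) -> List[str]:
    """Toy detector: flag a column if any sampled value looks like an email."""
    return [
        name
        for name, values in sample.items()
        if any(v is not None and EMAIL_RE.fullmatch(v) for v in values)
    ]


def mask_columns(table: Table, columns: List[str]) -> Table:
    """Redact non-null values in the flagged columns; copy the rest as-is."""
    return {
        name: ["<REDACTED>" if v is not None else None for v in values]
        if name in columns
        else list(values)
        for name, values in table.items()
    }


full: Table = {
    "contact": [None, None, "a@example.com", None, "b@example.com", None],
    "city": ["Oslo", None, "Lima", "Rome", None, "Kyiv"],
}

# Step 1: the caller samples with logic of their own choosing (here: non-null only).
rng = random.Random(0)
sample: Table = {}
for name, values in full.items():
    non_null = [v for v in values if v is not None]
    sample[name] = rng.sample(non_null, min(2, len(non_null)))

# Step 2: decide which columns are sensitive based on the sample alone.
sensitive = detect_sensitive_columns(sample)

# Step 3: apply that decision to the full data.
masked = mask_columns(full, sensitive)
```

Detection only ever sees the small sample, while the masking step runs over every row, so the cost of analysis stays bounded regardless of table size.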

3 replies

ebotiab Feb 13, 2024
Author

Thank you @omri374. You're right that users can preprocess the DataFrame to exclude null values and sample data using their preferred logic before passing it to presidio-structured for analysis. This approach does offer flexibility and allows users to tailor the preprocessing to their specific needs.

Building on this, I believe integrating two preprocessor classes into the framework, a PandasPreprocessor (inheriting from a TabularPreprocessor base class) and a JSONPreprocessor, could offer a structured and extensible way to handle not only sampling but also other common preprocessing steps. These preprocessors could provide a suite of ready-to-use functions, such as null-value handling, normalization, or more sophisticated data transformations, which users can optionally apply before the sensitive data identification process.
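One possible shape for such a hierarchy is sketched below. None of these classes exist in presidio-structured today; the `Preprocessor` base class, the `preprocess` method name, and `DictTablePreprocessor` (a dependency-free stand-in for the proposed PandasPreprocessor) are all hypothetical, and only null removal is shown.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Preprocessor(ABC):
    """Hypothetical common interface for all preprocessing steps."""

    @abstractmethod
    def preprocess(self, data: Any) -> Any:
        """Return a cleaned copy of the input data."""


class TabularPreprocessor(Preprocessor):
    """Hypothetical base class for column-oriented data."""


class DictTablePreprocessor(TabularPreprocessor):
    """Stand-in for the proposed PandasPreprocessor, using a dict of lists."""

    def preprocess(self, data: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
        # Remove nulls from each column independently.
        return {
            name: [v for v in values if v is not None]
            for name, values in data.items()
        }


class JSONPreprocessor(Preprocessor):
    """Hypothetical preprocessor that strips nulls from nested JSON-like data."""

    def preprocess(self, data: Any) -> Any:
        if isinstance(data, dict):
            return {k: self.preprocess(v) for k, v in data.items() if v is not None}
        if isinstance(data, list):
            return [self.preprocess(v) for v in data if v is not None]
        return data


table_clean = DictTablePreprocessor().preprocess({"a": [1, None, 2], "b": [None]})
json_clean = JSONPreprocessor().preprocess(
    {"user": {"name": "Ann", "email": None}, "tags": [None, "x"]}
)
```

Keeping a single `preprocess` entry point would let further steps such as normalization be added as extra subclasses or chained calls without changing how the analysis stage consumes the data.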

omri374 Feb 13, 2024
Maintainer

Thanks @ebotiab. This sounds like a great addition. If you'd like to give it a first attempt, we'd be happy to collaborate.

Answer selected by ebotiab

ebotiab Feb 13, 2024
Author

Absolutely, I'd be glad to contribute.
