Feature Request: Enhance Sampling Mechanism in presidio-structured to Exclude Null Values #1291
The current implementation of presidio-structured samples a fixed number of rows at random to limit computation. However, this approach does not account for null values within the sampled rows. When a column is mostly null, the sample may contain few or no real values for it, which makes the sample unrepresentative and reduces the effectiveness of sensitive data identification.

I propose enhancing the sampling mechanism so that it iterates over each column individually and samples only from that column's non-null values. This keeps the sample for each column representative and improves the accuracy of sensitive data detection, because the analysis runs on meaningful data rather than on null or empty values.

As an alternative, a pre-sampling data cleaning step could filter out rows with a high volume of null values before sampling begins. However, this might exclude potentially relevant data and would add preprocessing overhead. Another approach could be a more complex sampling algorithm that weighs the presence of non-null values across columns, but this could significantly increase computational complexity and processing time.
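The per-column idea above can be sketched roughly as follows. This is plain Python over a dict of column lists, not presidio-structured code: the function name `sample_non_null_per_column`, its parameters, and the sample data are all hypothetical, and `None` stands in for a null value.

```python
import random


def sample_non_null_per_column(columns, n_samples=3, seed=0):
    """Sample up to n_samples non-null values from each column independently.

    Hypothetical helper for illustration only; `columns` maps a column
    name to a list of cell values, with None representing a null.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = {}
    for name, values in columns.items():
        # Drop nulls first so every sampled value is meaningful.
        non_null = [v for v in values if v is not None]
        # A column may hold fewer non-null values than requested.
        k = min(n_samples, len(non_null))
        samples[name] = rng.sample(non_null, k)
    return samples


# Made-up data: "email" is mostly null, "notes" is entirely null.
columns = {
    "name": ["Alice", None, "Bob", None, "Carol"],
    "email": [None, None, "bob@example.com", None, None],
    "notes": [None, None, None, None, None],
}

samples = sample_non_null_per_column(columns, n_samples=2)
for name, values in samples.items():
    print(name, values)
```

Sampling each column separately means the sampled values no longer come from the same rows, so this only fits analyses that look at one column at a time. A fully null column yields an empty sample, which the caller would need to handle.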
Replies: 1 comment 3 replies
Hi @ebotiab, Having said that, if you have a specific idea of a pre-sampling step which could be extended to support arbitrary logic, that would be a great addition to the tool.
Thanks @ebotiab. This sounds like a great addition. If you'd like to give it a first attempt, we'd be happy to collaborate.