
Measure and store coverage information for raw table #14

Open
yoid2000 opened this issue Oct 23, 2018 · 1 comment

@yoid2000
Contributor

yoid2000 commented Oct 23, 2018

As discussed in issue #15, for the purpose of computing coverage there are now two types of columns, continuous and enumerative. The contribution to the coverage value will be computed differently for these two types. Specifically, we'll treat enumerative somewhat like we've been doing, but we'll do it separately from accuracy measures, and pre-compute the information we need for the raw table and store it in a separate table on the db server machines (db001.gda-score.org and others when they exist).

For enumerative columns, I'd like to pre-compute the number of column value combinations that have more than one distinct user in the raw table. Then for each anonymization method, we'll measure what fraction of these can be viewed in the anonymized table.
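As a rough illustration of the fraction described above, here is a minimal sketch. It assumes that "can be viewed" means the value combination appears at all in the output of the anonymized table; the function name and both arguments are hypothetical and not part of the existing code base.

```python
def enumerative_coverage(raw_multi_user_values, anonymized_values):
    """Fraction of raw multi-user value combinations seen in the anonymized table.

    raw_multi_user_values: value combinations (tuples) that have more than
        one distinct user in the raw table (pre-computed).
    anonymized_values: value combinations returned by the anonymized table.
    Both parameters are illustrative assumptions.
    """
    raw = set(raw_multi_user_values)
    if not raw:
        return None  # coverage is undefined when there is nothing to cover
    visible = raw & set(anonymized_values)
    return len(visible) / len(raw)
```

Values that show up only in the anonymized output (and not in the raw multi-user set) do not count toward the fraction in this sketch.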

For any table tab, I want to create another table tab_cov which contains the enumerative coverage information for that table. Note that continuous columns can be completely ignored in the following.

tab_cov has the following columns:

  1. num_columns: This is the number of columns that comprise the information in the row.
  2. col_names: This is a string that contains the names of all of the columns for the row. Specifically, the string is formatted as ,col1,col2,col3,... In other words, each column name is preceded by a comma (,).
  3. num_values: The number of distinct value combinations for the corresponding columns.
  4. num_single_uid: The number of value combinations for which there is exactly one distinct user.
  5. num_multiple_uid: The number of value combinations for which there is more than one distinct user.
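To make the five columns concrete, here is a minimal sketch that computes one tab_cov row for a given column combination. It works on an in-memory list of dicts standing in for the raw table; the function names and the uid column name are assumptions for illustration only, and NULL handling is not addressed.

```python
from collections import defaultdict


def format_col_names(columns):
    """Build the col_names string: each column name preceded by a comma."""
    return "".join("," + c for c in columns)


def measure_combination(rows, columns, uid_col="uid"):
    """Compute one tab_cov row for the given column combination.

    rows: list of dicts, a hypothetical in-memory stand-in for the raw table.
    uid_col: assumed name of the user-identifier column.
    """
    # Map each value combination to the set of distinct users that have it.
    users_per_value = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in columns)
        users_per_value[key].add(row[uid_col])

    num_single = sum(1 for u in users_per_value.values() if len(u) == 1)
    num_multiple = sum(1 for u in users_per_value.values() if len(u) > 1)
    return {
        "num_columns": len(columns),
        "col_names": format_col_names(columns),
        "num_values": len(users_per_value),
        "num_single_uid": num_single,
        "num_multiple_uid": num_multiple,
    }
```

In this sketch num_values always equals num_single_uid + num_multiple_uid, which may be a useful sanity check on the stored rows.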

Note that, unlike coverage measures up to now, we should compute value combinations for more than two enumerative columns. You can do it like this:

First, compute the above measures for single columns.

Then, for all single columns where more than 1% of the values have multiple distinct users, generate pairs of columns and compute the above measures. Then iterate: for pairs where more than 1% of the value combinations have multiple distinct users, generate groups of three columns, etc.

I would also say that we don't need more than 100 instances of any given combination size. In other words, we don't need more than 100 single columns, 100 pairs of columns, 100 groups of 3 columns, etc.
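The iteration and the cap of 100 could be sketched as follows. Several details here are assumptions rather than anything stated in the issue: larger groups are built by adding one qualifying single column to a qualifying smaller group, the 100 kept per size are simply the first 100 in sorted order, and all names (grow_combinations, uid_col, the constants) are hypothetical.

```python
from collections import defaultdict

MAX_PER_SIZE = 100          # at most 100 combinations per combination size
MIN_MULTI_FRACTION = 0.01   # the 1% threshold


def multi_uid_fraction(rows, columns, uid_col="uid"):
    """Fraction of value combinations that have more than one distinct user."""
    users = defaultdict(set)
    for row in rows:
        users[tuple(row[c] for c in columns)].add(row[uid_col])
    if not users:
        return 0.0
    multi = sum(1 for u in users.values() if len(u) > 1)
    return multi / len(users)


def grow_combinations(rows, enum_columns, uid_col="uid"):
    """Return {size: [column tuples to measure at that size]}."""
    measured = {}
    current = [(c,) for c in sorted(enum_columns)][:MAX_PER_SIZE]
    base = []  # single columns that pass the 1% threshold
    size = 1
    while current:
        measured[size] = current
        qualifying = [
            combo for combo in current
            if multi_uid_fraction(rows, combo, uid_col) > MIN_MULTI_FRACTION
        ]
        if size == 1:
            base = [combo[0] for combo in qualifying]
        # Extend each qualifying group by one qualifying single column;
        # sorting the names makes (a, b) and (b, a) the same candidate.
        candidates = sorted({
            tuple(sorted(set(combo) | {col}))
            for combo in qualifying
            for col in base
            if col not in combo
        })
        current = candidates[:MAX_PER_SIZE]
        size += 1
    return measured
```

The loop stops on its own once no group passes the threshold or every qualifying column is already in the group.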

As with #12, please produce a file with SQL CREATE and INSERT commands for the tables.
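One possible shape for that file is sketched below. The SQL column types are assumptions (the issue only names the columns), the helper names are hypothetical, and the table name is assumed to be a trusted identifier since it is interpolated directly into the statement.

```python
def sql_literal(text):
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def coverage_sql(table, cov_rows):
    """Return CREATE and INSERT statements for the <table>_cov table.

    cov_rows: list of dicts with the five tab_cov columns as keys.
    Column types below are assumed, not specified in the issue.
    """
    cov_table = table + "_cov"
    lines = [
        f"CREATE TABLE {cov_table} (",
        "  num_columns INTEGER,",
        "  col_names TEXT,",
        "  num_values INTEGER,",
        "  num_single_uid INTEGER,",
        "  num_multiple_uid INTEGER",
        ");",
    ]
    for r in cov_rows:
        lines.append(
            f"INSERT INTO {cov_table} VALUES ("
            f"{r['num_columns']}, {sql_literal(r['col_names'])}, "
            f"{r['num_values']}, {r['num_single_uid']}, {r['num_multiple_uid']});"
        )
    return "\n".join(lines) + "\n"
```

Writing the returned string to a .sql file gives something that can be replayed on the db server machines.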

@yoid2000
Contributor Author

@srnb I updated the issue. It is ready now.

2 participants