Measure and store coverage information for raw table #14

yoid2000 · 2018-10-23T19:54:21Z

As discussed in issue #15, columns now fall into two types for the purpose of computing coverage: continuous and enumerative. Each type contributes to the coverage value differently. Enumerative columns will be handled much as before, but separately from the accuracy measures: we pre-compute the information we need from the raw table and store it in a separate table on the db server machines (db001.gda-score.org, and others when they exist).

For enumerative columns, I'd like to pre-compute the number of column value combinations that have more than one distinct user in the raw table. Then for each anonymization method, we'll measure what fraction of these can be viewed in the anonymized table.
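A minimal sketch of the fraction described above, assuming the multi-user value combinations from the raw table and the combinations visible in the anonymized table are both available as collections of tuples. The function name `enumerative_coverage` and both parameter names are hypothetical, and how "can be viewed" is decided for a given anonymization method is not specified here.

```python
def enumerative_coverage(raw_multi_user, anonymized_visible):
    """Fraction of raw multi-user value combinations that are also visible
    in the anonymized table. Returns None when the raw set is empty."""
    raw = set(raw_multi_user)
    if not raw:
        return None
    # Only combinations that exist in the raw multi-user set count as covered.
    return len(raw & set(anonymized_visible)) / len(raw)
```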

For any table tab, I want to create another table tab_cov which contains the enumerative coverage information for that table. Note that continuous columns can be completely ignored in the following.

tab_cov has the following columns:

num_columns: The number of columns that make up the value combination in the row.
col_names: A string containing the names of all of the columns for the row, formatted as ,col1,col2,col3.... In other words, each column name is preceded by a comma.
num_values: The number of distinct value combinations for the corresponding columns.
num_single_uid: The number of value combinations for which there is exactly one distinct user.
num_multiple_uid: The number of value combinations for which there is more than one distinct user.
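The columns above could be computed for one column combination roughly as follows. This is a sketch using an in-memory SQLite database, not the project's actual code: the user-id column name `uid`, the demo table `accounts`, and the helpers `col_names_string`, `measure_combination`, and `demo_connection` are all assumptions made for illustration.

```python
import sqlite3

def col_names_string(cols):
    """Format column names as ',col1,col2,...' (each name preceded by a comma)."""
    return "".join("," + c for c in cols)

def measure_combination(conn, table, cols, uid_col="uid"):
    """Return one tab_cov row (as a dict) for the given column combination.

    Table and column names are interpolated directly into the SQL,
    so they must come from a trusted source.
    """
    col_list = ", ".join(cols)
    # Inner query: distinct users per value combination.
    # Outer query: count all combinations, and those with one or several users.
    sql = f"""
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN n = 1 THEN 1 ELSE 0 END), 0),
               COALESCE(SUM(CASE WHEN n > 1 THEN 1 ELSE 0 END), 0)
        FROM (SELECT COUNT(DISTINCT {uid_col}) AS n
              FROM {table}
              GROUP BY {col_list}) AS per_combination
    """
    num_values, single, multiple = conn.execute(sql).fetchone()
    return {
        "num_columns": len(cols),
        "col_names": col_names_string(cols),
        "num_values": num_values,
        "num_single_uid": single,
        "num_multiple_uid": multiple,
    }

def demo_connection():
    """Build a small in-memory raw table (hypothetical data) for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (uid INTEGER, gender TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "M", "Berlin"), (2, "M", "Berlin"), (3, "F", "Berlin"),
         (3, "F", "Paris"), (4, "F", "Paris"), (5, "M", "Rome")],
    )
    return conn
```

For example, `measure_combination(demo_connection(), "accounts", ["gender", "city"])` yields a row with `col_names` equal to `,gender,city`.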

Note that, unlike coverage measures up to now, we should compute value combinations for more than two enumerative columns. You can do it like this:

First, compute the above measures for single columns.

Then, for all single columns where more than 1% of the values have multiple distinct users, generate pairs of columns and compute the above measures for each pair. Then iterate: for pairs where more than 1% of the value combinations have multiple distinct users, generate groups of three columns, etc.

I would also say that we don't need more than 100 instances of any given combination size. In other words, we don't need more than 100 single columns, 100 pairs of columns, 100 groups of 3 columns, etc.
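The iteration described above can be sketched in plain Python over rows held in memory. Several details are assumptions, since the text leaves them open: a qualifying combination is extended by adding one qualifying single column, the cap keeps the first 100 combinations in generation order, and the names `measure`, `coverage_rows`, `THRESHOLD`, and `MAX_PER_SIZE` are hypothetical.

```python
from collections import defaultdict

THRESHOLD = 0.01     # more than 1% of value combinations must have multiple users
MAX_PER_SIZE = 100   # at most 100 combinations per combination size

def measure(rows, cols, uid_col="uid"):
    """Compute the tab_cov measures for one column combination."""
    users = defaultdict(set)
    for row in rows:
        users[tuple(row[c] for c in cols)].add(row[uid_col])
    multiple = sum(1 for u in users.values() if len(u) > 1)
    return {
        "num_columns": len(cols),
        "col_names": "".join("," + c for c in cols),
        "num_values": len(users),
        "num_single_uid": len(users) - multiple,
        "num_multiple_uid": multiple,
    }

def coverage_rows(rows, enum_cols, uid_col="uid"):
    """Measure single columns, then grow qualifying combinations one column at a time."""
    results = []
    candidates = [(c,) for c in enum_cols][:MAX_PER_SIZE]
    qualifying_singles = None
    while candidates:
        qualifying = []
        for cols in candidates:
            m = measure(rows, cols, uid_col)
            results.append(m)
            if m["num_values"] and m["num_multiple_uid"] / m["num_values"] > THRESHOLD:
                qualifying.append(cols)
        if qualifying_singles is None:
            # First pass: remember which single columns passed the threshold.
            qualifying_singles = [cols[0] for cols in qualifying]
        # Assumption: extend each qualifying combination by one qualifying column.
        seen, next_candidates = set(), []
        for cols in qualifying:
            for c in qualifying_singles:
                if c in cols:
                    continue
                # Keep columns in their original order so duplicates are detected.
                combo = tuple(sorted(cols + (c,), key=enum_cols.index))
                if combo not in seen:
                    seen.add(combo)
                    next_candidates.append(combo)
        candidates = next_candidates[:MAX_PER_SIZE]
    return results
```

The loop terminates because each pass increases the combination size by one and no combination can hold more columns than there are qualifying single columns.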

As with #12, please produce a file with SQL CREATE and INSERT commands for the tables.
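One possible way to produce such a file is sketched below. The column types (INTEGER and TEXT), the `<table>_cov` naming applied by the hypothetical function `tab_cov_sql`, and the sample row are assumptions; the actual db servers may need different types or syntax.

```python
import sqlite3

COLUMNS = ("num_columns", "col_names", "num_values",
           "num_single_uid", "num_multiple_uid")

def tab_cov_sql(table, cov_rows):
    """Return SQL text that creates <table>_cov and inserts the given rows."""
    name = f"{table}_cov"
    lines = [
        f"CREATE TABLE {name} (",
        "    num_columns INTEGER,",
        "    col_names TEXT,",
        "    num_values INTEGER,",
        "    num_single_uid INTEGER,",
        "    num_multiple_uid INTEGER",
        ");",
    ]
    for r in cov_rows:
        # Escape single quotes in the only string-valued column.
        names = r["col_names"].replace("'", "''")
        lines.append(
            f"INSERT INTO {name} ({', '.join(COLUMNS)}) VALUES "
            f"({int(r['num_columns'])}, '{names}', {int(r['num_values'])}, "
            f"{int(r['num_single_uid'])}, {int(r['num_multiple_uid'])});"
        )
    return "\n".join(lines) + "\n"

# Usage: generate the script for a hypothetical table and check that it loads.
sample = [{"num_columns": 2, "col_names": ",gender,city", "num_values": 4,
           "num_single_uid": 2, "num_multiple_uid": 2}]
script = tab_cov_sql("accounts", sample)
check = sqlite3.connect(":memory:")
check.executescript(script)
```

Writing `script` to a file gives the requested CREATE and INSERT commands in one place.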

yoid2000 · 2018-10-25T08:44:38Z

@srnb I updated the issue. It is ready now.

yoid2000 assigned srnb Oct 23, 2018

yoid2000 mentioned this issue Oct 25, 2018

Additional label for tab_char information (extension of issue #12) #15

Closed

yoid2000 mentioned this issue Nov 8, 2018

New accuracy measure for utility #18

Closed
