Write attack for Uber differential privacy anonymization #29

Open
yoid2000 opened this issue Nov 20, 2018 · 49 comments

@yoid2000
Contributor

We're going to use this to attack the Uber anonymization system. I'm not sure what queries that system allows, but @rbh-93 is working on it, so he can answer questions about that or give you access to an implementation.

In our attack, we want to make a query that has exactly one user in the answer with some reasonable probability. In the attack, we find out if that is the case or not. If it is the case, then we make a singling-out claim for that user. If not, then we don't make a claim.

The first step is to find sets of column values or value ranges that have a good chance of identifying a single user. If you know the number of distinct users associated with any given column value, and you know the number of users in the table, then prob_user1 = col_val_users1/total_users is the probability that any given user has that column value. Then you want to find cases where:

total_users * prob_user1 * prob_user2 * ... = 1 (roughly)

In other words, the expected number of users with column/value 1 and column/value 2 and ... is one.
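
As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the numbers are made up for the example, and multiplying the probabilities assumes that the columns are statistically independent, which real data may not satisfy.

```python
from math import prod


def expected_matches(total_users, value_user_counts):
    """Expected number of users matching every column value.

    value_user_counts holds, for each chosen column value, the number of
    distinct users that have it. Assumes the columns are independent.
    """
    probs = [count / total_users for count in value_user_counts]
    return total_users * prod(probs)


# Hypothetical table: 5000 users, three column values held by
# 250, 100 and 1000 distinct users respectively.
estimate = expected_matches(5000, [250, 100, 1000])
print(estimate)  # close to 1, so this combination is a candidate
```

A combination is worth attacking when the estimate is close to one.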

You can learn the total number of users with:

select count(distinct uid)
from table

To learn these probabilities for any given column, you can query the raw database with this query:

select column, count(distinct uid)
from table
group by 1
order by 2 desc
limit 200

Use the askExplore() call on the raw database (rawDb) to do these.
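
To make the two exploration queries concrete, the sketch below runs them against a throwaway in-memory SQLite table. This is only a stand-in: the table, the rows, and the use of sqlite3 are assumptions for illustration, and in the real attack the queries would go through askExplore() on rawDb.

```python
import sqlite3

# Toy stand-in for the raw database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (uid INTEGER, frequency TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "monthly"), (2, "monthly"), (3, "weekly"), (3, "weekly"), (4, "monthly")],
)

# Total number of distinct users.
total_users = conn.execute(
    "SELECT count(DISTINCT uid) FROM accounts"
).fetchone()[0]

# Distinct users per column value, most common values first.
rows = conn.execute(
    "SELECT frequency, count(DISTINCT uid) FROM accounts "
    "GROUP BY frequency ORDER BY 2 DESC LIMIT 200"
).fetchall()

# Probability that any given user has each value.
probabilities = {value: users / total_users for value, users in rows}
print(total_users, rows, probabilities)
```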

Once you have a set of columns and values where this is the case, you can make a query like this:

select count(distinct uid)
from table
where col1 = val1 and col2 = val2 and ...
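
One possible way to assemble such a query from a list of column/value pairs is sketched below. The helper name build_attack_query is hypothetical, the "?" placeholder style is the one used by sqlite3 (other drivers use a different marker), and whether askAttack() accepts parameterized queries is not stated here, so treat this as an illustration only.

```python
def build_attack_query(table, uid_col, conditions):
    """Return (sql, params) counting distinct users that match all conditions.

    conditions is a list of (column, value) pairs. Column and table names
    are interpolated directly, so they must come from a trusted list.
    """
    if not conditions:
        raise ValueError("need at least one (column, value) condition")
    where = " AND ".join(f"{col} = ?" for col, _ in conditions)
    sql = f"SELECT count(DISTINCT {uid_col}) FROM {table} WHERE {where}"
    params = [val for _, val in conditions]
    return sql, params


sql, params = build_attack_query(
    "accounts",
    "account_id",
    [("frequency", "POPLATEK TYDNE"), ("disp_type", "OWNER")],
)
print(sql)
print(params)
```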

For the Uber system, each time you repeat the query, you get a new noise value with mean zero. So if you take X answers and average them, the noise tends to cancel out, and the average gets close to the true answer with a probability that increases with X.

After X queries, we predict that the true answer is 1 if the averaged answer is between 0.5 and 1.5.

So we make the query X times, average the answers, and make a guess. For this query, use the askAttack() call, so that the system records it as an attack query. Once you have a guess, use the askClaim() call to record the guess. You can see examples of how these are used for other attacks in code/attacks.
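
The averaging step can be simulated in a few lines of Python. This is a sketch under assumptions: the noise is modelled as Laplace with an arbitrary scale (the text above only says the noise has mean zero), the helper names are made up, and treating the 0.5 and 1.5 boundaries as exclusive is a choice made for the example.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with mean zero.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def averaged_answer(query_fn, repeats):
    """Call query_fn `repeats` times and return the mean answer."""
    return sum(query_fn() for _ in range(repeats)) / repeats


def is_single_user(avg):
    """Predict a true count of 1 when the average is between 0.5 and 1.5."""
    return 0.5 < avg < 1.5


rng = random.Random(0)
true_count = 1  # hidden from the attacker in a real attack


def noisy_query():
    return true_count + laplace_noise(rng, scale=2.0)


avg = averaged_answer(noisy_query, repeats=1000)
print(avg, is_single_user(avg))
```

With more repetitions the average moves closer to the true count, so the choice of X trades query budget against confidence in the claim.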

@AnirbanGhosh1512
Contributor

Started Working on it.

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

It took me some time to understand the exact requirements. Please tell me whether my understanding is correct.

  1. I need to use the REST API built by @rbh-93 to learn the probabilities.
  2. Once I have the set of columns, I can use askAttack() and askClaim() from the attack script to predict the true answer.

Regards,
Anirban Ghosh

@yoid2000
Contributor Author

We will incorporate Rohan's REST interface into gdaScore, so you won't use his interface directly. Rather, you'll use askExplore() to make the preliminary queries, askAttack() to make the attack queries (to establish an average value), and askClaim() to make a claim about your guessed answer.

Until we have incorporated Rohan's REST interface, you can test your code against rawDb. I'm out of town right now, but will be back on Friday if you want to chat about it.

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Nov 30, 2018 via email

@yoid2000
Contributor Author

yoid2000 commented Nov 30, 2018 via email

@yoid2000
Contributor Author

yoid2000 commented Dec 3, 2018

@AnirbanGhosh1512

As a step in this attack, you make a query like

select count(distinct uid)
from table
where col1 = val1 and col2 = val2 and ...

I have written a class method called getPublicColValues(), which is meant to return a set of column values that may reasonably be publicly known. You can read about this interface at https://gda-score.github.io/gdaScore.m.html

When you write the part that looks for appropriate values, please limit yourself to values discovered by getPublicColValues().

Let me know if you have questions.

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

Below is my current .json configuration:
{
  "localBankingRaw": {
    "host": "db001.gda-score.org",
    "port": 5432,
    "dbname": "banking",
    "user": "[email protected]",
    "password": "Aic0phuLoo0i",
    "type": "postgres"
  },
  "cloakBankingAnon": {
    "host": "attack.aircloak.com",
    "port": 8432,
    "dbname": "banking",
    "user": "[email protected]",
    "password": "secret",
    "type": "aircloak"
  }
}

The first entry, localBankingRaw, works fine for me as a config string, but the second one, cloakBankingAnon, seems to contain parameters that are not authorized to access the database. When I tried the settings of my colleague Ali Reza, it worked fine. Perhaps I need my own access to attack.aircloak.com.

Regards,
Anirban

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

Thanks, it is now working with my newly created login.

Regards,
Anirban

@yoid2000
Contributor Author

yoid2000 commented Dec 11, 2018 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

As per the stated issue, you asked me to use the following:
To learn these probabilities for any given column, you can query the raw database with this query:

select column, count(distinct uid)
from table
group by 1
order by 2 desc
limit 200

Use the askExplore() call on the raw database (rawDb) to do these.

As far as I can tell, askExplore() does nothing more than put queries on a queue. However, getPublicColValues() already builds this query dynamically, so I only need to pass it the column names in a loop. Based on the results, I can then calculate the probabilities and generate the attack query.

Am I right? Please let me know if I have misunderstood.

Regards,
Anirban Ghosh

@yoid2000
Contributor Author

Yes, your understanding is correct. You can loop through the column names and learn a set of values for each.

By the way, there is also a method in class gdaAttack() called getTableCharacteristics that returns various statistics about each of the columns, including the number of distinct UIDs, the number of distinct values, the average number of UIDs per value, and things like that. You can read more about it at:

https://gda-score.github.io/gdaScore.m.html#gdaScore.gdaAttack.getTableCharacteristics

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

The getPublicColValues() method, as currently written, rejects values whose count is less than 100.
Is it OK to use this method, or should I write something new that fetches all the records, even those with a count below 100?

Regards,
Anirban

@yoid2000
Contributor Author

yoid2000 commented Dec 17, 2018 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

Suppose, for example, that for the frequency column this query:
{select frequency, count(distinct account_id) from accounts group by frequency order by 2 desc limit 200}
gives the output:
frequency             count
POPLATEK MESICNE      4167
POPLATEK TYDNE        240
POPLATEK PO OBRATU    93

Then, for the next query described in the issue:
{select count(distinct uid) from table where col1 = val1 and col2 = val2 and ...}

would it look like this?
{select count(distinct account_id)
from accounts where frequency = 'POPLATEK MESICNE' and frequency = 'POPLATEK TYDNE' and frequency = 'POPLATEK PO OBRATU'}

Please let me know whether my understanding is correct.

Regards,
Anirban
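
As a side note on the last query above: a WHERE clause is evaluated row by row, so requiring one column to equal two different values at once can never be true, and the count is always zero. The template in the issue combines conditions on different columns (col1, col2, ...). The sketch below shows the difference on a toy in-memory SQLite table; the rows are made up and sqlite3 is only a stand-in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_id INTEGER, frequency TEXT, disp_type TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        (1, "POPLATEK MESICNE", "OWNER"),
        (2, "POPLATEK TYDNE", "OWNER"),
        (3, "POPLATEK TYDNE", "DISPONENT"),
    ],
)

# Same column, two different values: no row can satisfy both conditions.
same_column = conn.execute(
    "SELECT count(DISTINCT account_id) FROM accounts "
    "WHERE frequency = 'POPLATEK MESICNE' AND frequency = 'POPLATEK TYDNE'"
).fetchone()[0]

# Different columns: each condition narrows the set of matching users.
different_columns = conn.execute(
    "SELECT count(DISTINCT account_id) FROM accounts "
    "WHERE frequency = 'POPLATEK TYDNE' AND disp_type = 'DISPONENT'"
).fetchone()[0]

print(same_column, different_columns)
```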

@yoid2000
Contributor Author

yoid2000 commented Dec 18, 2018 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

Calling the getPublicColValues() routine gives me the output below:

{ 'acct_district_id': [(1, 554), (70, 152), (74, 135), (54, 128)],
'cli_district_id': [(1, 547), (70, 146), (74, 144), (54, 133)],
'disp_type': [('OWNER', 4500), ('DISPONENT', 869)],
'frequency': [('POPLATEK MESICNE', 4167), ('POPLATEK TYDNE', 240)]}

Before writing the query {select count(distinct uid) from table where col1 = val1 and col2 = val2 and ...}, I need some clarification, which would be easiest to get by talking in your office.

Can I stop by your office in the next few days to clarify my understanding before I proceed?

Regards,
Anirban Ghosh

@yoid2000
Contributor Author

yoid2000 commented Dec 19, 2018 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,
The actual output is below:

{ 'account_id': [],
'acct_date': [],
'acct_district_id': [(1, 554), (70, 152), (74, 135), (54, 128)],
'birth_number': [],
'cli_district_id': [(1, 547), (70, 146), (74, 144), (54, 133)],
'client_id': [],
'disp_type': [('OWNER', 4500), ('DISPONENT', 869)],
'frequency': [('POPLATEK MESICNE', 4167), ('POPLATEK TYDNE', 240)],
'lastname': []}

I added a check so that columns whose returned value is [] are skipped. I am available after 3 pm tomorrow, so I can come to your office then.

Regards,
Anirban
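
The check for empty columns, followed by the conversion of counts into probabilities, could look roughly like the sketch below. The dictionary is copied from the output above; the helper name to_probabilities is hypothetical, and the total of 5369 users is an assumption (it is simply the sum of the two disp_type counts). The real figure would come from the count(distinct uid) query.

```python
public_values = {
    "account_id": [],
    "acct_date": [],
    "acct_district_id": [(1, 554), (70, 152), (74, 135), (54, 128)],
    "birth_number": [],
    "cli_district_id": [(1, 547), (70, 146), (74, 144), (54, 133)],
    "client_id": [],
    "disp_type": [("OWNER", 4500), ("DISPONENT", 869)],
    "frequency": [("POPLATEK MESICNE", 4167), ("POPLATEK TYDNE", 240)],
    "lastname": [],
}

TOTAL_USERS = 5369  # assumed for illustration, not taken from the database


def to_probabilities(values_by_column, total_users):
    """Drop columns with no public values and turn counts into probabilities."""
    return {
        col: [(val, count / total_users) for val, count in pairs]
        for col, pairs in values_by_column.items()
        if pairs  # an empty list means nothing usable for this column
    }


probabilities = to_probabilities(public_values, TOTAL_USERS)
for col, pairs in probabilities.items():
    print(col, pairs)
```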

@yoid2000
Contributor Author

yoid2000 commented Dec 19, 2018 via email

@yoid2000
Contributor Author

I changed the parameters of getPublicColValues() so that it returns somewhat more values. Please pull the latest code from the repo and try running your code again. I'll see you this afternoon.

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

I pulled the latest code base, but I am still getting the same output. I checked the Git GUI and it shows no recent changes to the gda-score script. I wonder whether it has been updated or whether I am missing something.

Regards,
Anirban

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

A gentle reminder.

Regards,
Anirban

@yoid2000
Contributor Author

yoid2000 commented Dec 27, 2018 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

Sorry for the late response. I got new output after calling the getPublicColValues() routine in the gdaScore script.
My question now is: are the columns that return values (for example, 'acct_district_id') always the same whenever I call the routine, or will they change later if the database changes?
To put it simply, the columns that currently come back with values are:
'acct_district_id', 'cli_district_id', 'disp_type', 'frequency', 'lastname'.

If I write the logic to build the query select count(distinct uid)
from table
where col1 = val1 and col2 = val2 and ..., I need to go through the combinations of these 5 columns. If there are 6 columns in the future, the script would not adapt: it would be static and work only for these columns.

Please let me know if this is OK with you, so that I can start writing the logic for building the query.

Regards,
Anirban
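
One way to avoid hard-coding the number of columns is to generate the combinations with itertools, so the same code handles 5, 6, or any other number of columns. The sketch below is an illustration only: the function name, the thresholds, and the sample data are made up, and multiplying the probabilities assumes the columns are independent.

```python
from itertools import combinations, product


def candidate_conditions(values_by_column, total_users, low=0.5, high=1.5):
    """Find column/value combinations whose expected user count is near one.

    values_by_column maps a column name to a list of (value, user_count)
    pairs. Works for any number of columns; empty columns are skipped.
    """
    cols = [col for col, pairs in values_by_column.items() if pairs]
    found = []
    for k in range(1, len(cols) + 1):
        for col_subset in combinations(cols, k):
            value_lists = [values_by_column[col] for col in col_subset]
            for combo in product(*value_lists):
                expected = total_users
                for _, count in combo:
                    expected *= count / total_users
                if low <= expected <= high:
                    conditions = [
                        (col, val) for col, (val, _) in zip(col_subset, combo)
                    ]
                    found.append((conditions, expected))
    return found


# Hypothetical data: 1000 users and three columns.
sample = {"a": [("x", 100)], "b": [("y", 10)], "c": [("z", 200)], "d": []}
found = candidate_conditions(sample, total_users=1000)
for conditions, expected in found:
    print(conditions, expected)
```

The number of combinations grows quickly with the number of columns and values, so capping the subset size may be needed on wider tables.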

@yoid2000
Contributor Author

yoid2000 commented Jan 4, 2019 via email

@AnirbanGhosh1512
Contributor

Hello Prof. Paul,

I need a little clarification on our last discussion. Can I make a claim only if the average of the query results is greater than 1.0, or can I make a claim whatever the mean value is?

Regards,
Anirban Ghosh

@yoid2000
Contributor Author

yoid2000 commented Jan 23, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Jan 29, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Jan 29, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Jan 29, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Jan 29, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Jan 30, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Jan 30, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Jan 31, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 5, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 5, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 5, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 5, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 5, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 5, 2019 via email

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 5, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 5, 2019

When attacking the cloak, in your .json config file, you should set 'rawDb' to the raw database, and 'anonDb' to the cloak. In the configuration, 'rawDb' should always be set to the raw database, and 'anonDb' is set to whatever anonymization system you are attacking.

Then, when you use getPublicColValues, it will naturally query the raw database, and you will get the correct answers (in fact, you get exactly the same answer as before).

In other words, your attack queries will be the same no matter what system you are attacking.

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 6, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 6, 2019 via email

@AnirbanGhosh1512
Contributor

{
  "localBankingRaw": {
    "host": "db001.gda-score.org",
    "port": 5432,
    "dbname": "banking",
    "user": "[email protected]",
    "password": "Aic0phuLoo0i",
    "type": "postgres"
  },
  "cloakBankingAnon": {
    "host": "demo.aircloak.com",
    "port": 8432,
    "dbname": "gda_banking",
    "user": "[email protected]",
    "password": "anirban@123",
    "type": "aircloak"
  }
}

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 7, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 7, 2019

Did you forget to include the attachment?

@AnirbanGhosh1512
Contributor

AnirbanGhosh1512 commented Feb 7, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Feb 8, 2019

Since your emails are in fact relayed through GitHub, the attachment may have been stripped. Please just send it to me directly.
