Add files via upload
rcurty authored May 21, 2024
1 parent 6537126 commit a3d71c5
Showing 2 changed files with 337 additions and 0 deletions.
236 changes: 236 additions & 0 deletions modules/week09/southpark-sdcdemo.Rmd
@@ -0,0 +1,236 @@
---
title: "Data Anonymization with R's sdcMicro Package"
author: "Renata Goncalves Curty - UCSB Library, Research Data Services"
date: "2023-02-17"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## South Park Elementary School Data

![Our Clients](fig/southpark.png)

Mayor McDaniels and Peter Charles (aka PC Principal) are concerned that even after removing direct identifiers such as names, SSNs, and IDs, students may still be easily re-identified in the yearly assessment dataset and have their math and reading scores revealed. For example, everyone in school knows that Tolkien Williams is the wealthiest kid in the whole town, whereas Kenny and his sister Karen are from a very poor family.

They have requested our assistance to compute this risk of disclosure, implement strategies to minimize it, and determine the information loss for the anonymized dataset they would like to make public to other school board members\*. We will be using the sdcMicro package for this purpose.

In summary, our clients have three main questions for us (and none of them involve finding out who keeps killing Kenny and how come he keeps coming back to life):

*Q1. What is the level of disclosure risk associated with this dataset?*

*Q2. How can the risk of re-identification be significantly reduced?*

*Q3. What would be the utility and information loss after implementing the anonymization strategies?*

\*Caveat: We have a relatively small dataset for this exercise (in both rows and columns), so we can't strive for some of the thresholds recommended in the literature.

#### Package & Data

```{r}
library(sdcMicro)
data <- read.csv("southpark-sdc.csv")
```

#### Taking a closer look at the variables included in this dataset

```{r}
# Read the CSV dataset into a data frame
?
# Show the list of variable names
?
```

#### Data Prep - Converting variables

As we can see, we will need to convert some of the variables first.

The stu_id, ssn, name, and dob variables will soon be removed from the dataset, as they are direct identifiers.

Let's focus on the remaining ones that should be converted before we can proceed.

```{r}
fname = "southpark-sdc.csv"
file <- read.csv(fname)
file <- varToFactor(obj=file, var=c("zip","age", "sex","race","ethn", "snap", "income", "learn_dis","phys_dis"))
#Convert to numeric math_sc and read_sc
?
```

#### Q1. What is the level of disclosure risk associated with this dataset?

To answer this question, we have to set up an SDC problem. In other words, we must select variables and create an object of class *sdcMicroObj* for the SDC process in *R*.

```{r}
# Select variables for creating sdcMicro object
# All variable names should correspond to the names in the data file
# select categorical key variables - aka quasi-identifiers
sdcInitial <- createSdcObj(dat=file,
                           keyVars=c(?),
                           numVars=c(?),
                           weightVar=NULL,
                           hhId=NULL,
                           strataVar=NULL,
                           pramVars=NULL,
                           excludeVars=c(?),
                           seed=0,
                           randomizeRecords=FALSE,
                           alpha=c(1))
# Summary of object
?
```

What about the stu_id? Why are we keeping it?

Check the results below, and the number of observations that violate 2-5 anonymity. What does that mean?
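
To make the idea of a k-anonymity violation concrete, here is a minimal base R sketch on a made-up toy table. The data frame `toy`, its values, and the helper names are assumptions for illustration only; this is not how sdcMicro computes its summary internally. A record violates k-anonymity when fewer than k records (itself included) share its combination of key variables.

```r
# Toy data with hypothetical values (not the workshop file)
toy <- data.frame(
  zip = c("80220", "80220", "80220", "80221", "80221", "80222"),
  sex = c("Male", "Male", "Female", "Male", "Male", "Female"),
  age = c(10, 10, 9, 11, 11, 9)
)
key_vars <- c("zip", "sex", "age")

# Build one label per record from its key-variable combination
combo <- do.call(paste, c(toy[key_vars], sep = "|"))

# fk: how many records share each record's combination
fk <- ave(rep(1L, nrow(toy)), combo, FUN = length)

# Number of records violating k-anonymity for k = 2 and k = 3
violations_k2 <- sum(fk < 2)
violations_k3 <- sum(fk < 3)
```

In this toy table, two records are unique on their key variables, so they violate 2-anonymity, and no combination is shared by three or more records, so every record violates 3-anonymity.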

##### Time to calculate the risk of re-identification for the entire dataset

```{r}
# The threshold depends on the size of the dataset and the access control (a conservative value for large surveys is 0.04)
?
```

Was it good?

Let's see if we can get that below 15% with k = 5.

We have to get some work done to reduce that. But that would be the first answer to our clients.

We can inspect this issue a little further before moving to the second question.
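
As a rough intuition for where a global risk figure can come from, the sketch below uses a common simplification: with no sampling weights, each record's risk is taken as 1 divided by the number of records sharing its key-variable combination, and the global risk is the average of those values. The frequency counts are made up, and sdcMicro's own estimate may differ in detail, so treat this as an illustration rather than the package's method.

```r
# Hypothetical frequency counts (fk) for six records: how many records
# share each record's combination of key variables
fk <- c(2, 2, 1, 2, 2, 1)

# Assumption: no sampling weights, so individual risk is taken as 1 / fk
indiv_risk <- 1 / fk

# Global risk as the mean individual risk, and the expected number of
# re-identifications as the sum
global_risk <- mean(indiv_risk)
expected_reid <- sum(indiv_risk)

# Positions of records whose risk exceeds an illustrative cutoff of 0.5
high_risk <- which(indiv_risk > 0.5)
```

Records that are unique on their key variables (fk = 1) get a risk of 1 under this simplification, which is why unique combinations dominate the global figure in small datasets.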

##### Which observations/subjects are at higher risk of being re-identified?

```{r}
```

##### How many records share each record's combination of key variables?

```{r}
#Categorical variable risk
#Frequency of the particular combination of key variables (quasi-identifiers) for each record in the sample
?
```

#### Q2. How can the risk of re-identification be significantly reduced?

We learned that there are different techniques to de-identify and anonymize datasets.

First, let's use some non-perturbative methods such as global recoding and top and bottom coding techniques.

*Income*

As mentioned before, the household income of some students may pose a risk to their privacy in this dataset. So let's see if using top and bottom recoding could help reduce that risk.

```{r}
# Frequencies of income before recoding
table(sdcInitial@manipKeyVars$income)
```

```{r}
## Recode variable income (top coding)
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("200,000-249,999","500,000+"), after=c("200,000+"))
## Recode variable income (bottom coding)
sdcInitial <- groupAndRename(obj= sdcInitial, var= c("income"), before=c("10,000-24,999","75,000-99,999"), after=c("10,000-99,999"))
```
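
The grouping done in the chunk above can be pictured in base R as collapsing factor levels. The sketch below works on a small made-up `income` factor, not on the sdcMicro object, and only illustrates the idea of top coding: several high categories are merged into one open-ended category so that the wealthiest households no longer stand out.

```r
# Hypothetical income categories for five students
income <- factor(c("10,000-24,999", "75,000-99,999", "200,000-249,999",
                   "500,000+", "200,000-249,999"))

# Top coding: merge the two highest brackets into a single "200,000+" level
top_levels <- c("200,000-249,999", "500,000+")
levels(income)[levels(income) %in% top_levels] <- "200,000+"

# Frequencies after recoding
table(income)
```

Assigning the same name to several levels merges them, so the single "500,000+" household is now indistinguishable from the "200,000-249,999" ones.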

*Age*

```{r}
# Frequencies of age before recoding
?
```

```{r}
#Recode Age (top and bottom)
?
```

##### **Note: Undoing things**

```{r}
# Important note: If the results are reassigned to the same sdcMicro object, it is possible to undo the last step in the SDC process using:
# sdcInitial <- undolast(sdcInitial)
# This can be helpful when tuning parameters. The results of the last step, however, will be lost after undoing that step.
# We can also choose to assign results to a new sdcMicro object, using:
# sdc1 <- functionName(sdcInitial), especially if you anticipate creating multiple SDC problems to test out. Otherwise, you can delete the object and re-run the code when needed.
```

Let's see if those steps lowered the risk of re-identification of subjects.

```{r}
?
```

Only a tiny improvement compared to the original dataset. Let's try something else.

##### Time for a more powerful technique. Let's use the k-anonymization function!

```{r}
#Local suppression to obtain k-anonymity
?
# We set the parameters so that at least 5 observations share the same attributes in the dataset.
# Alternatively, we could have set the order of importance for each key variable
#sdcInitial <- kAnon(sdcInitial, importance=c(9,5,6,7,8,4,3,1,2), k=c(5))
```

More on importance (pg. 50): <https://cran.r-project.org/web/packages/sdcMicro/sdcMicro.pdf>

Time to check it again:

```{r}
?
```

Alright! We managed to lower the risk of identification from 81% to about 10%, and now we have 0 observations violating 5-anonymity! We can tell our clients we used some recoding, but suppression via k-anonymity was necessary to improve the privacy level of this dataset.

#### Q3. What would be the utility and information loss after implementing anonymization strategies?

##### Time to measure the utility and information loss for the anonymized dataset.

```{r}
#First we retrieve the total number of suppressions for each categorical key variable
?
```

```{r}
#We can also compare the number of NAs before and after our interventions
# Store the names of all categorical key variables in a vector
namesKeyVars <- names(sdcInitial@manipKeyVars)
# Matrix to store the number of missing values (NA) before and after anonymization
NAcount <- matrix(NA, nrow = 2, ncol = length(namesKeyVars))
colnames(NAcount) <- c(paste0('NA', namesKeyVars)) # column names
rownames(NAcount) <- c('initial', 'treated') # row names
# NA count in all key variables (NOTE: only those coded NA are counted)
for (i in seq_along(namesKeyVars)) {
  NAcount[1, i] <- sum(is.na(sdcInitial@origData[, namesKeyVars[i]]))
  NAcount[2, i] <- sum(is.na(sdcInitial@manipKeyVars[, i]))
}
# Show results
NAcount
```

Based on the results, we can tell PC Principal and the Mayor that the suppression greatly reduced the level of detail about the income and the race of the students. We could continue by removing other less relevant variables, exploring other functions in this package, or even considering different ways of recoding those variables. But let's call it a day, and export the anonymized dataset we produced.

##### Creating a new random number to replace the student ID

```{r}
## Adding a new randomized ID-variable
?
```
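
Conceptually, this step swaps a meaningful identifier for an arbitrary one. The base R sketch below shows the idea on a made-up three-row table; the column names, values, and seed are assumptions for illustration, and it is independent of whichever sdcMicro function is used in the chunk above.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical records with a made-up student ID
toy <- data.frame(stu_id = c(1001, 1002, 1003),
                  math_sc = c(299, 209, 200))

# Assign each record a random position-based ID, then drop the original
toy$new_id <- sample(nrow(toy))
toy$stu_id <- NULL
```

The new ID still lets us tell records apart, but it carries no link back to the school's student numbers.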

##### Exporting the anonymized dataset

```{r}
writeSafeFile(obj=sdcInitial, format="csv", randomizeRecords="no", col.names=TRUE, sep=",", dec=".", fileOut="southpark-anon.csv")
```
101 changes: 101 additions & 0 deletions modules/week09/southpark-sdcdemo.csv
@@ -0,0 +1,101 @@
zip,stu_id,ssn,name,dob,age,sex,race,ethn,snap,income,learn_dis,phys_dis,math_sc,read_sc
80220,8206630976,998126245,Stan Marsh,10/19/2012,10,Male,White,Non-hispanic,0,"200,000-249,999",0,0,299,300
80220,6555504757,807281100,Kyle Broflovski,05/26/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,209,209
80220,5737953702,890807948,Kenny McCormick,03/12/2011,11,Male,White,Non-hispanic,1,"10,000-24,999",0,0,200,201
80220,5705942436,991920659,Eric Cartman,07/01/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,211,215
80220,2809004240,921479968,Butters Scotch,11/11/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,224,230
80220,4486369132,804989533,Clyde Donovan,04/10/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,213,227
80220,4038126650,854569146,Wendy Testaburger,12/04/2013,9,Female,White,Non-hispanic,0,"100,000-149,999",0,0,204,210
80221,6008064113,761499326,Bebe Stevens,01/01/2013,10,Female,White,Non-hispanic,0,"75,000-99,999",0,0,202,214
80220,8307803951,925072083,Tolkien Williams,05/25/2012,10,Male,Black,Non-hispanic,0,"500,000+",0,0,202,222
80220,3787379332,772439783,Timmy Burch,11/25/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",1,1,205,225
80221,6685370248,693123835,Jimmy Valmer,06/20/2011,11,Male,White,Non-hispanic,0,"75,000-99,999",0,1,211,206
80221,6730800673,947344677,Craig Tucker,03/05/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,205,226
80220,7961994919,795573368,Tweek Tweak,09/08/2013,9,Male,White,Non-hispanic,0,"75,000-99,999",1,0,225,190
80220,4109750140,784443358,Karen McCormick,01/31/2014,9,Female,White,Non-hispanic,1,"10,000-24,999",0,0,220,208
80222,5809626852,727780211,Scott Malkinson,02/28/2013,9,Male,White,Non-hispanic,0,"75,000-99,999",0,0,310,280
80220,3022294345,931425223,Kevin Stoley,06/02/2013,9,Male,White,Non-hispanic,0,"75,000-99,999",0,0,217,203
80221,2503282093,923511748,Ike Broflovksi,10/14/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,225,227
80222,3120649456,915859337,Firkle Smith,12/16/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,223,214
80222,8281247724,747094897,Pete Thelman,10/24/2013,9,Male,White,Non-hispanic,0,"100,000-149,999",0,0,204,208
80220,7009901765,731342745,Bradley Biggle,02/13/2013,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,220,200
80222,3454129258,950392557,Charlotte Knobs,05/10/2013,9,Female,White,Non-hispanic,0,"75,000-99,999",1,0,205,209
80220,6940797462,712886703,Jenny Simons,03/17/2012,10,Female,White,Non-hispanic,1,"100,000-149,999",0,0,215,215
80221,6498370605,730143577,Sophie Gray,11/25/2012,10,Female,White,Non-hispanic,0,"200,000-249,999",0,0,209,205
80220,7411380937,745820080,Damien Thorn,08/01/2013,9,Male,White,Non-hispanic,0,"100,000-149,999",0,0,204,223
80221,4858260462,889675717,Jason White,08/09/2013,9,Male,White,Non-hispanic,0,"75,000-99,999",0,0,214,226
80221,3179780954,826397725,David Rodriguez,12/14/2013,9,Male,White,Hispanic,0,"100,000-149,999",0,0,208,229
80220,5414029866,742465554,Red McArthu,09/09/2013,9,Female,White,Non-hispanic,0,"100,000-149,999",0,1,221,202
80221,8032142324,676029102,Sally Turner,04/05/2012,10,Female,White,Non-hispanic,0,"200,000-249,999",0,0,206,229
80220,6371437335,861512602,Allie Nelson,02/25/2012,10,Female,White,Non-hispanic,0,"200,000-249,999",0,0,222,229
80222,2441678663,981476167,Kelly-Ann Barlow,04/18/2012,10,Female,White,Non-hispanic,0,"200,000-249,999",0,0,224,215
80220,2946755760,817924686,Larry Feegan,03/20/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,200,221
80220,6597334829,914272742,Shelly Marsh,01/09/2010,13,Female,White,Non-hispanic,0,"100,000-149,999",0,0,219,229
80221,8687651665,992929130,Kay Chi,02/13/2015,8,Male,Asian,Non-hispanic,0,"75,000-99,999",0,0,218,202
80222,7044694117,861295557,Lee Roberts,09/03/2013,9,Male,Asian,Non-hispanic,0,"100,000-149,999",0,0,217,225
80222,2383266993,890054995,Donna Base,07/15/2012,10,Female,Black,Non-hispanic,0,"200,000-249,999",0,0,214,207
80220,5842799162,809048300,Rose River,05/02/2011,11,Female,White,Non-hispanic,0,"100,000-149,999",0,0,219,220
80222,5511548259,874008998,George Kuala,08/12/2012,10,Male,NA,Non-hispanic,0,"75,000-99,999",1,0,211,214
80221,8777913067,731193948,Jamal Campos,04/04/2012,10,Male,White,Hispanic,0,"200,000-249,999",0,0,202,226
80221,8721700078,680534365,Henry Fords,03/02/2011,11,Male,White,Non-hispanic,0,"100,000-149,999",1,0,211,217
80221,6139178090,676782088,Amelia Papimidous,10/02/2012,10,Female,White,Non-hispanic,0,"100,000-149,999",0,0,208,228
80220,8065237237,893247941,Tom Battle,03/05/2011,11,Male,White,Non-hispanic,0,"100,000-149,999",0,0,221,207
80222,4139401352,720273141,Fatima Ali,05/24/2011,11,Female,White,Non-hispanic,0,"75,000-99,999",1,1,202,225
80220,6118666721,890255724,Ahmed Khan,09/06/2011,11,Male,Asian,Non-hispanic,0,"75,000-99,999",0,0,205,210
80221,6482073290,902205262,Maria Rodriguez,12/15/2011,11,Female,White,Hispanic,0,"100,000-149,999",0,0,200,215
80220,5103331881,801617194,Kim Lee,02/08/2012,11,Female,Asian,Non-hispanic,0,"75,000-99,999",0,0,215,207
80220,5662616805,946404891,Thomas Smith,06/29/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,200,205
80221,4026604192,736706846,Jasmine Davis,11/18/2012,10,Female,Black,Non-hispanic,0,"75,000-99,999",0,1,222,206
80220,5114724288,951436962,Ahmed Mohammed,01/26/2013,10,Male,White,Non-hispanic,1,"100,000-149,999",1,0,218,207
80222,3604059408,974991555,Sofia García,04/12/2013,9,Female,White,Non-hispanic,0,"75,000-99,999",0,0,224,223
80220,8309202481,782831373,Wei Chen,07/23/2013,9,Male,Asian,Non-hispanic,0,"100,000-149,999",1,0,205,200
80220,8572729673,840029061,David Brown,10/05/2011,11,Male,White,Non-hispanic,0,"75,000-99,999",0,0,207,217
80221,2358285712,947687323,Aaliyah Jackson,01/14/2012,11,Female,Black,Non-hispanic,0,"100,000-149,999",0,0,216,216
80220,6039967434,725131243,Omar Hassan,03/27/2012,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,213,228
80221,7152325605,853935170,Isabella Sanchez,06/17/2012,10,Female,White,Hispanic,0,"75,000-99,999",0,0,203,217
80221,3237419728,679200682,Min Lee,09/08/2012,10,Female,Asian,Non-hispanic,0,"100,000-149,999",0,0,207,205
80222,6408366982,962422496,Matthew Taylor,12/21/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",1,0,200,220
80221,6075761054,983690557,Nia Wilson,02/19/2013,9,Female,White,Non-hispanic,0,"100,000-149,999",1,0,220,229
80222,3362479883,814225395,Tariq Ahmed,05/10/2013,9,Male,White,Non-hispanic,0,"100,000-149,999",0,0,203,226
80222,2271343893,985949515,Juan Gonzalez,08/25/2013,9,Male,White,Hispanic,0,"100,000-149,999",0,0,213,206
80220,7915099255,688180331,Yuna Kim,11/07/2011,11,Female,Asian,Non-hispanic,0,"200,000-249,999",0,0,201,219
80222,5693148881,751760341,William Jones,01/20/2012,11,Male,White,Non-hispanic,0,"75,000-99,999",0,0,214,227
80220,6503772597,941474697,Leah Harris,04/08/2012,10,Female,White,Non-hispanic,0,"75,000-99,999",0,0,225,204
80221,4880530272,850574744,Muhammad Ali,07/02/2012,10,Male,White,Non-hispanic,0,"200,000-249,999",0,0,212,214
80222,7764246104,682988160,Rosa Martinez,09/15/2012,10,Female,White,Hispanic,0,"200,000-249,999",0,0,208,220
80222,6436583685,869234973,Jie Zhang,12/30/2012,10,Female,Asian,Non-hispanic,0,"75,000-99,999",0,0,203,221
80221,2353134813,728912982,Benjamin Wilson,02/12/2013,10,Male,White,Non-hispanic,0,"75,000-99,999",0,0,202,223
80222,8006712610,980850357,Madison Johnson,05/03/2013,9,Female,Black,Non-hispanic,0,"75,000-99,999",0,0,217,214
80220,7952143247,910514155,Samir Bakr,08/16/2013,9,Male,White,Non-hispanic,0,"75,000-99,999",0,0,207,230
80222,5709468005,913828533,Ana Maria,11/05/2012,10,Female,White,Hispanic,0,"75,000-99,999",0,0,222,226
80222,6867660951,992438714,Kenji Nakamura,01/24/2013,10,Male,Asian,Non-hispanic,0,"75,000-99,999",0,0,223,201
80221,3870154420,815924622,Andrew Johnson,04/14/2013,9,Male,White,Non-hispanic,1,"200,000-249,999",0,0,211,214
80220,7797506673,844679624,Destiny Wilson,07/08/2013,9,Female,Black,Non-hispanic,0,"200,000-249,999",0,0,204,221
80220,2494395308,722181895,Tarek Farouk,10/22/2011,11,Male,White,Non-hispanic,0,"200,000-249,999",0,0,223,207
80221,8546373623,850279771,Carlos Hernandez,12/10/2011,11,Male,White,Hispanic,0,"200,000-249,999",0,0,201,204
80222,8656992107,718288946,Min-ji Park,03/05/2012,10,Female,Asian,Non-hispanic,0,"200,000-249,999",0,0,206,227
80222,6958413848,962473136,Jacob Smith,06/19/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,218,208
80222,5523349842,801869852,Shayla Adams,09/06/2012,10,Female,Black,Non-hispanic,0,"100,000-149,999",0,0,200,201
80221,6595472746,951090549,Ahmed Ibrahim,12/01/2012,10,Male,White,Non-hispanic,0,"200,000-249,999",0,0,203,226
80220,7489643955,915131594,Isabel Rodriguez,02/28/2013,9,Female,White,Hispanic,0,"100,000-149,999",0,0,219,214
80220,2823018885,680803261,Xian Chen,05/16/2013,9,Male,Asian,Non-hispanic,0,"75,000-99,999",0,1,210,216
80221,7451127312,773234402,John Brown,08/31/2013,9,Male,White,Non-hispanic,0,"200,000-249,999",0,0,211,210
80220,7677217980,855443490,Amara Jones,11/17/2011,11,Female,Black,Non-hispanic,0,"200,000-249,999",0,0,213,214
80221,4783679758,902383995,Saeed Al-Saud,02/03/2012,11,Male,NA,Non-hispanic,0,"75,000-99,999",0,0,215,230
80221,2817826392,702406777,Juanita Lopez,04/23/2012,10,Female,White,Non-hispanic,0,"100,000-149,999",0,0,219,218
80221,7654249506,943837518,Yuyu Lee,07/15/2012,10,Female,Asian,Non-hispanic,0,"200,000-249,999",0,0,213,213
80220,2935358711,848394421,Benjamin Turner,10/08/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",1,0,201,219
80220,4934388712,993842853,Lauren Wilson,12/25/2012,10,Female,Black,Non-hispanic,0,"100,000-149,999",1,0,205,203
80220,5360939504,988758591,Tariq Mustafa,03/09/2013,9,Male,NA,Non-hispanic,0,"200,000-249,999",0,0,213,220
80221,7221658825,842610748,Carlos Martinez,06/01/2013,9,Male,White,Non-hispanic,0,"100,000-149,999",0,0,200,204
80222,2777782104,677272490,Tomomi Nakamura,08/20/2013,9,Female,Asian,Non-hispanic,0,"200,000-249,999",0,0,210,214
80220,2657793427,927243456,Michael Davis,11/12/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,224,203
80220,3246294874,841276165,Alberta Lowe,06/02/2011,11,Female,White,Non-hispanic,0,"100,000-149,999",0,0,211,219
80222,2522484954,692145969,Macy Brown,10/14/2013,9,Female,Black,Non-hispanic,0,"100,000-149,999",0,0,205,207
80222,6226479455,769066928,Joe Baker,12/16/2012,10,Male,White,Non-hispanic,0,"200,000-249,999",0,0,212,216
80220,8567366395,929068451,Samuel Tetris,10/24/2012,10,Male,White,Non-hispanic,0,"100,000-149,999",0,0,215,215
80220,3511819696,845874181,Paul Stage,02/13/2011,12,Male,White,Non-hispanic,0,"100,000-149,999",0,0,215,228
80221,8430191132,709355630,Lee King,05/10/2012,10,Male,Asian,Non-hispanic,0,"200,000-249,999",0,0,208,208
80220,4516023907,719087380,Benjamin Power,03/03/2012,10,Male,White,Non-hispanic,0,"200,000-249,999",0,0,213,210
80220,7032553463,723535202,Debra Smith,06/06/2012,10,Female,White,Non-hispanic,0,"100,000-149,999",0,0,201,217
80220,4986378999,720104857,Morgan Ellins,10/02/2011,11,Female,White,Non-hispanic,0,"100,000-149,999",0,0,209,213
