This project provides a generic Java framework for the efficient implementation of globally optimal full-domain anonymity algorithms. It implements several optimizations that require monotonic generalization hierarchies and monotonic metrics for information loss. It also provides an implementation of Flash, a highly efficient algorithm that implements a novel strategy, fully exploits the implementation framework and offers stable execution times. Currently, the k-anonymity and l-diversity anonymity methods are implemented. The ARX framework is copyright (C) 2012 Florian Kohlmayer and Fabian Prasser.
This document provides a brief overview of the ARX API. It covers loading data, defining data transformations, altering and manipulating data, and processing the results of the algorithm. More detailed examples are provided in the "examples" package. Javadoc documentation is available in the "doc" folder.
1. Defining input data
**********************
The class "Data" offers different means to provide data to the ARX framework. This currently includes loading data from a CSV file, or reading data from input streams, iterators, lists or arrays. Data can also be defined manually.
Manually defining input data:
DefaultData data = Data.create();
data.add("age", "gender", "zipcode");
data.add("34", "male", "81667");
Loading input data from a file:
Data data = Data.create("data.csv", ';');
Further methods:
Data.create(File file, char separator);
Data.create(InputStream stream, char separator);
Data.create(Iterator<String[]> iterator);
Data.create(List<String[]> list);
Data.create(String[][] array);
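The list-based and array-based variants above expect one String[] per row, with the header as the first row. As a minimal sketch of how such rows could be prepared from separator-delimited text using only the JDK (the class CsvRowsSketch and its method are hypothetical helpers, not part of ARX, and the sketch ignores quoting and escaping):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative helper (an assumption for this document, not part of ARX):
// turns separator-delimited text into rows of String[].
final class CsvRowsSketch {

    /** Splits each non-empty line at the separator; the first row is the header. */
    static List<String[]> parse(String text, char separator) {
        // Pattern.quote keeps separators such as '|' from being read as regex syntax.
        String regex = Pattern.quote(String.valueOf(separator));
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.isEmpty()) continue;
            // The limit -1 preserves trailing empty fields.
            rows.add(line.split(regex, -1));
        }
        return rows;
    }
}
```

The resulting list has the shape accepted by Data.create(List<String[]> list).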
2. Defining the transformation
******************************
The transformation is defined in terms of attribute types via the DataDefinition object, which can be retrieved for a Data object by calling data.getDefinition(). The framework distinguishes between four kinds of attributes, which are encapsulated in the class AttributeType. Insensitive attributes are kept as is, directly identifying attributes are removed from the dataset, quasi-identifying attributes are transformed by applying the provided generalization hierarchies, and sensitive attributes are kept as is and can be used to generate l-diverse or t-close transformations.
Defining attribute types:
data.getDefinition().setAttributeType("age", AttributeType.IDENTIFYING_ATTRIBUTE);
data.getDefinition().setAttributeType("gender", AttributeType.SENSITIVE_ATTRIBUTE);
data.getDefinition().setAttributeType("zipcode", AttributeType.INSENSITIVE_ATTRIBUTE);
Generalization hierarchies extend the class AttributeType and can therefore be assigned in the same way as the other types of attributes. Generalization hierarchies can be defined in the same way as Data objects.
Manually defining a generalization hierarchy:
DefaultHierarchy hierarchy = Hierarchy.create();
hierarchy.add("81667", "8166*", "816**", "81***", "8****", "*****");
hierarchy.add("81675", "8167*", "816**", "81***", "8****", "*****");
Loading a hierarchy from a file:
Hierarchy hierarchy = Hierarchy.create("hierarchy.csv", ';');
Further methods:
Hierarchy.create(File file, char separator);
Hierarchy.create(InputStream stream, char separator);
Hierarchy.create(Iterator<String[]> iterator);
Hierarchy.create(List<String[]> list);
Hierarchy.create(String[][] array);
Adding a generalization hierarchy to the data definition:
data.getDefinition().setAttributeType("zipcode", hierarchy);
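Each hierarchy row lists a value followed by its increasingly general representations, as in the zipcode example above. The following sketch shows one way such rows could be generated by masking trailing characters. MaskingHierarchySketch is a hypothetical helper written for this document, not an ARX class, and masking is only one of many possible generalization schemes.

```java
// Illustrative helper (an assumption, not part of ARX): builds hierarchy rows
// by replacing trailing characters with '*', one more per level.
final class MaskingHierarchySketch {

    /** Returns the value followed by versions with 1..n trailing characters masked. */
    static String[] row(String value) {
        String[] levels = new String[value.length() + 1];
        char[] chars = value.toCharArray();
        levels[0] = value; // level 0 is the original value
        for (int level = 1; level <= value.length(); level++) {
            chars[value.length() - level] = '*';
            levels[level] = new String(chars);
        }
        return levels;
    }

    /** Builds the full table, one row per input value. */
    static String[][] table(String... values) {
        String[][] table = new String[values.length][];
        for (int i = 0; i < values.length; i++) {
            table[i] = row(values[i]);
        }
        return table;
    }
}
```

Calling MaskingHierarchySketch.table("81667", "81675") yields the two rows shown in the manual example, in the array shape accepted by Hierarchy.create(String[][] array).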
Further properties can be set via the class DataDefinition. These include the minimum and maximum levels of generalization that should be applied to a quasi-identifier. For example, a quasi-identifier can be restricted to levels 2 through 4 of its generalization hierarchy via:
data.getDefinition().setMinimumGeneralization("age", 2);
data.getDefinition().setMaximumGeneralization("age", 4);
Note: These properties are optional. By default, the transformation may use every level of the available generalization hierarchies.
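Restricting the generalization levels shrinks the set of candidate transformations. Assuming the usual full-domain model, in which a transformation is one generalization level per quasi-identifier, the candidates within the given bounds can be enumerated as sketched below. LevelBoundsSketch is a hypothetical class for illustration and says nothing about how ARX represents its search space internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (an assumption, not ARX code): enumerates all level
// combinations that respect per-attribute minimum and maximum levels.
final class LevelBoundsSketch {

    /** Returns every vector with min[i] <= level[i] <= max[i]. */
    static List<int[]> enumerate(int[] min, int[] max) {
        List<int[]> result = new ArrayList<>();
        int[] current = min.clone();
        while (true) {
            result.add(current.clone());
            // Odometer-style increment: reset exhausted positions, bump the next one.
            int i = 0;
            while (i < current.length && current[i] == max[i]) {
                current[i] = min[i];
                i++;
            }
            if (i == current.length) {
                return result; // every position was at its maximum
            }
            current[i]++;
        }
    }
}
```

With "age" limited to levels 2 through 4 and a second, hypothetical attribute with levels 0 and 1, enumerate(new int[] {2, 0}, new int[] {4, 1}) produces 3 * 2 = 6 candidates.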
The class DataDefinition also provides means to specify a data type for each column. These data types are encapsulated in the class DataType. Currently "String", "Decimal" and "Date" are implemented. The data types are used for sorting the data accordingly (see section 5). All transformed data is automatically treated as a string in the output representation, because a generalization of a decimal or date is very unlikely to preserve the data type (e.g., "8881" -> "88**").
data.getDefinition().setDataType("age", DataType.DECIMAL);
data.getDefinition().setDataType("date-of-birth", DataType.DATE("dd.MM.yyyy"));
data.getDefinition().setDataType("city", DataType.STRING);
Note: There is no need to specify a data type. By default, everything is treated as a string. Defining a data type will yield a more natural order when sorting data, though (see section 5).
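The effect of a data type on the sort order can be illustrated with plain JDK code. The sketch below compares lexicographic ordering with numeric ordering. SortOrderSketch is a hypothetical class and does not show how ARX sorts internally.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch (not ARX code): string order versus decimal order.
final class SortOrderSketch {

    /** Sorts a copy of the values lexicographically, as plain strings. */
    static String[] sortAsStrings(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts a copy of the values by their numeric value. */
    static String[] sortAsDecimals(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy, Comparator.comparing((String s) -> new BigDecimal(s)));
        return copy;
    }
}
```

For the values "9", "10" and "34", the string order is "10", "34", "9", whereas the decimal order is "9", "10", "34".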
The class DataDefinition also provides means to access information about the defined data transformation, such as the types of attributes, the height of the generalization hierarchies etc.
3. Defining the parameters of the anonymization algorithm
*********************************************************
All further parameters of the anonymization algorithm are defined in an instance of the class ARXAnonymizer. This includes general parameters such as the string inserted for suppressed values, the type of suppression and algorithm-specific parameters such as the snapshot size or the maximum size of the history. The algorithm-specific parameters are only performance-related and will not influence the actual result. Furthermore, it is possible to define the metric. The ARX framework currently supports the following metrics, which are derived from the class AbstractMetric:
Height: anonymizer.setMetric(new MetricHeight());
Precision: anonymizer.setMetric(new MetricPrecision());
Monotonic Discernibility: anonymizer.setMetric(new MetricDMStar());
Non-Uniform Entropy: anonymizer.setMetric(new MetricEntropy());
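To give an intuition for the two simplest metrics, the sketch below computes a height-style and a precision-style information loss from the generalization levels of a transformation. It follows common textbook definitions and is an assumption made for illustration. MetricSketch is a hypothetical class, and the details of MetricHeight and MetricPrecision in ARX may differ.

```java
// Illustrative sketch (an assumption, not ARX's MetricHeight or MetricPrecision).
final class MetricSketch {

    /** Height-style loss: the sum of the generalization levels of all quasi-identifiers. */
    static int height(int[] levels) {
        int sum = 0;
        for (int level : levels) {
            sum += level;
        }
        return sum;
    }

    /**
     * Precision-style loss: the mean of level / maximum level over all quasi-identifiers.
     * hierarchyHeights[i] is the number of levels of hierarchy i, so its maximum level
     * is hierarchyHeights[i] - 1. Hierarchies with a single level contribute no loss.
     */
    static double precisionLoss(int[] levels, int[] hierarchyHeights) {
        double sum = 0;
        for (int i = 0; i < levels.length; i++) {
            int maxLevel = hierarchyHeights[i] - 1;
            if (maxLevel > 0) {
                sum += (double) levels[i] / maxLevel;
            }
        }
        return levels.length == 0 ? 0 : sum / levels.length;
    }
}
```

Both functions are monotonic in the sense used in this document: raising the level of any attribute never decreases the computed loss.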
4. Executing the algorithm
**************************
The ARX framework currently supports a basic k-anonymity algorithm and recursive-(c,l)-diversity. For a given Data object, the algorithms can be executed by calling:
K-Anonymity: anonymizer.kAnonymize(data, k, suppressionRate);
L-Diversity: anonymizer.lDiversify(data, c, l, suppressionRate);
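As a reminder of what the k-anonymity call checks, the following sketch tests whether a set of already generalized rows is k-anonymous when a bounded share of rows may be suppressed. It assumes that the rows contain only the quasi-identifying columns and that suppressionRate is a fraction between 0 and 1. KAnonymitySketch is a hypothetical class, not the ARX implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (an assumption, not the ARX algorithm): a direct
// k-anonymity test on rows that contain only quasi-identifier values.
final class KAnonymitySketch {

    /**
     * Returns true if all rows that fall into groups smaller than k could be
     * suppressed without exceeding the given share of the dataset.
     */
    static boolean isKAnonymous(String[][] rows, int k, double suppressionRate) {
        // Count how often each combination of quasi-identifier values occurs.
        Map<String, Integer> groupSizes = new HashMap<>();
        for (String[] row : rows) {
            groupSizes.merge(String.join("\u0000", row), 1, Integer::sum);
        }
        // Rows in groups smaller than k are outliers and would have to be suppressed.
        int outliers = 0;
        for (int size : groupSizes.values()) {
            if (size < k) {
                outliers += size;
            }
        }
        return outliers <= suppressionRate * rows.length;
    }
}
```

A framework that searches for a globally optimal transformation applies a test of this kind to candidate transformations rather than to a single fixed table.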
Both kAnonymize and lDiversify return an instance of the class ARXResult. This class provides access to the generalization lattice that has been searched by the algorithm. It also reports whether a k-anonymous transformation has been found and provides access to the globally optimal node. An instance of the class ARXNode provides the following information:
node.getMinimumInformationLoss(): Returns a lower bound for the information loss of the transformation defined by this node.
node.getMaximumInformationLoss(): Returns an upper bound for the information loss of the transformation defined by this node.
node.getPredecessors(): Returns all predecessors.
node.getSuccessors(): Returns all successors.
node.isAnonymous(): Returns whether the defined transformation is anonymous.
node.getGeneralization(String attribute): Returns the generalization defined for the given quasi-identifier.
Lower and upper bounds for information loss are provided because the ARX framework does not compute the information loss for all nodes, as it prunes parts of the search space. If a node has not been checked explicitly, its information loss lies between the returned bounds. A node that has been checked explicitly (which is always the case for the global optimum) returns the same value for the lower and the upper bound.
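The predecessor and successor relations can be pictured on level vectors. Assuming the usual full-domain lattice, in which a node holds one generalization level per quasi-identifier, a successor raises exactly one level by one and a predecessor lowers exactly one level by one. LatticeNeighborsSketch is a hypothetical class for illustration, not the ARXNode implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (an assumption, not ARXNode): direct neighbors of a node
// in a lattice of generalization-level vectors.
final class LatticeNeighborsSketch {

    /** Nodes reached by generalizing exactly one attribute by one level. */
    static List<int[]> successors(int[] levels, int[] maxLevels) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < levels.length; i++) {
            if (levels[i] < maxLevels[i]) {
                int[] next = levels.clone();
                next[i]++;
                result.add(next);
            }
        }
        return result;
    }

    /** Nodes reached by specializing exactly one attribute by one level. */
    static List<int[]> predecessors(int[] levels) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < levels.length; i++) {
            if (levels[i] > 0) {
                int[] previous = levels.clone();
                previous[i]--;
                result.add(previous);
            }
        }
        return result;
    }
}
```

With a monotonic metric, a node loses at least as much information as each of its predecessors and at most as much as each of its successors, which is one way bounds for unchecked nodes can arise.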
Note that the l-diversity algorithm currently supports only one sensitive attribute.
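For a single sensitive attribute, recursive-(c,l)-diversity is commonly defined per group of indistinguishable rows: if r_1 >= r_2 >= ... >= r_m are the frequencies of the distinct sensitive values in the group, the group is recursive-(c,l)-diverse if r_1 < c * (r_l + r_(l+1) + ... + r_m). The sketch below implements this textbook condition for one group. RecursiveCLDiversitySketch is a hypothetical class and not the ARX implementation.

```java
import java.util.Arrays;

// Illustrative sketch (an assumption, not ARX code): the recursive-(c,l)-diversity
// condition for one group, given the frequency of each distinct sensitive value.
final class RecursiveCLDiversitySketch {

    /** Expects l >= 1 and one frequency per distinct sensitive value. */
    static boolean isRecursiveCLDiverse(int[] frequencies, double c, int l) {
        int m = frequencies.length;
        if (m < l) {
            return false; // fewer than l distinct sensitive values
        }
        int[] sorted = frequencies.clone();
        Arrays.sort(sorted); // ascending, so r_1 is sorted[m - 1]
        // r_l ... r_m are the m - l + 1 smallest frequencies: sorted[0 .. m - l].
        long tail = 0;
        for (int i = 0; i <= m - l; i++) {
            tail += sorted[i];
        }
        return sorted[m - 1] < c * tail;
    }
}
```

For the frequencies 5, 3 and 2 with l = 2, the tail sum is 3 + 2 = 5, so the group satisfies the condition for c = 3 (5 < 15) but not for c = 1 (5 < 5 is false).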
5. Accessing and comparing data
*******************************
The ARX framework provides a convenient API for accessing the data in its original form and in a transformed representation. All data access is encapsulated in the abstract class DataHandle. A data handle for the original data can be obtained by calling Data.getHandle(), and a handle for the globally optimal transformation can be obtained via ARXResult.getHandle().
The handles for the input and output data are paired, meaning that all operations performed on one of the representations are also performed on the other. This includes sorting the data and swapping rows. Additionally, the data behind a DataHandle is dictionary-compressed, resulting in a low memory footprint.
DataHandles are especially useful for graphical tools, because they can be utilized to compare data, while making sure that their representations are always in sync (i.e., the i-th row of the output dataset represents the transformation of the i-th row in the input dataset).
Important methods include:
handle.getNumRows();
handle.getNumColumns();
handle.getValue(int row, int column);
handle.sort();
handle.swap(int row1, int row2);
handle.getAttributeName(int column);
Note that the data handles represent only the actual datasets and do not include any information about the attribute types or generalization hierarchies. A data handle for the input dataset can therefore be obtained at any time. It will remain valid, even when the DataDefinition is altered in subsequent steps.
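The pairing of input and output handles can be pictured as two tables that are always reordered together. The following sketch keeps an input and an output table in sync under swapping and sorting. PairedRowsSketch and its methods are hypothetical and do not reflect how DataHandle is implemented.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch (an assumption, not DataHandle): two tables whose rows
// are always reordered together, so row i of the output matches row i of the input.
final class PairedRowsSketch {

    private final String[][] input;
    private final String[][] output;

    PairedRowsSketch(String[][] input, String[][] output) {
        if (input.length != output.length) {
            throw new IllegalArgumentException("Both tables need the same number of rows");
        }
        // Copy the outer arrays so that reordering does not affect the caller's arrays.
        this.input = input.clone();
        this.output = output.clone();
    }

    /** Swaps two rows in both representations. */
    void swap(int row1, int row2) {
        String[] in = input[row1];
        input[row1] = input[row2];
        input[row2] = in;
        String[] out = output[row1];
        output[row1] = output[row2];
        output[row2] = out;
    }

    /** Sorts both representations by the values of one input column. */
    void sortByInputColumn(int column) {
        Integer[] order = new Integer[input.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparing((Integer i) -> input[i][column]));
        String[][] sortedInput = new String[input.length][];
        String[][] sortedOutput = new String[output.length][];
        for (int i = 0; i < order.length; i++) {
            sortedInput[i] = input[order[i]];
            sortedOutput[i] = output[order[i]];
        }
        System.arraycopy(sortedInput, 0, input, 0, input.length);
        System.arraycopy(sortedOutput, 0, output, 0, output.length);
    }

    String inputValue(int row, int column) {
        return input[row][column];
    }

    String outputValue(int row, int column) {
        return output[row][column];
    }
}
```

Because every reordering is applied to both tables, a graphical tool can display them side by side without tracking a separate row mapping.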
6. Writing data
***************
The classes DataHandle and Hierarchy provide methods to store their contents in a CSV representation:
Hierarchy/DataHandle.save(String path, char separator);
Hierarchy/DataHandle.save(File file, char separator);
Hierarchy/DataHandle.save(OutputStream out, char separator);
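For readers who want to post-process results outside of ARX, the sketch below writes rows to an output stream in a simple separator-delimited form. CsvWriterSketch is a hypothetical helper, not the ARX save method, and it assumes that values contain neither the separator nor line breaks, since it performs no quoting.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Illustrative helper (an assumption, not the ARX save method): writes rows
// as separator-delimited lines without quoting or escaping.
final class CsvWriterSketch {

    /** Writes one line per row, encoded as UTF-8. */
    static void save(String[][] rows, OutputStream out, char separator) {
        try {
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            for (String[] row : rows) {
                writer.write(String.join(String.valueOf(separator), row));
                writer.write('\n');
            }
            // The caller owns the stream, so it is flushed but not closed here.
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping the IOException in an UncheckedIOException is a design choice of this sketch that keeps call sites short. Production code may prefer to propagate the checked exception.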