Feat: Add Skeleton to support Dataset anonymization with differential privacy (#455)

**DP-Anonymization for BenchBase**

To measure the influence of modern anonymization techniques on
datasets and their query performance, we would like to add support for
differential privacy mechanisms to BenchBase. This PR adds the skeleton
of our approach to the codebase. It introduces a new flag,
`--anonymize=true`, that anonymizes tables after the loading step and
before execution.

The anonymization will:

1. Pull data from the DBMS via the JDBC connection already specified in
the config
2. Run DP-algorithms on the data to create a new, synthetic dataset
3. Push the synthetic data back to the DBMS as an anonymized copy
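
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `sqlite3` stands in for the JDBC connection, the synthesizer is a toy Laplace-noised histogram rather than the algorithm the real script uses, and the names `anonymize_column`, `synthesize`, and the column `i_category` are hypothetical.

```python
import random
import sqlite3
from collections import Counter


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthesize(values, domain, epsilon, rng):
    """Toy DP synthesizer: noise a histogram over a public domain, then resample."""
    counts = Counter(values)
    # Replacing one row changes two counts by 1 each, so the L1 sensitivity is 2.
    # The row count is treated as public in this sketch.
    scale = 2.0 / epsilon
    noisy = [max(0.0, counts[v] + laplace(scale, rng)) for v in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform weights
    return rng.choices(domain, weights=noisy, k=len(values))


def anonymize_column(conn, table, column, domain, epsilon, seed=0):
    # A fixed seed keeps the sketch reproducible; it is not suitable for real use.
    rng = random.Random(seed)
    # 1. Pull the data from the database.
    rows = [r[0] for r in conn.execute(f"SELECT {column} FROM {table}")]
    # 2. Create a synthetic dataset.
    synthetic = synthesize(rows, domain, epsilon, rng)
    # 3. Push the synthetic data back as an anonymized copy.
    target = f"{table}_anonymized"
    conn.execute(f"DROP TABLE IF EXISTS {target}")
    conn.execute(f"CREATE TABLE {target} ({column})")
    conn.executemany(f"INSERT INTO {target} VALUES (?)", [(v,) for v in synthetic])
    conn.commit()
    return target


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (i_category TEXT)")
conn.executemany("INSERT INTO item VALUES (?)", [("a",)] * 60 + [("b",)] * 40)
target = anonymize_column(conn, "item", "i_category", ["a", "b", "c"], epsilon=1.0)
print(target, conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0])
```

The domain is passed in explicitly rather than derived from the data, because values that appear only in the private data would otherwise leak through the histogram's keys.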

The anonymization information must be provided in the config file. The
process works with minimal information but also allows for
fine-tuning. A separate README file has been added that will list
all the features and how to use them:
`/scripts/anonymization/README.md`

Minimal config:
```
<anonymization>
    <table name="item">
        <differential_privacy />
    </table>
</anonymization>
```
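
As a rough illustration of how such a minimal config could be read, the sketch below parses the `<anonymization>` block with Python's standard `xml.etree.ElementTree` and fills in missing attributes. The function name `read_dp_settings` and the fallback values in `ASSUMED_DEFAULTS` are assumptions made for the example (the values are borrowed from the sample config in this commit); the actual defaults are defined by the anonymization script.

```python
import xml.etree.ElementTree as ET

MINIMAL_CONFIG = """
<parameters>
    <anonymization>
        <table name="item">
            <differential_privacy />
        </table>
    </anonymization>
</parameters>
"""

# Hypothetical fallbacks; the real defaults live in the anonymization script.
ASSUMED_DEFAULTS = {"epsilon": "1.0", "pre_epsilon": "0.0", "algorithm": "mst"}


def read_dp_settings(xml_text):
    """Return {table_name: settings} for each table with a <differential_privacy> element."""
    root = ET.fromstring(xml_text)
    settings = {}
    for table in root.iterfind("./anonymization/table"):
        dp = table.find("differential_privacy")
        if dp is None:
            continue  # table has no differential privacy section
        # Explicit attributes override the assumed defaults.
        settings[table.get("name")] = {**ASSUMED_DEFAULTS, **dp.attrib}
    return settings


print(read_dp_settings(MINIMAL_CONFIG))
```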
Sensitive value handling is one feature we want to add to the process
immediately. It replaces the actual values of specified columns with fake
ones. The code has already been written, tested, and used privately
within BenchBase.

The column faking approach will be decoupled from differential privacy
to allow for more control.
```
<anonymization>
    <table name="item">
        <differential_privacy> ... </differential_privacy>
        <value_faking> ... </value_faking>
    </table>
</anonymization>
```
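
To make the idea of value faking concrete, here is a minimal sketch that replaces every value in a column with a fake one, using only the standard library. It is not the PR's implementation: the `FAKE_NAMES` pool, the `fake_column` helper, and the choice to map equal inputs to equal fakes are all assumptions for illustration.

```python
import random

# Stand-in pool of fake values; a real setup would likely use a faking library.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Brooks", "Riley Chen"]


def fake_column(values, seed=0):
    """Replace each value with a fake one; equal inputs receive the same fake."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    mapping = {}
    faked = []
    for value in values:
        if value not in mapping:
            mapping[value] = rng.choice(FAKE_NAMES)
        faked.append(mapping[value])
    return faked


original = ["widget-a", "widget-b", "widget-a"]
print(fake_column(original, seed=0))
```

Keeping the mapping consistent preserves equality between rows, which can matter for joins and group-by queries. With a small pool, distinct inputs may still collide on the same fake value.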

**Disclaimer: To reduce complexity, the anonymization itself is not part
of this PR. Currently, the anonymization flag only calls the script and
parses the config. The rest of the code is ready to be added.**

[Architecture Benchbase.pdf](https://github.com/cmu-db/benchbase/files/14356135/Architecture.Benchbase.pdf)

---------

Co-authored-by: Brian Kroth <[email protected]>
ETHenzlere and bpkroth authored Jun 17, 2024
1 parent 98ccd4d commit 422a573
Showing 14 changed files with 1,152 additions and 2 deletions.
25 changes: 24 additions & 1 deletion .github/workflows/maven.yml
@@ -35,6 +35,7 @@ env:
POM_VERSION: 2023-SNAPSHOT
JAVA_VERSION: 21
ERRORS_THRESHOLD: 0.01
PYTHON_VERSION: "3.10"

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.event_name }}
@@ -265,7 +266,7 @@ jobs:
strategy:
fail-fast: false
matrix:
- benchmark: [ 'auctionmark', 'chbenchmark', 'epinions', 'hyadapt', 'noop', 'otmetrics', 'resourcestresser', 'seats', 'sibench', 'smallbank', 'tatp', 'templated', 'tpcc', 'tpcc-with-reconnects', 'tpch', 'twitter', 'voter', 'wikipedia', 'ycsb' ]
+ benchmark: [ 'anonymization', 'auctionmark', 'chbenchmark', 'epinions', 'hyadapt', 'noop', 'otmetrics', 'resourcestresser', 'seats', 'sibench', 'smallbank', 'tatp', 'templated', 'tpcc', 'tpcc-with-reconnects', 'tpch', 'twitter', 'voter', 'wikipedia', 'ycsb' ]
services:
mysql: # https://hub.docker.com/_/mysql
image: mysql:latest
@@ -301,6 +302,21 @@ jobs:
java-version: ${{env.JAVA_VERSION}}
distribution: 'temurin'

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{env.PYTHON_VERSION}}

- name: Install Python dependencies
working-directory: ./scripts/anonymization
run: |
if [[ ${{matrix.benchmark}} == anonymization ]]; then
python -m pip install --upgrade pip
pip install -r requirements.txt
else
echo "Dependency installation not necessary for benchmark"
fi
- name: Run benchmark
env:
MYSQL_PORT: ${{ job.services.mysql.ports[3306] }}
@@ -312,6 +328,13 @@ jobs:
if [[ ${{matrix.benchmark}} == templated ]]; then
java -jar benchbase.jar -b tpcc -c config/mysql/sample_tpcc_config.xml --create=true --load=true --execute=false --json-histograms results/histograms.json
java -jar benchbase.jar -b ${{matrix.benchmark}} -c config/mysql/sample_${{matrix.benchmark}}_config.xml --create=false --load=false --execute=true --json-histograms results/histograms.json
# For anonymization, we load tpcc and anonymize a single table. The workload itself is not executed
# FIXME: 'exit 0' is called because there is no benchmark executed and analyzed. Must be removed once the Anonymization script is
# fully implemented. See Pull Request 455.
elif [[ ${{matrix.benchmark}} == anonymization ]]; then
java -jar benchbase.jar -b tpcc -c config/mysql/sample_tpcc_config.xml --create=true --load=true --execute=false --json-histograms results/histograms.json
java -jar benchbase.jar -b tpcc -c config/mysql/sample_${{matrix.benchmark}}_config.xml --anonymize=true
exit 0
elif [[ ${{matrix.benchmark}} == tpcc-with-reconnects ]]; then
# See Also: WITH_SERVICE_INTERRUPTIONS=true docker/build-run-benchmark-with-docker.sh
java -jar benchbase.jar -b tpcc -c config/mysql/sample_tpcc_config.xml --create=true --load=true
34 changes: 34 additions & 0 deletions .github/workflows/python.yml
@@ -0,0 +1,34 @@
name: BenchBase (Python)

on:
  push:
    branches: [ main ]
    tags:
      - 'v*'
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10"]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        working-directory: ./scripts/anonymization
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Check anonymization files with pylint
        run: |
          pylint --rcfile=.pylintrc ./scripts/anonymization/src
      - name: Test anonymization with pytest
        working-directory: ./scripts/anonymization/src
        run: pytest test.py
5 changes: 4 additions & 1 deletion .gitignore
@@ -56,5 +56,8 @@ build/
.*.swp

.env

docker-compose-*.tar.gz

# Python
__pycache__/
*.py[cod]
45 changes: 45 additions & 0 deletions .pylintrc
@@ -0,0 +1,45 @@
[MAIN]

# Specify a score threshold under which the program will exit with error.
fail-under=0.9

[DESIGN]

# Maximum number of arguments for function / method.
max-args=9

# Maximum number of locals for function / method body.
max-locals=20

# Maximum number of return / yield for function / method body.
max-returns=6

# Maximum number of statements in function / method body.
max-statements=50

# Minimum number of public methods for a class (see R0903).
min-public-methods=0


[FORMAT]

# Maximum number of characters on a single line.
max-line-length=120

# Maximum number of lines in a module.
max-module-lines=1000

[REPORTS]

# Python expression which should return a score less than or equal to 10. You
# have access to the variables 'fatal', 'error', 'warning', 'refactor',
# 'convention', and 'info' which contain the number of messages in each
# category, as well as 'statement' which is the total number of statements
# analyzed. This score is used by the global evaluation report (RP0004).
evaluation=max(0, 0 if fatal else 10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10))

# Tells whether to display a full report or only the messages.
reports=no

# Activate the evaluation score.
score=yes
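
To see how the `evaluation` expression and `fail-under` interact, the following sketch recomputes the score for a made-up set of message counts. The function name `pylint_score` and the counts are hypothetical; only the formula and the threshold come from the `.pylintrc` above.

```python
def pylint_score(fatal, error, warning, refactor, convention, statement):
    """Mirror of the `evaluation` expression from the .pylintrc."""
    if fatal:
        return 0.0
    penalty = float(5 * error + warning + refactor + convention) / statement
    return max(0.0, 10.0 - penalty * 10)


FAIL_UNDER = 0.9  # value of `fail-under` in the .pylintrc

# Made-up counts: 4 errors, 15 warnings, 5 refactor and 10 convention messages
# across 400 analyzed statements.
score = pylint_score(fatal=0, error=4, warning=15, refactor=5, convention=10, statement=400)
print(score, "pass" if score >= FAIL_UNDER else "fail")
```

Errors are weighted five times as heavily as the other categories. Because the score is on a 0 to 10 scale, a `fail-under` of 0.9 only rejects runs that score below 0.9 out of 10.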
42 changes: 42 additions & 0 deletions config/mysql/sample_anonymization_config.xml
@@ -0,0 +1,42 @@
<?xml version="1.0"?>
<parameters>

    <!-- Connection details -->
    <type>MYSQL</type>
    <driver>com.mysql.cj.jdbc.Driver</driver>
    <url>jdbc:mysql://localhost:3306/benchbase?rewriteBatchedStatements=true&amp;allowPublicKeyRetrieval=True&amp;sslMode=DISABLED</url>
    <username>admin</username>
    <password>password</password>
    <reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
    <isolation>TRANSACTION_SERIALIZABLE</isolation>
    <batchsize>128</batchsize>

    <!-- Note: this example anonymizes the "item" table of the tpcc workload.
         To run, use the `anonymize=true` flag
    -->

    <!-- The anonymization configuration -->
    <anonymization>
        <table name="item">
            <differential_privacy epsilon="1.0" pre_epsilon="0.0" algorithm="mst">
                <!-- Column categorization -->
                <ignore>
                    <column name="i_id"/>
                    <column name="i_data" />
                    <column name="i_im_id" />
                </ignore>
                <categorical>
                    <column name="i_name" />
                </categorical>
                <!-- Continuous column fine-tuning -->
                <continuous>
                    <column name="i_price" bins="1000" lower="2.0" upper="100.0" />
                </continuous>
            </differential_privacy>
            <!-- Sensitive value handling -->
            <value_faking>
                <column name="i_name" method="name" locales="en_US" seed="0"/>
            </value_faking>
        </table>
    </anonymization>
</parameters>
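
One plausible reading of the `bins`, `lower`, and `upper` attributes on the continuous `i_price` column is equal-width discretization over a clamped range. The sketch below shows that interpretation only; it is an assumption, the helper names `bin_index` and `bin_midpoint` are hypothetical, and the real script may discretize differently.

```python
def bin_index(value, lower, upper, bins):
    """Clamp value into [lower, upper] and return its equal-width bin index."""
    clamped = min(max(value, lower), upper)
    width = (upper - lower) / bins
    # The upper bound itself falls into the last bin.
    return min(int((clamped - lower) / width), bins - 1)


def bin_midpoint(index, lower, upper, bins):
    """Representative value that a synthetic row could take for a given bin."""
    width = (upper - lower) / bins
    return lower + (index + 0.5) * width


# Attribute values taken from the sample config for i_price.
LOWER, UPPER, BINS = 2.0, 100.0, 1000

for price in (1.0, 2.0, 51.0, 100.0, 250.0):
    idx = bin_index(price, LOWER, UPPER, BINS)
    print(price, idx, round(bin_midpoint(idx, LOWER, UPPER, BINS), 3))
```

Clamping to fixed bounds keeps the domain independent of the data, which is why the bounds are given in the config rather than computed from the column's observed minimum and maximum.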