Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. Yet the lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. This project contains the source code for programmable dataflows, a programming model for implementing any data sharing problem with a new contract abstraction, allowing people to move towards a common sharing goal without violating any regulatory, privacy, or preference constraints. The programming model is implemented on top of an intermediary data escrow. (Link to programmable dataflow paper) (Link to data escrow paper)
We refer to any scenario in which one party wants access to another's data as a data sharing problem. Examples include: advertisers use a data cleanroom to run analysis on joint data; a national patient registry shares medical data with researchers for causal discovery; banks pool data to train a joint fraud detection model.
Consider the following data sharing problem: a few banks are interested in pooling credit card transaction data to train more accurate fraud detection models, subject to the following constraints. 1) Pooling data is commercially interesting to a bank only if it has a guarantee that the joint model benefits that bank itself, rather than only helping the others. 2) The shared data should be used only for model training, and nothing else.
A main challenge of data sharing is that people lack the information to assess whether a dataflow is desirable before it takes place. For example, the banks only want to release a joint model if it meets some accuracy threshold, but they have no way of ensuring that without first sharing the data to train the model. Additionally, once a bank shares its raw data, it has no guarantee that the other banks will use the data only for model training. To preclude adverse consequences, many people default to not sharing.
We introduce a new contract abstraction that bounds the consequences of each dataflow by making it explicit who contributes data, what computation takes place on that data, who receives the result, and under what conditions. Importantly, it provides this information before an intended dataflow takes place, thus addressing the challenge by helping agents make an informed decision on whether to allow the dataflow. The programming model implements the contract abstraction, enabling people to solve any data sharing problem through a sequence of contract propositions, approvals, and executions.
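To make the four elements of a contract concrete, here is a minimal Python sketch built around the bank example. It is not this project's API: the `Contract` class, its field names, and the `pooled_mean` function are all hypothetical, and the computation is a trivial stand-in for training and evaluating a joint model. It only illustrates the idea of proposing a contract, collecting approvals, and releasing a result when a condition holds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Contract:
    """Hypothetical contract: who contributes, what runs, who receives, and when."""

    sources: Dict[str, List[float]]  # contributor -> data held by the escrow
    function: Callable[[Dict[str, List[float]]], float]  # computation on the data
    destinations: Set[str]  # agents allowed to receive the result
    condition: Callable[[float], bool]  # result is released only if this holds
    approvals: Set[str] = field(default_factory=set)

    def approve(self, agent: str) -> None:
        # Only agents that contribute data have a say in this sketch.
        if agent not in self.sources:
            raise ValueError(f"{agent} does not contribute data to this contract")
        self.approvals.add(agent)

    def execute(self, requester: str) -> Optional[float]:
        if requester not in self.destinations:
            raise PermissionError(f"{requester} is not a destination of this contract")
        if self.approvals != set(self.sources):
            raise PermissionError("not every contributor has approved this contract")
        result = self.function(self.sources)
        # The raw data never leaves the escrow; only a result that meets the
        # agreed condition is returned.
        return result if self.condition(result) else None


def pooled_mean(data: Dict[str, List[float]]) -> float:
    """Stand-in for 'train a joint model and report its accuracy'."""
    values = [v for rows in data.values() for v in rows]
    return sum(values) / len(values)


# Proposition: two banks, one computation, one release condition.
contract = Contract(
    sources={"bank_a": [0.9, 0.8], "bank_b": [0.7, 1.0]},
    function=pooled_mean,
    destinations={"bank_a", "bank_b"},
    condition=lambda score: score >= 0.8,
)

# Approvals: every contributor must agree before anything runs.
contract.approve("bank_a")
contract.approve("bank_b")

# Execution: the result is released only because the condition holds.
result = contract.execute("bank_a")
print("released result:", result)
```

In this sketch, executing before both banks approve raises an error, and a result that misses the threshold is withheld (returned as `None`), which mirrors the guarantee the banks want before they agree to pool data.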
Clone the repo.
git clone https://github.com/TheDataStation/DataStation.git
Run the following command from the root directory to install the necessary packages.
pip install -r requirements.txt
Create the required directories.
mkdir SM_storage SM_storage_mount
The following steps run a simple data sharing application: sharing the schema of a CSV file with others.
Use the following configs in data_station_config.yaml:
cpm_path: "example_cpm/share_schema_app.py"
trust_mode: "full_trust"
in_development: True
Execute the script that contains the example application.
python3 -m integration_new.general_full_trust
Alternatively, to access the application through a web UI (FastAPI):
python3 -m server.fastapi_server
Ensure that you have Docker enabled on your machine.
Start Docker on macOS:
open -a Docker
Start Docker on Linux:
sudo systemctl start docker
sudo chmod 666 /var/run/docker.sock
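As an optional check before continuing, the following Python sketch tests whether the Docker daemon is reachable. It is not part of this repository; the function name `docker_is_running` is hypothetical, and the check assumes that `docker info` exits with a non-zero status when it cannot reach the daemon.

```python
import shutil
import subprocess


def docker_is_running(timeout_seconds: float = 10.0) -> bool:
    """Return True if the docker CLI is installed and can reach a daemon."""
    # No docker executable on PATH means there is nothing to talk to.
    if shutil.which("docker") is None:
        return False
    try:
        completed = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable CLI is treated as "not running".
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    print("Docker daemon reachable:", docker_is_running())
```

If this reports that the daemon is not reachable, start Docker using the commands above and run the check again.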
Use the following configs in data_station_config.yaml:
in_development: False
This work was supported by the National Science Foundation (NSF) under Grant No. 2040718.