This repository contains a comprehensive pipeline for generating and evaluating synthetic data using various state-of-the-art models, including privacy-preserving approaches.
For more details on the methodology and results, see our Synthetic Data Generation report on Overleaf.
The main pipeline implementation handles:
- Data loading and preprocessing
- Model training and evaluation
- Synthetic data generation
- Quality and privacy metrics evaluation
- Results visualization and logging
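The stages listed above can be sketched end to end as follows. This is a minimal illustration and not the repository's implementation: `load_and_preprocess`, `IndependentSampler`, and `run_pipeline` are hypothetical names, and the toy sampler merely stands in for a real model such as CTGAN.

```python
import random


def load_and_preprocess(rows):
    """Drop rows with missing values (a stand-in for real preprocessing)."""
    return [r for r in rows if all(v is not None for v in r.values())]


class IndependentSampler:
    """Toy stand-in model: samples every column independently
    from the values observed during fit()."""

    def fit(self, rows):
        if not rows:
            raise ValueError("cannot fit on an empty dataset")
        self.columns = {key: [r[key] for r in rows] for key in rows[0]}
        return self

    def sample(self, n, seed=0):
        rng = random.Random(seed)
        return [
            {key: rng.choice(values) for key, values in self.columns.items()}
            for _ in range(n)
        ]


def run_pipeline(rows, n_samples):
    """Load -> train -> generate, returning the synthetic rows and a small report."""
    clean = load_and_preprocess(rows)
    model = IndependentSampler().fit(clean)
    synthetic = model.sample(n_samples)
    report = {"n_real": len(clean), "n_synthetic": len(synthetic)}
    return synthetic, report
```

In a real run, the evaluation and logging steps would consume `synthetic` and extend `report` with quality and privacy metrics.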
Supported models:
- CTGAN
- TVAE
- GReaT (with LoRA fine-tuning)
- GaussianCopula
- CopulaGAN
- Privacy-preserving models:
  - PATE-CTGAN
  - DP-CTGAN
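One way to keep such a model list selectable by name is a small registry. The sketch below is an assumption about how this could be organized, not the repository's actual code: `ModelSpec`, `MODEL_REGISTRY`, `get_model_spec`, and `private_models` are hypothetical names, and only the model names themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    privacy_preserving: bool


# Model names are taken from the supported-models list; the registry is illustrative.
MODEL_REGISTRY = {
    spec.name: spec
    for spec in [
        ModelSpec("CTGAN", False),
        ModelSpec("TVAE", False),
        ModelSpec("GReaT", False),
        ModelSpec("GaussianCopula", False),
        ModelSpec("CopulaGAN", False),
        ModelSpec("PATE-CTGAN", True),
        ModelSpec("DP-CTGAN", True),
    ]
}


def get_model_spec(name):
    """Look up a model by name, failing with the list of valid choices."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        choices = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model {name!r}; choose from: {choices}") from None


def private_models():
    """Names of the privacy-preserving models, in registration order."""
    return [s.name for s in MODEL_REGISTRY.values() if s.privacy_preserving]
```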
- notebooks/GReaT_benchmark.ipynb: Initial experiments with the GReaT model
- notebooks/dp_models/dp_model_benchmark.ipynb: Development and testing of differential privacy models
- eval.ipynb: Evaluation and visualization of model outputs
- Create and activate a virtual environment:
python -m venv env
source env/bin/activate
- Install dependencies:
pip install -r requirements.txt
Run the pipeline with different configurations:
python -m notebooks.pipeline.main --experiment_name default_run
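For readers curious how a flag like `--experiment_name` is typically consumed, here is a minimal `argparse` sketch. It is an assumption about the entry point, not a copy of it: `build_parser` is a hypothetical helper, the default value is assumed, and only `--experiment_name` is shown because it is the only flag documented here.

```python
import argparse


def build_parser():
    """Build a parser exposing only the flag documented in this README."""
    parser = argparse.ArgumentParser(prog="notebooks.pipeline.main")
    parser.add_argument(
        "--experiment_name",
        default="default_run",  # assumed default, for illustration only
        help="Label used to group the outputs of one pipeline run.",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--experiment_name", "ctgan_baseline"])
print(args.experiment_name)
```

Giving each configuration its own experiment name keeps the results of different runs separate.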
- Modular architecture supporting multiple synthetic data generation models
- Comprehensive evaluation metrics for quality and privacy
- Integration with Weights & Biases for experiment tracking
- Automated visualization of results
- Support for both standard and privacy-preserving models
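To make "quality" and "privacy" metrics concrete, the sketch below shows one common example of each: total variation distance between the category frequencies of a column (lower means the synthetic column matches the real one more closely) and distance to closest record (very small distances suggest that a synthetic row nearly copies a real one). These are illustrative choices and not necessarily the metrics the pipeline computes; both function names are hypothetical.

```python
import math
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Quality: half the L1 distance between the two category frequency tables."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )


def distance_to_closest_record(real_rows, synthetic_rows):
    """Privacy: Euclidean distance from each synthetic row to its nearest real row.
    Rows are equal-length sequences of numbers."""
    return [
        min(math.dist(synth, real) for real in real_rows)
        for synth in synthetic_rows
    ]


tvd = total_variation_distance(["a", "a", "b", "b"], ["a", "a", "a", "b"])
dcr = distance_to_closest_record([(0, 0), (3, 4)], [(0, 0), (3, 0)])
print(tvd)  # 0.25
print(dcr)  # [0.0, 3.0]
```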