MagMueller/Synthetic-Data-Generation

Synthetic Data Generation Pipeline

This repository contains a pipeline for generating and evaluating synthetic data with several generative models (CTGAN, TVAE, GReaT, GaussianCopula, CopulaGAN), including privacy-preserving variants (PATE-CTGAN, DP-CTGAN).

Report & Results

For details on the methodology and results, see our report: Synthetic Data Generation Overleaf report

Project Structure

Core Pipeline (notebooks/pipeline/main.py)

The main pipeline implementation that handles:

  • Data loading and preprocessing
  • Model training and evaluation
  • Synthetic data generation
  • Quality and privacy metrics evaluation
  • Results visualization and logging

Supported models:

  • CTGAN
  • TVAE
  • GReaT (with LoRA fine-tuning)
  • GaussianCopula
  • CopulaGAN
  • Privacy-preserving models:
    • PATE-CTGAN
    • DP-CTGAN
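
The stages and model list above can be pictured with a minimal sketch. It is not the code in notebooks/pipeline/main.py: `IndependentSampler`, `MODEL_REGISTRY`, and `run_pipeline` are hypothetical names, and the toy model only resamples each column independently so that the example runs with the standard library alone. In the real pipeline the registry keys would correspond to models such as CTGAN or TVAE.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, object]


class IndependentSampler:
    """Toy stand-in model: samples each column independently from its observed values."""

    def fit(self, rows: List[Row]) -> "IndependentSampler":
        # Remember every observed value per column.
        self.columns = {name: [r[name] for r in rows] for name in rows[0]}
        return self

    def sample(self, n: int, seed: int = 0) -> List[Row]:
        rng = random.Random(seed)
        return [
            {name: rng.choice(values) for name, values in self.columns.items()}
            for _ in range(n)
        ]


# Hypothetical registry mapping a model name to a constructor.
MODEL_REGISTRY: Dict[str, Callable[[], IndependentSampler]] = {
    "toy_independent": IndependentSampler,
}


def run_pipeline(rows: List[Row], model_name: str, n_samples: int) -> dict:
    """Train the named model, generate samples, and compute one placeholder metric."""
    if model_name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model: {model_name}")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    model = MODEL_REGISTRY[model_name]().fit(rows)
    synthetic = model.sample(n_samples)
    # Placeholder check: share of synthetic rows that exactly copy a real row.
    real = {tuple(sorted(r.items())) for r in rows}
    copied = sum(tuple(sorted(s.items())) in real for s in synthetic)
    return {
        "model": model_name,
        "synthetic": synthetic,
        "exact_copy_rate": copied / n_samples,
    }
```

Selecting models by name through a registry is one way to keep the training and evaluation steps identical across generators; whether the repository does this is an assumption.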

Exploration & Development

  • notebooks/GReaT_benchmark.ipynb: Initial experiments with the GReaT model
  • notebooks/dp_models/dp_model_benchmark.ipynb: Development and testing of differential privacy models
  • eval.ipynb: Evaluation and visualization of model outputs

Installation

  1. Create and activate a virtual environment:
     python -m venv env
     source env/bin/activate
  2. Install dependencies:
     pip install -r requirements.txt

Usage

Run the pipeline as a module, naming the run with --experiment_name:

python -m notebooks.pipeline.main --experiment_name default_run
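
For readers unfamiliar with how such a flag is typically consumed, here is a hedged sketch using `argparse`. Only the `--experiment_name` flag comes from the command above; the `parse_args` helper, the default value, and the help text are assumptions, and the actual entry point may define more options.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="notebooks.pipeline.main")
    parser.add_argument(
        "--experiment_name",
        default="default_run",  # assumed default, not confirmed by the repository
        help="Label used to identify this run in logs and results.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Running experiment: {args.experiment_name}")
```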

Features

  • Modular architecture supporting multiple synthetic data generation models
  • Comprehensive evaluation metrics for quality and privacy
  • Integration with Weights & Biases for experiment tracking
  • Automated visualization of results
  • Support for both standard and privacy-preserving models
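
To make "quality and privacy metrics" concrete, the sketch below shows one common example of each. These are illustrative choices, not necessarily the metrics the pipeline computes: total variation distance compares category frequencies of a real and a synthetic column, and distance to closest record measures how near each synthetic row lies to any real row (values close to zero can indicate memorized records). Both function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def total_variation_distance(
    real: Sequence[Hashable], synthetic: Sequence[Hashable]
) -> float:
    """Quality: 0.0 for identical category frequencies, 1.0 for disjoint categories."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )


def distance_to_closest_record(
    real: Sequence[Sequence[float]], synthetic: Sequence[Sequence[float]]
) -> List[float]:
    """Privacy: Euclidean distance from each synthetic row to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]
```

For example, `total_variation_distance(["a", "a", "b", "b"], ["a", "a", "a", "b"])` evaluates to 0.25, and `distance_to_closest_record([(0, 0), (3, 4)], [(0, 0), (3, 0)])` evaluates to `[0.0, 3.0]`. The brute-force nearest-neighbor search is quadratic in the number of rows, which is acceptable for a sketch but not for large tables.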
