Commit b041ccb: Closes #17

HarrisonWilde committed Nov 3, 2023
1 parent 39a5636

Showing 8 changed files with 271 additions and 89 deletions.
57 changes: 57 additions & 0 deletions docs/downstream_tasks.md
@@ -0,0 +1,57 @@
# Defining a downstream task

It is likely that a synthetic dataset will be associated with specific modelling efforts or metrics that are not covered by the general suite of evaluation tools supported by this package. Additionally, the bias and fairness analyses of model outputs provided via [Aequitas](http://aequitas.dssg.io) require some set of predictions on which to operate. For these reasons, we provide a simple interface for defining a custom downstream task.

All downstream tasks are to be located in a folder named `tasks` in the working directory of the project, with subfolders for each dataset, e.g. the tasks associated with the `support` dataset should be located in the `tasks/support` directory.

The interface is then quite simple:

- There should be a function called `run` that takes a single argument: `dataset` (additional arguments could be provided with some further configuration if there is a need for this)
- The `run` function should fit a model and / or calculate some metric(s) on the dataset.
- It should then return predicted probabilities for the outcome variable(s) in the dataset and a dictionary of metrics.
- The file should contain a top-level variable containing an instantiation of the `nhssynth` `Task` class.

See the example below of a logistic regression model fit on the `support` dataset with the `event` variable as the outcome and `rocauc` as the metric of interest:

```python hl_lines="7 10 28 31"
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from nhssynth.modules.evaluation.tasks import Task


def run(dataset: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    # Split the dataset into features and target
    target = "event"

    data = dataset.dropna()
    X, y = data.drop(["dob", "x3", target], axis=1), data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        StandardScaler().fit_transform(X), y, test_size=0.33, random_state=42
    )

    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # Get the predicted probabilities for the positive class
    probs = pd.DataFrame(lr.predict_proba(X_test)[:, 1], columns=[f"lr_{target}_prob"])

    rocauc = roc_auc_score(y_test, probs)

    return probs, {"rocauc_lr": rocauc}


task = Task("Logistic Regression on 'event'", run, supports_aequitas=True)
```

Note the highlighted lines above:

1. The `Task` class has been imported from `nhssynth.modules.evaluation.tasks`
2. The `run` function should accept one argument and return a tuple
3. The second element of this tuple should be a dictionary labelling each metric of interest (this name is used as an identifier in the dashboard, so ensure it is unique within the experiment)
4. The `task` should be instantiated with a name, the `run` function and a boolean indicating whether the task supports Aequitas analysis. If the task does *not* support Aequitas analysis, the first element of the tuple will not be used and `None` can be returned instead (a sketch of such a task follows this list).
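
For example, a metric-only task that does not support Aequitas could look like the following. This is a minimal sketch using the same interface as above; the "event rate" metric is purely illustrative:

```python
import pandas as pd

from nhssynth.modules.evaluation.tasks import Task


def run(dataset: pd.DataFrame) -> tuple[None, dict]:
    # No predictions are produced, so the first element of the returned
    # tuple is None and the task cannot be used for Aequitas analysis
    event_rate = dataset["event"].mean()
    return None, {"event_rate": event_rate}


task = Task("Event rate on 'event'", run, supports_aequitas=False)
```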

The rest of this file can contain any arbitrary code that runs within these constraints; this could be a simple model as above, or a more complex pipeline of transformations and models to match a pre-existing workflow.
191 changes: 191 additions & 0 deletions docs/getting_started.md
@@ -0,0 +1,191 @@
# Getting Started

## Running an experiment

This package offers two easy ways to run reproducible and highly-configurable experiments; the following sections describe each of them.

### Via the CLI

The CLI is the easiest way to quickly run an experiment. It is designed to be as simple as possible, whilst still offering a high degree of configurability. An example command to run a full pipeline experiment is:

```bash
nhssynth pipeline \
--experiment-name test \
--dataset support \
--seed 123 \
--architecture DPVAE PATEGAN DECAF \
--repeats 3 \
--downstream-tasks \
--column-similarity-metrics CorrelationSimilarity ContingencySimilarity \
--column-shape-metrics KSComplement TVComplement \
--boundary-metrics BoundaryAdherence \
--synthesis-metrics NewRowSynthesis \
--divergence-metrics ContinuousKLDivergence DiscreteKLDivergence
```

This will run a full pipeline experiment on the `support` dataset in the `data` directory. The outputs of the experiment will be recorded in a folder named `test` (corresponding to the experiment name) in the `experiments` directory.

In total, three different model architectures will be trained three times each with their default configurations. The resulting generated synthetic datasets will be evaluated via the downstream tasks in `tasks/support` alongside the metrics specified in the command. A dashboard will then be built automatically to exhibit the results.

The components of the run are persisted to the experiment's folder. Suppose you have already run this experiment and want to add some new evaluations. You do not have to re-run the entire experiment; you can simply run:

```bash
nhssynth evaluation -e test -d support -s 123 --coverage-metrics RangeCoverage CategoryCoverage
nhssynth dashboard -e test -d support
```

This will regenerate the dashboard with a different set of metrics corresponding to the arguments passed to `evaluation`. Note that the `--experiment-name` and `--dataset` arguments are required for all commands, as they are used to identify the experiment and ensure reproducibility.

### Via a configuration file

A `yaml` configuration file placed in the `config` folder can be used to configure a run in the same way (note that the example below differs slightly from the CLI command above in its choice of architectures and number of repeats):

```yaml
seed: 123
experiment_name: test
run_type: pipeline
model:
  architecture:
    - DPVAE
    - DPGAN
    - DECAF
  max_grad_norm: 5.0
  secure_mode: false
  repeats: 4
evaluation:
  downstream_tasks: true
  column_shape_metrics:
    - KSComplement
    - TVComplement
  column_similarity_metrics:
    - CorrelationSimilarity
    - ContingencySimilarity
  boundary_metrics:
    - BoundaryAdherence
  synthesis_metrics:
    - NewRowSynthesis
  divergence_metrics:
    - ContinuousKLDivergence
    - DiscreteKLDivergence
```

Once this file is saved as `run_pipeline.yaml` in the `config` directory, the package can be run under the configuration laid out in the file via:

```bash
nhssynth config -c run_pipeline
```

Note that if you run via the [CLI](#via-the-cli), you can add the `--save-config` flag to your command to save the configuration file in the `experiments/test` (or whatever `--experiment-name` has been set to) directory. This allows for easy reproduction of an experiment at a later date, or on someone else's machine, by sharing the configuration file with them.
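
For example, a sketch of re-running the earlier pipeline command with the flag appended (the evaluation arguments are elided here for brevity):

```bash
nhssynth pipeline --experiment-name test --dataset support --seed 123 --save-config
```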

## Setting up a dataset's metadata

For each dataset you wish to work with, it is advisable to set up a corresponding metadata file. The package will infer this when information is missing (and you can then tweak it). The reason we suggest specifying metadata in this way is that Pandas / Python are in general bad at interpreting CSV files, particularly the specifics of data types, date objects and so on.

To do this, we create a metadata `yaml` file in the dataset's directory. For example, for the `support` dataset, this file is located at `data/support_metadata.yaml`. By default, the package will look for a file with the same name as the dataset in the dataset's directory, but with `_metadata` appended to the end. *This is configurable, like most other file-naming conventions, via the CLI.*
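
For the `support` example, the expected layout would therefore be as follows (a sketch; we assume here that the dataset itself is stored as `support.csv`):

```
data/
├── support.csv
└── support_metadata.yaml
```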

The metadata file is split into two sections: `columns` and `constraints`. The former specifies the nature of each column in the dataset, whilst the latter specifies any constraints that should be enforced on the dataset.

### Column metadata

Again, we refer to the `support` dataset's metadata file as an example:

```yaml
columns:
  dob:
    dtype:
      name: datetime64
      floor: S
  x1:
    categorical: true
    dtype: int64
  x2:
    categorical: true
    dtype: int64
  x3:
    categorical: true
  x4:
    categorical: true
    dtype: int64
  x5:
    categorical: true
    dtype: int64
  x6:
    categorical: true
    dtype: int64
  x7:
    dtype: int64
  x8:
    dtype: float64
    missingness:
      impute: mean
  x9:
    dtype: int64
  x10:
    dtype:
      name: float64
      rounding_scheme: 0.1
  x11:
    dtype: int64
  x12:
    dtype: float64
  x13:
    dtype: float64
  x14:
    dtype: float64
  duration:
    dtype: int64
  event:
    categorical: true
    dtype: int64
```

For each column in the dataset, we specify the following:

- Its `dtype`; this can be any `numpy` data type or a datetime type.
    - In the case of a datetime type, we also specify the `floor` (i.e. the smallest unit of time that we care about). In general this should be set to match the smallest unit of time in the dataset.
    - In the case of a `float` type, we can also specify a `rounding_scheme` to round the values to a certain precision. This should match the rounding applied to the column in the real data, or whatever precision you want the synthetic values to have.
- Whether it is `categorical` or not. If a column is not categorical, you don't need to specify this. A column is inferred as `categorical` if it has fewer than 10 unique values or is a string type (see the sketch after this list).
- If the column has missing values, we can specify how to deal with them via a `missingness` strategy. In the case of the `x8` column, we `impute` the missing values with the column's `mean`. If you don't specify this, the global missingness strategy specified via the CLI or configuration file will be applied instead (this defaults to the `augment` strategy, which models the missingness as a separate level in the case of categorical features, or as a separate cluster in the case of continuous features).
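
As a rough sketch of the categorical inference rule described above (the exact implementation in the package may differ; `infer_categorical` is a hypothetical helper):

```python
import pandas as pd


def infer_categorical(column: pd.Series, threshold: int = 10) -> bool:
    # Treat a column as categorical if it has fewer than `threshold`
    # unique values, or if it holds strings (object dtype)
    return column.nunique() < threshold or pd.api.types.is_object_dtype(column)
```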

### Constraints

The second part of the metadata file specifies any constraints that should be enforced on the dataset. These can be a relative constraint between two columns, or a fixed one via a constant on a single column. For example, the `support` dataset's constraints are as follows (note that these are arbitrarily defined and do not necessarily reflect the real data):

```yaml
constraints:
- "x10 in (0,100)"
- "x12 in (0,100)"
- "x13 in (0,100)"
- "x10 <= x12"
- "x12 < x13"
- "x10 < x13"
- "x8 > x10"
- "x8 > x12"
- "x8 > x13"
- "x11 > 100"
- "x12 > 10"
```

The function of these constraints is fairly self-explanatory: the package checks that the constraints are feasible and minimises the set of constraints before applying transformations that ensure they will also be satisfied in the synthetic data. When a column does not meet a feasible constraint in the real data, we assume that this is intentional and use the violation as a feature upon which to generate synthetic data that also violates the constraint.

There is a further constraint type, `fixcombo`, that only applies to categorical columns. It specifies that only existing combinations of two or more categorical columns should be generated, i.e. the columns can be collapsed into a single composite feature. For example, if we have a column for pregnancy and another for sex, we may want to allow only three categories: 'male:not-pregnant', 'female:pregnant' and 'female:not-pregnant'. This is specified as follows:

```yaml
constraints:
- "pregnancy fixcombo sex"
```

In conclusion, we support the following constraint types:

- `fixcombo` for categorical columns
- `<` and `>` for non-categorical columns
- `>=` and `<=` for non-categorical columns
- `in` for non-categorical columns, which is effectively two of the above combined, i.e. `x in [a, b)` is equivalent to `x >= a and x < b` (see the example below)
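
For illustration, following the interval notation above (the square-bracket form for an inclusive bound is extrapolated from the `[a, b)` example and is our assumption):

```yaml
constraints:
  - "x10 in (0,100)"   # equivalent to: x10 > 0 and x10 < 100
  - "x10 in [0,100)"   # equivalent to: x10 >= 0 and x10 < 100
```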

Once this metadata is set up, you are ready to run your experiment.

## Evaluation

Once models have been trained and synthetic datasets generated, we leverage evaluations from [SDMetrics](https://docs.sdv.dev/sdmetrics), [Aequitas](http://aequitas.dssg.io) and the NHS' internal [SynAdvSuite](https://github.com/nhsengland/SynAdvSuite) (at the current time you must request access to this repository to use the privacy-related attacks it implements), and also offer a facility for the [custom specification of downstream tasks](downstream_tasks.md). These evaluations are then aggregated into a dashboard for ease of comparison and analysis.

See the relevant documentation for each of these packages for more information on the metrics they offer.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,6 +2,6 @@

This is a package for generating useful synthetic data, audited and assessed along the dimensions of utility, privacy and fairness. Currently, the main focus of the package in its beta stage is to experiment with different model architectures to find which are the most promising for real-world usage.

-See the [User Guide](running_an_experiment.md) to get started with running an experiment with the package.
+See the [User Guide](getting_started.md) to get started with running an experiment with the package.

See the [Development Guide](development_guide.md) and [Code Reference](reference/cli/index.md) to get started with contributing to the package.
8 changes: 7 additions & 1 deletion docs/models.md
@@ -1,3 +1,9 @@
# Adding new models

-The `model` module contains all of the architectures.
+The `model` module contains all of the architectures implemented as part of this package. We offer GAN- and VAE-based architectures with a number of adjustments to achieve privacy and other augmented functionalities. The module handles the training and generation of synthetic data using these architectures, per a user's choice of model(s) and configuration.

It is likely that as the literature matures, more effective architectures will present themselves as promising for application to the type of tabular data `NHSSynth` is designed for. Below we discuss how to add new models to the package.

## Model design

In general, we view the VAE and (Tabular)GAN implementations in this package as the foundations for other architectures. As such, we try to maintain a somewhat modular design when building up more complex architectures (differentially private variants and so on). Each model inherits from either the `GAN` or `VAE` class ([in files of the same name](https://github.com/nhsengland/NHSSynth/tree/main/src/nhssynth/modules/model/models))