# DExter: Learning and Controlling Performance Expression with Diffusion Models

arXiv Paper Colab Demo

Code repository for DExter, a Diffusion-based Expressive performance generat(o)r, where we show samples of conditional and unconditional rendering with perceptually inspired features, as well as controlled ablation studies. The name also echoes "dexterity", one of the crucial qualities of a human master's hands that enables the most fine-grained control over individual notes, which our models strive to achieve as well.

## Table of Contents

- [Installation](#installation)
- [Rendering on custom score](#rendering-on-custom-score)
- [Training](#training)
- [Testing](#testing)
- [Evaluation](#evaluation)

## Installation

This repo was developed using python==3.9.7, so python>=3.9.7 is recommended.

To install all dependencies:

```bash
pip install -r requirements.txt
```

**Partitura versioning:** This project uses a slightly modified version of the performance codec, so please install my fork of partitura, where a dedicated branch is maintained for this project.

## Rendering on custom score

Inference on a custom score can be done either from the command line or in the Colab notebook. Note that the current inference script only supports mid-level-condition-free rendering (i.e. it uses only the given score as conditioning), since mid-level features are only available for the datasets that have audio tracks. Rendering with inferred mid-level features will come soon.

```bash
python inference.py score_path="/path/to/your/musicxml" pretrained_path='/path/to/checkpoint' output_path='/your/output/path'
```

## Training

### Dataset and Data processing

Three score-performance aligned datasets are used in this project:

- ASAP
- ATEPP
- VIENNA422

### Precomputing codec

The following script converts the scores and performances in the datasets into the performance codec `p_codec` and the score codec `s_codec`. Output is saved in `data/`. Note that different values of `MAX_NOTE_SEQ` lead to different truncations of `snote_ids`, which are saved separately.

```bash
python prepare_data.py --compute_codec --MAX_NOTE_SEQ=100 --mixup
```

- `MAX_NOTE_SEQ`: the length at which note sequences are split.
- `BASE_DIR`: base directory for the output, which is saved as `BASE_DIR/codec_N={max_note_len}.npy`.
- `mixup`: whether mixup augmentation is used. Our mixup strategy takes every pair of interpretations of the same segment and averages their `p_codec`, which scales the amount of data roughly 10 times (see the sketch below).
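A minimal sketch of this pairwise-averaging mixup, assuming equally shaped `p_codec` arrays (the function name and shapes are illustrative, not the repository's actual code):

```python
import numpy as np
from itertools import combinations

def mixup_p_codecs(p_codecs):
    """Average the p_codec of every pair of interpretations of the
    same segment; each pair yields one interpolated 'virtual' performance."""
    return [(a + b) / 2.0 for a, b in combinations(p_codecs, 2)]

# Example: 5 interpretations of one segment yield C(5, 2) = 10 extra
# samples, which is where the rough ~10x data scaling comes from.
interpretations = [np.random.rand(100, 4) for _ in range(5)]  # shapes assumed
augmented = mixup_p_codecs(interpretations)
assert len(augmented) == 10
```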

A pre-computed codec archive is also available: please download and unzip it. It contains the codecs and `snote_ids` for `MAX_NOTE_SEQ`=100, 300, and 1000.

Before training, put the pre-computed codec in `data/` under the root folder. However, for testing and for outputting decoded performances you will need the original score XMLs from the 3 datasets; please download them and put them under `Dataset/` at the same level as the root folder (the scores are needed to decode performances).
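To verify the data is in place, a precomputed codec file can be loaded directly; a minimal sketch, assuming the naming convention above and that the arrays contain pickled Python objects (hence `allow_pickle`):

```python
import numpy as np

# load the precomputed codec for MAX_NOTE_SEQ=300 (path per the convention above)
codec = np.load("data/codec_N=300.npy", allow_pickle=True)
print(type(codec), getattr(codec, "shape", None))
```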


The `c_codec`, derived from the mid-level perceptual features proposed in this paper, captures the perceptual expressiveness of the audio data and can be used for conditioning. The `c_codec` is precomputed for all audio data in the set we used.

### Transfer pairing

For transfer training or inference, we need to pair up two performances of the same piece. The computation goes through all precomputed codecs, finds the ones that belong to the same composition and the same segment, and creates pairs. Depending on the number of pairs requested (`K`), the function returns two lists, paired and unpaired data. Note that mixup interpolation data is of course not included in the pairing, as it does not correspond to real performances; it goes into the unpaired list instead. A sketch of this logic is given after the option list below.

```bash
python prepare_data.py --pairing --K=2374872
```

- `K`: the number of pairs to generate. At most 2374872 pairs can be found (at the segment level).
- `BASE_DIR`: base directory for the output, which is saved as two numpy lists: `BASE_DIR/codec_N={N}_mixup_paired_K={K}.npy` and `BASE_DIR/codec_N={N}_mixup_unpaired_K={K}.npy`.
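A minimal sketch of the pairing logic (the record field names `composition`, `segment`, and `is_mixup` are assumptions, not the repository's actual schema):

```python
from collections import defaultdict
from itertools import combinations

def pair_performances(records, K):
    """Group codec records by (composition, segment) and form pairs;
    mixup records never enter a pair since they are not real performances."""
    groups = defaultdict(list)
    unpaired = []
    for rec in records:
        if rec["is_mixup"]:
            unpaired.append(rec)  # interpolated data goes straight to unpaired
        else:
            groups[(rec["composition"], rec["segment"])].append(rec)

    paired = []
    for members in groups.values():
        # every two interpretations of the same composition/segment
        # form a candidate (source, target) pair, up to K pairs in total
        for src, tgt in combinations(members, 2):
            if len(paired) < K:
                paired.append((src, tgt))
    return paired, unpaired
```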

### Supervised training with conditioning

```bash
python train.py gpus=[0] task.timestep=1000 --train_target='gen_noise'
```

- `gpus` sets which GPUs to use: `gpus=[k]` means `device='cuda:k'`, while `gpus=2` means DistributedDataParallel (DDP) is used with two GPUs.
- `task.timestep` sets the number of denoising timesteps.
- For a full list of parameters, please refer to `train.yaml` and `task/classifierfree_diffusion.yaml`.
- `train_target`: `'gen_noise'` or `'transfer'`. `'gen_noise'` works within the standard diffusion framework and samples `p_codec` from N(0, 1). `'transfer'`, in contrast, is a diffusion-like strategy (the Gaussian-distribution assumption is dropped) that goes from one interpretation to another, conditioned on the perceptual features (inspired by this paper). The two training targets differ in their use of: 1. data: the former can be trained on individual performances, while the latter must be trained on pairs; 2. `c_codec`: in the former, `c_codec` is used as a standalone condition, while in the latter it is conditioned as the difference between the two interpretations (tgt - src), acting as a notch. A sketch of the two conditioning modes follows.
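A minimal sketch of the two conditioning modes, assuming `c_src` and `c_tgt` are the `c_codec` tensors of the source and target interpretations (the function and tensor names are illustrative, not the repository's API):

```python
import torch

def build_condition(c_src: torch.Tensor, c_tgt: torch.Tensor, train_target: str) -> torch.Tensor:
    if train_target == "gen_noise":
        # the c_codec of the (single) performance is a standalone condition
        return c_tgt
    if train_target == "transfer":
        # condition on the change in perceptual expression between the
        # two paired interpretations (tgt - src), acting as a "notch"
        return c_tgt - c_src
    raise ValueError(f"unknown train_target: {train_target}")
```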

The checkpoints will be output to `artifacts/checkpoint/`.

## Testing

First, open `config/train.yaml` and specify the weights to use in `pretrained_path`, for example `pretrained_path='artifacts/checkpoint/len300-beta0.02-steps1500-x_0-L15-C512-cfdg_ddpm_x0-w=0-p=0.1-k=3-dia=2-4/1244e-diffusion_loss0.03.ckpt'`. Alternatively, you can specify it on the command line.

```bash
python train.py gpus=[0] test_only=True load_trained=True task.transfer=True task.sample_steps_frac=0.75
```

- `task.transfer` sets the task to transfer from a source performance. Currently only the paired testing set is supported.
- `task.sample_steps_frac` is the fraction of diffusion steps used to noisify the source performance, and correspondingly the depth from which the denoising sampling starts (see the sketch below).
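A sketch of the role of `sample_steps_frac`, using the standard DDPM forward (q-sample) step; the repository's exact implementation may differ, and `alphas_cumprod` is assumed to be the usual cumulative product of the noise schedule:

```python
import torch

def noisify_source(p_codec_src: torch.Tensor, alphas_cumprod: torch.Tensor, frac: float = 0.75):
    """Noise the source p_codec up to t = frac * T; the sampler then
    denoises from that depth instead of starting from pure noise."""
    T = alphas_cumprod.shape[0]
    t = max(int(frac * T) - 1, 0)
    noise = torch.randn_like(p_codec_src)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * p_codec_src + (1.0 - a_bar).sqrt() * noise
    return x_t, t
```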

During testing, the following will be generated / evaluated:

- Samples from the testing set.
- Renderings of the samples, written to `artifacts/samples/`.
- An animation of the predicted `p_codec` emerging from noise.
- Tempo and velocity curves. If `WANDB_DISABLED=False`, these are uploaded to the wandb workspace.
- A comparison between the predicted `p_codec` and the label `p_codec`.

## Evaluation

For the full list of assessed attributes, please refer to the performance features.

```bash
python train.py gpus=[0] --renderer='gen_noise'
```

- `renderer` selects the renderer; an external renderer can be chosen for comparison.
- For a full list of parameters, please refer to `evaluate.yaml`.