Commit 0e38aa2

Grammatical review comments addressed

JPieper-NTIA committed Jul 8, 2024
1 parent 816f2b9 commit 0e38aa2
Showing 8 changed files with 75 additions and 76 deletions.
40 changes: 20 additions & 20 deletions README.md
@@ -1,17 +1,17 @@
# Dataset Alignment
This code corresponds to the paper "AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators" Jaden Pieper, Steve Voran, to appear in Proc. Interspeech 2024 and with [preprint available here](https://arxiv.org/abs/2406.10205).
This code corresponds to the paper "AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators," by Jaden Pieper, Stephen D. Voran, to appear in Proc. Interspeech 2024 and with [preprint available here](https://arxiv.org/abs/2406.10205).

When training a no-reference (NR) speech quality estimator, multiple datasets provide more information and can thus lead to better training. But they often are inconsistent in the sense that they use different subjective testing scales, or the exact same scale is used differently by test subjects due to the corpus effect.
AlignNet improves the training of NR speech quality estimators with multiple, independent datasets. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range.
AlignNet is intentionally designed to be independent of the choice of AudioNet.
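
In PyTorch terms, the idea can be pictured with the minimal sketch below; the class, the aligner architecture, and the layer sizes are illustrative assumptions, not the repository's actual implementation.
```python
import torch
import torch.nn as nn

class AlignNetSketch(nn.Module):
    """Illustrative sketch: an AudioNet produces an intermediate quality
    score; a small per-dataset Aligner maps that score onto each
    dataset's own subjective scale."""

    def __init__(self, audio_net: nn.Module, num_datasets: int):
        super().__init__()
        self.audio_net = audio_net  # any network mapping audio -> scalar score
        # One alignment function per dataset; the reference dataset's
        # aligner can be constrained to the identity.
        self.aligners = nn.ModuleList(
            nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
            for _ in range(num_datasets)
        )

    def forward(self, audio: torch.Tensor, dataset_idx: int) -> torch.Tensor:
        intermediate = self.audio_net(audio)  # (batch, 1) intermediate estimate
        return self.aligners[dataset_idx](intermediate)  # dataset-scale estimate
```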

This repository contains implementations of two different AudioNet choices: [MOSNet](https://arxiv.org/abs/1904.08352) and a simple example of a novel multi-scale convolution approach.

MOSNet demonstrates a network that takes the STFT of an audio signal as its input and the multi-scale convolution network is provided primarily as an example of a network that takes raw audio as an input.
MOSNet demonstrates a network that takes the STFT of an audio signal as its input, and the multi-scale convolution network is provided primarily as an example of a network that takes raw audio as an input.

# Installation
## Dependencies
There are two included environment files. `environment.yml` has the dependencies required to train with alignnet, but does not impose version requirements. It is thus susceptible to issues in the future if packages deprecate methods or have major backwards compatibility breaks. On the otherhand `environment-paper.yml` contains the exact versions of the packages that were used for all the results reported in our paper.
There are two included environment files. `environment.yml` has the dependencies required to train with alignnet but does not impose version requirements. It is thus susceptible to issues in the future if packages deprecate methods or have major backwards compatibility breaks. On the other hand, `environment-paper.yml` contains the exact versions of the packages that were used for all the results reported in our paper.

Create and activate the `alignnet` environment.
```
@@ -25,15 +25,15 @@ pip install .
```

# Preparing data for training
When training with multiple datasets some work must first be done to format them in a consistent manner so they can all be loaded in the same way.
For each dataset one must first make a csv that has subjective score in column called `MOS` and path to audio file in column called `audio_path`.
When training with multiple datasets, some work must first be done to format them in a consistent manner so they can all be loaded in the same way.
For each dataset, one must first make a csv that has the subjective score in a column called `MOS` and the path to the audio file in a column called `audio_path`.
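
As a minimal sketch (the file names and scores below are placeholders, not part of any dataset), such a csv could be assembled with pandas:
```python
import pandas as pd

# Placeholder paths and subjective scores; substitute your own dataset.
audio_paths = ["datasetX/audio/clip1.wav", "datasetX/audio/clip2.wav"]
mos_scores = [4.1, 2.8]

df = pd.DataFrame({"MOS": mos_scores, "audio_path": audio_paths})
df.to_csv("datasetX/labels.csv", index=False)
```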

If your `audio_net` model requires transformed data you can transform it prior to training with `pretransform_data.py` (see `python pretransform_data.py --help` for more information) and store paths to those transformed representation files in a column called `transform_path`. For example MOSNet uses the STFT of audio as an input. For more efficient training, pretransforming the audio into STFT representations, saving them, and including a column called `stft_path` in the csv is recommended.
If your `audio_net` model requires transformed data, you can transform it prior to training with `pretransform_data.py` (see `python pretransform_data.py --help` for more information) and store paths to those transformed representation files in a column called `transform_path`. For example, MOSNet uses the STFT of audio as an input. For more efficient training, pretransforming the audio into STFT representations, saving them, and including a column called `stft_path` in the csv is recommended.
More generally, the column name must match the value of `data.pathcol`.
For examples see [MOSNet](alignnet/config/models/pretrain-MOSNet.yaml) or [MultiScaleConvolution](alignnet/config/models/pretrain-msc.yaml).
For examples, see [MOSNet](alignnet/config/models/pretrain-MOSNet.yaml) or [MultiScaleConvolution](alignnet/config/models/pretrain-msc.yaml).
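
Purely as an illustration of the idea (the STFT parameters and the `.pt` extension here are assumptions; `pretransform_data.py` is the supported route), a pretransform pass might look like:
```python
import pandas as pd
import torch
import torchaudio

df = pd.read_csv("datasetX/labels.csv")
stft_paths = []
for wav_path in df["audio_path"]:
    audio, _sr = torchaudio.load(wav_path)  # (channels, samples)
    stft = torch.stft(
        audio[0], n_fft=512, hop_length=256,
        window=torch.hann_window(512), return_complex=True,
    )
    out_path = wav_path.replace(".wav", ".pt")
    torch.save(stft, out_path)  # save the transformed representation
    stft_paths.append(out_path)

df["stft_path"] = stft_paths  # column name must match data.pathcol
df.to_csv("datasetX/labels.csv", index=False)
```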


For each dataset, slit the data into training, validation, and testing portions with
For each dataset, split the data into training, validation, and testing portions with
```
python split_labeled_data.py /path/to/data/file.csv --output-dir /datasetX/splits/path
```
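
Conceptually, the split amounts to something like the sketch below; the 80/10/10 proportions and the output file names are assumptions, not necessarily what `split_labeled_data.py` does (see its `--help`).
```python
import pandas as pd

df = pd.read_csv("path/to/data/file.csv").sample(frac=1, random_state=0)  # shuffle

n_train = int(0.8 * len(df))
n_valid = int(0.1 * len(df))

df.iloc[:n_train].to_csv("train.csv", index=False)
df.iloc[n_train:n_train + n_valid].to_csv("valid.csv", index=False)
df.iloc[n_train + n_valid:].to_csv("test.csv", index=False)
```
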
@@ -50,16 +50,16 @@ Some basic training help can be found with
```
python train.py --help
```

To see an example config file and all the overrideable parameters for training MOSNet with AlignNet run
To see an example config file and all the overrideable parameters for training MOSNet with AlignNet, run
```
python train.py --config-dir alignnet/config/models --config-name=alignnet-MOSNet --cfg job
```
Here the `--cfg job` shows the configuration for this job without running the code.

If you are not training with a [clearML](https://clear.ml/) server be sure to set `logging=none`.
If you are not training with a [clearML](https://clear.ml/) server, be sure to set `logging=none`.
To change the number of workers used for data loading, override the `data.num_workers` parameter, which defaults to 6.

As an example and to confirm you have appropriately overridden these parameters you could run
As an example, and to confirm you have appropriately overridden these parameters, you could run
```
python train.py logging=none data.num_workers=4 --config-dir alignnet/config/models --config-name=alignnet-MOSNet --cfg job
```
@@ -108,7 +108,7 @@ finetune.restore_file=/absolute/path/to/alignnet/trained_models/pretrained-MOSNe

## MultiScaleConvolution example
Training NR speech estimators with AlignNet is intentionally designed to be agnostic to the choice of AudioNet.
To demonstrate this we include code for a rudimentary network that takes raw audio in as an input and trains separate convolutional networks on multiple time scales that are then aggregated into a single network component.
To demonstrate this, we include code for a rudimentary network that takes in raw audio as an input and trains separate convolutional networks on multiple time scales that are then aggregated into a single network component.
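
The underlying idea can be sketched as follows; the kernel sizes, strides, channel counts, and aggregation are illustrative assumptions, not the `alignnet.MultiScaleConvolution` implementation.
```python
import torch
import torch.nn as nn

class MultiScaleSketch(nn.Module):
    """Illustrative sketch: parallel 1-D conv branches see raw audio at
    different time scales; their pooled features are aggregated."""

    def __init__(self, kernel_sizes=(16, 64, 256), channels=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, channels, kernel_size=k, stride=k // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool each branch over time
            )
            for k in kernel_sizes
        )
        self.head = nn.Linear(channels * len(kernel_sizes), 1)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        x = audio.unsqueeze(1)  # (batch, samples) -> (batch, 1, samples)
        feats = [branch(x).flatten(1) for branch in self.branches]
        return self.head(torch.cat(feats, dim=1))  # one score per item
```
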
This network is defined as `alignnet.MultiScaleConvolution` and can be trained via:
```
python path/to/alignnet/train.py \
```
@@ -123,18 +123,18 @@ Some basic help can be seen via
```
python inference.py --help
```

In general three overrides must be set:
In general, three overrides must be set:
* `model.path` - path to a trained model
* `data.data_files` - list containing absolute paths to csv files that list audio files to perform inference on.
* `output.file` - path to file where inference output will be stored.

After running inference a csv will be created at `output.file` with the following columns:
After running inference, a csv will be created at `output.file` with the following columns:
* `file` - filenames where audio was loaded from
* `estimate` - estimate generated by the model
* `dataset` - index for which file from `data.data_files` this file belongs to.
* `AlignNet dataset index` - index for which dataset within the model the scores come from. This will be the same for every file in the csv. The default dataset will always be the reference dataset but this can be overriden via `model.dataset_index`.
* `dataset` - index listing which file from `data.data_files` this file belongs to.
* `AlignNet dataset index` - index listing which dataset within the model the scores come from. This will be the same for every file in the csv. The default dataset will always be the reference dataset, but this can be overridden via `model.dataset_index`.
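
As a small sketch, the output csv can then be consumed with pandas, for example to average estimates per input file list:
```python
import pandas as pd

est = pd.read_csv("estimations.csv")
print(est.groupby("dataset")["estimate"].mean())  # mean estimate per input csv
```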

For example, to run inference using the included AlignNet model trained on the smaller datasets one would run
For example, to run inference using the included AlignNet model trained on the smaller datasets, one would run
```
python inference.py \
data.data_files=[/absolute/path/to/inference/data1.csv,/absolute/path/to/inference/data2.csv] \
@@ -144,13 +144,13 @@ output.file=estimations.csv
```


# Gathering datasets used in 2024 Conference Paper
Here are links and reference to help with locating the data we have used in the paper.
Here are links and references to help with locating the data we have used in the paper.

* [Blizzard 2021](https://www.cstr.ed.ac.uk/projects/blizzard/data.html)
* Z.-H. Ling, X. Zhou, and S. King, "The Blizzard challenge 2021," in Proc. Blizzard Challenge Workshop, 2021.
* [Blizzard 2008](https://www.cstr.ed.ac.uk/projects/blizzard/data.html)
* V. Karaiskos, S. King, R. A. J. Clark, and C. Mayo, "The Blizzard challenge 2008," in Proc. Blizzard Challenge Workshop, 2008.
* [FFTnet](https://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/clips/)
* [FFTNet](https://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/clips/)
* Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: a real-time speaker-dependent neural vocoder," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
* [NOIZEUS](https://ecs.utdallas.edu/loizou/speech/noizeus/)
* Y. Hu and P. Loizou, "Subjective comparison of speech enhancement algorithms," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.
@@ -159,7 +159,7 @@ Here are links and reference to help with locating the data we have used in the
* [Tencent](https://github.com/ConferencingSpeech/ConferencingSpeech2022)
* G. Yi, W. Xiao, Y. Xiao, B. Naderi, S. Moller, W. Wardah, G. Mittag, R. Cutler, Z. Zhang, D. S. Williamson, F. Chen, F. Yang, and S. Shang, "ConferencingSpeech 2022 Challenge: Non-intrusive objective speech quality assessment challenge for online conferencing applications," in Proc. Interspeech, 2022, pp. 3308–3312.
* [NISQA](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus)
* G. Mittag, B. Naderi, A. Chehadi, and S. M ̈oller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131.
* G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131.
* [Voice Conversion Challenge 2018](https://datashare.ed.ac.uk/handle/10283/3257)
* J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Speaker Odyssey, 2018.
* [Indiana U. MOS](https://github.com/ConferencingSpeech/ConferencingSpeech2022)
2 changes: 1 addition & 1 deletion alignnet/config/hydra/help/train_help.yaml
@@ -3,7 +3,7 @@ app_name: AlignNet
header: == Training ${hydra.help.app_name} ==

footer: |-
Powered by Hydra (https://hyrda.cc)
Powered by Hydra (https://hydra.cc)
Use --hydra-help to view Hydra specific help.
template: |-
16 changes: 8 additions & 8 deletions alignnet/data.py
@@ -170,7 +170,7 @@ def padding(self, batch):

# Concatenate into one tensor
audio_out = torch.stack(audio_out, dim=0)
# If a transform is defined and the transform time is at collate now is the time to apply it
# If a transform is defined and the transform time is at collate, now is the time to apply it
if self.transform is not None and self.transform_time == "collate":
audio_out = self.transform.transform(audio_out)
audio_out = torch.unsqueeze(audio_out, dim=1)
@@ -179,7 +179,7 @@ def padding(self, batch):

class FeatureData(AudioData):
"""
For loading pre-computed features for audio files. Only the __getitem__ method needs to change
For loading pre-computed features for audio files. Only the __getitem__ method is changed
"""

def __init__(
@@ -229,7 +229,7 @@ def __getitem__(self, idx):
audio = self.wavs[audio_path]
else:
fname, ext = os.path.splitext(audio_path)
# If using same split csvs as audio this may say wav and not pt
# If using same split csvs as audio, this may say wav and not pt
# (coming out of pretransform_data.py will save as pt)
if ext == ".wav":
audio_path = fname + ".pkl"
@@ -296,7 +296,7 @@ def __init__(
"""
super().__init__()

# If this class sees batch_size=auto it sets to default value and assumes a Tuner is being called in the main
# If this class sees batch_size=auto, it sets a default value and assumes a Tuner is being called in the main
# logic to update this later
if batch_size == "auto":
batch_size = 32
@@ -313,11 +313,11 @@ def setup(self, stage: str):
"""
Load different datasubsets depending on stage.
If stage == 'fit' then train, valid, and test data are loaded.
If stage == 'fit', then train, valid, and test data are loaded.
If stage == 'test' then only test data is loaded.
If stage == 'test', then only test data is loaded.
If stage == 'predict' then self.data_dirs should be full paths to the specific
If stage == 'predict', then self.data_dirs should be full paths to the specific
csv files to run predictions on.
Parameters
@@ -382,7 +382,7 @@ def find_datasubsets(self, data_paths, subset):

def find_datasubset(self, data_path, subset):
"""
Helper function for setup to find the different data subsets (test/train/valid)
Helper function for setup to find the different data subsets (train/valid/test)
Parameters
----------
