AssertionError: CutSet has cuts with duplicated IDs. #1850

Open

mukherjeesougata opened this issue Dec 28, 2024 · 4 comments
@mukherjeesougata commented Dec 28, 2024

I am trying to run the Zipformer model on my custom dataset. The steps I followed are given below:

  1. I prepared the data by running lhotse kaldi import {train,dev,test}/ 16000 manifests/{train,dev,test}_manifest once per split (a quick sanity check on the resulting manifests is sketched after this list).

  2. I completed the fbank extraction stage (stage 3) of the prepare.sh script, which generated the files and folders shown in the screenshot below:
     (screenshot: Zipformer_fbank_Kui)

  3. I then prepared the BPE-based lang, which generated the lang_bpe_500 folder containing the bpe.model, tokens.txt, transcript_word.txt, unigram_500.model, and unigram_500.vocab files.

  4. Finally, I ran the command given below:
    ./pruned_transducer_stateless7_streaming/train.py --world-size 2 --num-epochs 30 --start-epoch 1 --use-fp16 1 --exp-dir pruned_transducer_stateless7_streaming/exp --max-duration 200 --enable-musan False
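
For reference, the manifests produced in step 1 can be sanity-checked for duplicate IDs before feature extraction. This is a minimal sketch; the output filenames assume lhotse's defaults and may differ in your setup:

from collections import Counter
from lhotse import RecordingSet, SupervisionSet

# Paths and filenames are assumptions based on the import command in step 1.
for split in ("train", "dev", "test"):
    recs = RecordingSet.from_file(f"manifests/{split}_manifest/recordings.jsonl.gz")
    sups = SupervisionSet.from_file(f"manifests/{split}_manifest/supervisions.jsonl.gz")
    for name, ids in (("recordings", [r.id for r in recs]),
                      ("supervisions", [s.id for s in sups])):
        dups = [i for i, n in Counter(ids).items() if n > 1]
        print(split, name, dups or "no duplicates")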

I am getting the following error:

Traceback (most recent call last):
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
    main()
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
    train_one_epoch(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 915, in train_one_epoch
    valid_info = compute_validation_loss(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/pruned_transducer_stateless7_streaming/train.py", line 737, in compute_validation_loss
    for batch_idx, batch in enumerate(valid_dl):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 53, in fetch
    data = self.dataset[possibly_batched_index]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
    validate_for_asr(cuts)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/dataset/speech_recognition.py", line 205, in validate_for_asr
    validate(cuts)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 39, in validate
    validator(obj, read_data=read_data)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/lhotse/qa.py", line 512, in validate_cut_set
    assert ids.most_common(1)[0][1] <= 1, "CutSet has cuts with duplicated IDs."
AssertionError: CutSet has cuts with duplicated IDs.
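
Note that the failing assertion comes from lhotse's own manifest validator (lhotse/qa.py in the traceback above), so it can be reproduced outside of training. A minimal sketch, assuming the datamodule reads gzipped cuts under data/fbank (the path is an assumption):

from lhotse import CutSet
from lhotse.qa import validate  # the same validator the dataset calls

# Path is an assumption; point it at the cuts that train.py actually loads.
cuts = CutSet.from_file("data/fbank/Kui_cuts_dev.jsonl.gz")
validate(cuts)  # raises "CutSet has cuts with duplicated IDs." if any ID repeats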
@mukherjeesougata (Author)

I have also tried this with another dataset. After running the CLI from step 4, it gave the following error:

  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1273, in <module>
    main()
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/./pruned_transducer_stateless7_streaming/train.py", line 1264, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 1144, in run
    train_one_epoch(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 814, in train_one_epoch
    loss, loss_info = compute_loss(
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/train.py", line 685, in compute_loss
    simple_loss, pruned_loss = model(
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/DATA/anaconda3/envs/icefall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/DATA/Sougata/icefall_toolkit/icefall/egs/Hindi/ASR/pruned_transducer_stateless7_streaming/model.py", line 121, in forward
    assert torch.all(x_lens > 0)
AssertionError
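
This second assertion (torch.all(x_lens > 0) in model.py) typically fires when some utterance's length becomes non-positive after the encoder's subsampling, i.e. the input features were extremely short. A minimal sketch for spotting such cuts, with an illustrative threshold and an assumed path:

from lhotse import CutSet

# Path and threshold are assumptions; adjust to your setup.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")
short = [c.id for c in cuts if c.duration < 1.0]
print(len(short), "very short cuts:", short[:10])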

@hosythach-jelly

Hi @mukherjeesougata,
I think you should remove the entries with duplicate IDs before creating the manifests.
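
A minimal sketch of that check on a Kaldi data dir, assuming the standard layout (data/train is a placeholder path):

from collections import Counter
from pathlib import Path

data_dir = Path("data/train")  # placeholder; repeat for dev and test

# The first whitespace-separated field of each line is the utterance/recording key.
for name in ("text", "wav.scp", "utt2spk"):
    keys = [line.split(maxsplit=1)[0]
            for line in (data_dir / name).read_text().splitlines()
            if line.strip()]
    dups = [k for k, n in Counter(keys).items() if n > 1]
    print(name, dups or "no duplicate keys")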

@JinZr (Collaborator) commented Jan 7, 2025

Hi,

Since you are directly importing a Kaldi-format data dir, I suggest running utils/fix_data_dir.sh (or a similarly named script; I cannot recall the exact name at the moment) to remove entries with duplicated keys to begin with.

Best,
Jin

@mukherjeesougata (Author) commented Jan 11, 2025

I have already used the utils/fix_data_dir.sh script to sort the train, dev, and test folders (which contain the text, wav.scp, and utt2spk files) and remove duplicates. In addition, I used the following code to find duplicate IDs in Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl:

import json
from collections import Counter

# Path to the cuts manifest to check (repeat for train/dev/test)
file_path = '/DATA/Sougata/icefall_toolkit/icefall/egs/Kui/ASR/data/unzipped_files/Kui_cuts_test.jsonl'

# Read the JSONL file and collect the cut IDs
ids = []
with open(file_path, 'r') as file:
    for line in file:
        data = json.loads(line)
        if 'id' in data:
            ids.append(data['id'])

# Report any IDs that occur more than once
id_counts = Counter(ids)
duplicates = [id_ for id_, count in id_counts.items() if count > 1]
print(duplicates, len(duplicates))

The above code prints [] 0, which means that Kui_cuts_train.jsonl, Kui_cuts_dev.jsonl, and Kui_cuts_test.jsonl do not contain any duplicate IDs.
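
One thing worth ruling out (an assumption, since the exact paths the datamodule reads are not shown in this thread) is that train.py loads a different copy of the manifests than the unzipped files checked above, e.g. gzipped cuts under data/fbank. A quick scan of those with lhotse itself:

from collections import Counter
from lhotse import CutSet

# Paths are assumptions; point these at the manifests the asr_datamodule loads.
for split in ("train", "dev", "test"):
    cuts = CutSet.from_file(f"data/fbank/Kui_cuts_{split}.jsonl.gz")
    dups = [i for i, n in Counter(c.id for c in cuts).items() if n > 1]
    print(split, dups or "no duplicate cut IDs")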
