Questions about the format and shape of the data #36
Thank you very much. For example, in a model with 40 categories, does this mean that the last category is used as the blank category that should be ignored?
Hi, I am working on an ASR-related project using Conformer. The four-dimensional output has confused me when calculating the loss to train the ASR model. Would you please provide an example of the loss calculation? Kind regards.
It should be recognised as the blank symbol.
Do you have any idea? I am confused about that too.
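For reference, PyTorch's `nn.CTCLoss` treats index 0 as the blank symbol by default; if the last of 40 categories should act as the blank, that has to be passed explicitly. A minimal sketch (the shapes and class count here are illustrative, not taken from this repository):

```python
import torch
import torch.nn as nn

num_classes = 40

# CTCLoss defaults to blank=0; to use the last category as blank, set blank=39.
criterion = nn.CTCLoss(blank=num_classes - 1)

# (time, batch, classes): 50 frames, batch of 2, log-probabilities over 40 classes
log_probs = torch.randn(50, 2, num_classes).log_softmax(dim=-1)
targets = torch.randint(low=0, high=num_classes - 1, size=(2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

loss = criterion(log_probs, targets, input_lengths, target_lengths)
```

Whatever index is chosen as blank must simply never appear in `targets`.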
@sooftware, can you please answer @zwan074? Many of us are confused about how to use a loss function to train the Conformer, since the outputs are log probabilities of the model prediction in 4 dimensions.
Sorry for the late response. I recommend checking this project.
I have another question about the function of the conformer. |
Show me the code. |
@sooftware When I execute the following code, the recognize_sp variable has the shape [32, 289].
As per the https://github.com/openspeech-team/openspeech project: when training the Conformer model, it uses the Conformer block to compute the output for a CTC loss. The LSTM decoder layer is unused. The code is as below:
@jcgeo9 289 is almost a quarter of 1162. This happens because of the Conv2dSubsampling in the convolution frontend of the Conformer.
@sooftware Hmm, OK, but what do I do with that? I mean, how do I convert it to what I actually want? Isn't it supposed to return a [32, 20] tensor containing integers that correspond to words from my vocabulary, which will then be converted with itos in order to check the loss?
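A minimal sketch of why the output length drops to roughly a quarter, assuming Conv2dSubsampling is two stride-2 convolutions with kernel size 3 and no padding (a common implementation; check the repository source for the exact layer settings):

```python
def subsampled_length(seq_len: int) -> int:
    # Each conv layer with kernel 3, stride 2, no padding maps
    # L -> (L - 3) // 2 + 1, which simplifies to (L - 1) // 2.
    for _ in range(2):
        seq_len = (seq_len - 1) // 2
    return seq_len

print(subsampled_length(1162))  # -> 289, matching the shape seen above
```

Under these assumptions, an input of 1162 frames comes out as exactly 289, which is consistent with the [32, 289] shape reported in the thread.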
I updated the code and README because many people seemed to have a hard time calculating losses.

```python
import torch
import torch.nn as nn
from conformer import Conformer

batch_size, sequence_length, dim = 3, 12345, 80

cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')

criterion = nn.CTCLoss().to(device)

inputs = torch.rand(batch_size, sequence_length, dim).to(device)
input_lengths = torch.IntTensor([12345, 12300, 12000])
targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                            [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                            [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
target_lengths = torch.LongTensor([9, 8, 7])

model = Conformer(num_classes=10,
                  input_dim=dim,
                  encoder_dim=32,
                  num_encoder_layers=3).to(device)

# Forward propagate
outputs, output_lengths = model(inputs, input_lengths)

# Calculate CTC Loss; CTCLoss expects (time, batch, classes), hence the transpose
loss = criterion(outputs.transpose(0, 1), targets, output_lengths, target_lengths)
```
I have a question. The input_lengths is not used to calculate the mask for multi-head attention. Does it still work?
Hello. I am currently using this package. I'm afraid this may be a basic question, but I'd like to ask:
1. Is the input a spectrogram or raw audio data?
2. When I run model(x, x_len, target, target_len), I get a four-dimensional output (batch, join_len, target_len, class_num) due to the calculation of the loss function. I wanted to see the recognition result, so I used model.recognize(x, x_len), but the shape of the output was (batch, join_len). I would like to see it with (batch, target_len). What is the process of recognizing?
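One common way to turn frame-level outputs into a label sequence is greedy CTC decoding: take the argmax at each frame, then collapse consecutive repeats and drop blanks. This is a generic sketch assuming CTC-style outputs with blank index 0; it is not the recognize implementation from this package:

```python
import torch

def greedy_ctc_decode(scores: torch.Tensor, blank: int = 0) -> list:
    """scores: (time, num_classes) for a single utterance."""
    best = scores.argmax(dim=-1).tolist()
    decoded, prev = [], None
    for idx in best:
        # keep a label only when it changes and is not the blank
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

# toy example: repeats collapse, blanks (class 0) are dropped
frames = torch.tensor([
    [0.1, 0.9, 0.0],  # argmax -> 1
    [0.1, 0.9, 0.0],  # argmax -> 1 (repeat, collapsed)
    [0.9, 0.0, 0.1],  # argmax -> 0 (blank, dropped)
    [0.0, 0.1, 0.9],  # argmax -> 2
])
print(greedy_ctc_decode(frames))  # -> [1, 2]
```

The decoded indices can then be mapped back to words with a vocabulary lookup such as itos. Note that decoded length varies per utterance, which is why recognition output cannot be a fixed (batch, target_len) tensor in general.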