
[paraformer] When is ONNX GPU export supported? #2503

Closed

willnufe opened this issue Apr 26, 2024 · 8 comments

Labels: enhancement (New feature or request), Stale

@willnufe

0. [Question] [paraformer] When is ONNX GPU export supported?

1. Version: **wenet-v3.0.1**

2. I tried to export paraformer to ONNX for GPU:

  1. Based on the forward function below [wenet-main/examples/aishell/paraformer/wenet/paraformer/paraformer.py], I attempted the paraformer ONNX GPU export:
    def forward_paraformer(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        res = self._forward_paraformer(speech, speech_lengths)
        return res['decoder_out'], res['decoder_out_lens'], res['tp_alphas']
  2. However, dynamic_axes did not take effect: the exported model can only handle inputs whose shapes exactly match the speech and speech_lengths used at export time. In addition, when the source audio is about one minute long, both the ONNX export and the model loading in onnxruntime-gpu are very slow.
torch.onnx.export(
    model,
    (speech, speech_lengths),
    model_path,
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=["speech", "speech_lengths"],
    output_names=[
        "decoder_out",
        "decoder_out_lens",
        "tp_alphas"
    ],
    dynamic_axes={
        "speech": {
            0: "B",
            1: "T"
        },
        "speech_lengths": {
            0: "B"
        },
        "decoder_out": {
            0: "B",
            1: "T_OUT"
        },
        "decoder_out_lens": {
            0: "B"
        },
        "tp_alphas": {
            0: "B",
            1: "T_OUT1"
        },
    },
    verbose=True,
)
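
For reference, a sanity check on whether the dynamic axes actually took effect is to run the exported session on two inputs of different lengths. This is only a sketch: the file name paraformer_gpu.onnx, the feature dimension 560 after LFR, and the int32 length dtype are assumptions and have to match whatever was used at export time.

import numpy as np
import onnxruntime as ort

# Assumptions: exported file name, feature dim after LFR, and length dtype.
session = ort.InferenceSession(
    "paraformer_gpu.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

for num_frames in (100, 250):  # two different lengths to exercise the dynamic T axis
    speech = np.random.randn(1, num_frames, 560).astype(np.float32)
    speech_lengths = np.array([num_frames], dtype=np.int32)
    decoder_out, decoder_out_lens, tp_alphas = session.run(
        None, {"speech": speech, "speech_lengths": speech_lengths})
    print(num_frames, decoder_out.shape, decoder_out_lens, tp_alphas.shape)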
@Mddct
Collaborator

Mddct commented May 6, 2024

Take a look at the ONNX export of the CIF part.

@willnufe
Author

willnufe commented May 7, 2024

@Mddct

> Take a look at the ONNX export of the CIF part.

Yes, the problem is indeed mainly in the CIF part.
I made some attempts and can now successfully export an ONNX GPU model that supports dynamic input sizes, but a few issues remain.

  1. wenet-main/examples/aishell/paraformer/wenet/utils/mask.py
# Calling .item() here prevents the exported ONNX model from supporting dynamic
# dimensions; inspecting the network with netron shows this becomes a fixed constant.
# max_len = max_len if max_len > 0 else lengths.max().item()
max_len = lengths.max()
  2. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
class Cif(nn.Module):
    def forward():
        if target_length is None and self.tail_threshold > 0.0:

            # This part also seems problematic: it raises an int32 / int64 incompatibility error.

            # token_num_int = torch.max(token_num).type(torch.int32).item()
            token_num_int = torch.max(token_num).type(torch.int64)
            acoustic_embeds = acoustic_embeds[:, :token_num_int, :]
  3. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
    1. Attempts

      • The body of the CIF function here is a for loop. If it is exported directly (via tracing), the number of loop iterations is fixed to whatever it was at export time.
      • Adding @torch.jit.script does make dynamic dimensions work, but the resulting model is very slow (see the loop-export sketch after this list).
    2. Solution

    3. Remaining issues

      • Inference latency is unstable: for audio files padded to the same length (60 s), timed in a for loop, latency varies from about 150 ms to 2000 ms.
      • There is some loss in recognition accuracy.
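
For reference, here is a minimal sketch of how a data-dependent for loop can be kept dynamic in the ONNX graph: wrapping the loop in @torch.jit.script makes the exporter emit an ONNX Loop node instead of unrolling the loop with the trace-time number of iterations. The module below is only illustrative, not the wenet Cif implementation, and names like count_fires are made up for the example.

import torch
import torch.nn as nn

@torch.jit.script
def count_fires(alphas: torch.Tensor, threshold: float) -> torch.Tensor:
    # Data-dependent loop over the time axis; because this function is scripted,
    # the ONNX exporter keeps it as a Loop node rather than unrolling it.
    acc = torch.zeros(alphas.size(0), dtype=alphas.dtype, device=alphas.device)
    fires = torch.zeros(alphas.size(0), dtype=alphas.dtype, device=alphas.device)
    for t in range(alphas.size(1)):
        acc = acc + alphas[:, t]
        fired = (acc >= threshold).to(alphas.dtype)
        fires = fires + fired
        acc = acc - fired * threshold
    return fires

class LoopModule(nn.Module):
    def forward(self, alphas: torch.Tensor) -> torch.Tensor:
        return count_fires(alphas, 1.0)

alphas = torch.rand(1, 50)
torch.onnx.export(
    LoopModule(), (alphas,), "loop_demo.onnx",
    opset_version=13,
    input_names=["alphas"], output_names=["fires"],
    dynamic_axes={"alphas": {0: "B", 1: "T"}, "fires": {0: "B"}})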

@Mddct
Collaborator

Mddct commented May 7, 2024

For the first part, this function will be refactored; it also affects torch.compile and related features.

For the second part, ONNX does support exporting for loops. I will sort this out when I have time; you can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.

@Mddct
Collaborator

Mddct commented May 7, 2024

@whisper-yu #2515 Please help test this for the mask part 🙏

@willnufe
Author

willnufe commented May 7, 2024

> For the second part, ONNX does support exporting for loops. I will sort this out when I have time; you can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.

Thanks! I will first try the ONNX for-loop export and see whether that solves it.

But could the unstable inference latency also be caused by the CIF part?

@Mddct
Collaborator

Mddct commented May 7, 2024

> Thanks! I will first try the ONNX for-loop export and see whether that solves it.
>
> But could the unstable inference latency also be caused by the CIF part?

It should be. The other components are all transformer-like, so their inference should be very stable.

@willnufe
Author

willnufe commented May 8, 2024

@Mddct

> It should be. The other components are all transformer-like, so their inference should be very stable.

I ran some tests, but the results do not quite match what I originally expected:

  • The slow part is actually the decoder (and it is sometimes fast, sometimes slow), while the encoder and predictor latencies are very stable (see the timing sketch after the export code below).
    • I do not have much experience here; I considered whether it could be a resource-contention problem, but that does not feel right either.
  • There is some accuracy degradation: on our own dataset the character error rate went from 30% (measured directly through the funasr API) to 36%. I suspect the CIF part is not aligned, since that is the part I mainly changed.

Test procedure

I exported the encoder, predictor, and decoder separately as ONNX GPU models, then timed each of them individually.

1. Test results (different test audio per run)

  • Latency unit: ms (milliseconds)
    [image: latency measurements]

2. Test code:

def infer_onnx(wav_path, model, tokenizer):

    # pre
    start_0 = time.time()
    wav, sr = torchaudio.load(wav_path)

    # padding
    padding_length = int(60 * sr - wav.shape[1])
    padding = torch.zeros(1, padding_length) + 0.00001
    wav = torch.cat([wav, padding], dim=1)

    data = wav.squeeze()
    data = [data]

    speech, speech_lengths = extract_fbank(data)
    # LFR is not folded into the encoder here because it contains operators that are not supported for export!!!
    lfr = LFR()
    speech, speech_lengths = lfr(speech, speech_lengths)
    end_0 = time.time()
    total_0 = int((end_0 - start_0) * 1000)


    # encoder
    start_1 = time.time()
    encoder_inputs = {
        "speech": to_numpy(speech),
        "speech_lengths": to_numpy(speech_lengths),
    }
    encoder_out, encoder_out_mask = encoder_session.run(None, encoder_inputs)
    end_1 = time.time()
    total_1 = int((end_1 - start_1) * 1000)


    # predictor
    start_2 = time.time()
    predictor_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
    }
    acoustic_embed, token_num, tp_alphas = predictor_session.run(None, predictor_inputs)    

    end_2 = time.time()
    total_2 = int((end_2 - start_2) * 1000)

    # decoder
    start_3 = time.time()
    decoder_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
        "acoustic_embed": acoustic_embed,
        "token_num": token_num,
    }
    decoder_out = decoder_session.run(None, decoder_inputs)
    decoder_out = decoder_out[0]

    end_3 = time.time()
    total_3 = int((end_3 - start_3) * 1000)

    # post
    start_4 = time.time()
    decoder_out = torch.tensor(decoder_out, dtype=torch.float32)
    decoder_out_lens = torch.tensor(token_num, dtype=torch.int32)
    tp_alphas = torch.tensor(tp_alphas, dtype=torch.float32)

    peaks = model.forward_cif_peaks(tp_alphas, decoder_out_lens)
    paraformer_greedy_result = paraformer_greedy_search(
        decoder_out, decoder_out_lens, peaks)
    
    results = {
        "paraformer_greedy_result": paraformer_greedy_result
    }

    for i in range(len(data)):
        for mode, hyps in results.items():
            tokens = hyps[i].tokens
            line = '{}'.format(tokenizer.detokenize(tokens)[0])

    end_4 = time.time()
    total_4 = int((end_4 - start_4) * 1000)

    print(f"[pre]-{total_0} ||[encoder]-{total_1} ||[predictor]-{total_2} ||[decoder]-{total_3} ||[post]-{total_4}")


    return line

3. Decoder ONNX export code

# forward part
class Paraformer(ASRModel):

    # DECODER
    @torch.jit.export
    def forward_decoder(
        self,
        encoder_out: torch.Tensor,
        encoder_out_mask: torch.Tensor,
        acoustic_embed: torch.Tensor,
        token_num: torch.Tensor
    ) -> torch.Tensor:
        
        # decoder
        decoder_out, _, _ = self.decoder(encoder_out, encoder_out_mask,
                                         acoustic_embed, token_num)
        decoder_out = decoder_out.log_softmax(dim=-1)

        return decoder_out  


##############################################################
# decoder ONNX model export part
if not os.path.exists(decoder_path):
    print("\n\n[export decoder]")
    model.forward = model.forward_decoder
    torch.onnx.export(
        model,
        (encoder_out, encoder_out_mask, acoustic_embed, token_num),
        decoder_path,
        export_params=True,
        opset_version=13,
        do_constant_folding=True,
        input_names=["encoder_out", "encoder_out_mask", "acoustic_embed", "token_num"],
        output_names=[
            "decoder_out",
        ],
        dynamic_axes={
            "encoder_out": {
                0: "B",
                1: "T_E"
            },
            "encoder_out_mask": {
                0: "B",
                2: 'T_E'
            },
            "acoustic_embed":{
                0: "B",
                1: "T_P"                    
            },
            "token_num":{
                0: "B"                        
            },


            "decoder_out":{
                0: "B",
                1: "T_P"
            },               
        },
        verbose=True,
    )  
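
For reference, a minimal timing sketch for the decoder session with warm-up runs on fixed inputs; the file name decoder.onnx is an assumption, and encoder_out / encoder_out_mask / acoustic_embed / token_num are meant to be the arrays produced by the encoder and predictor sessions in the test code above. If the latency still swings widely on identical inputs after warm-up, the variance is unlikely to come from shape-dependent re-allocation alone.

import time
import numpy as np
import onnxruntime as ort

# Assumption: "decoder.onnx" is the decoder exported above; the input arrays are
# the ones produced by the encoder / predictor sessions in the test code.
decoder_session = ort.InferenceSession(
    "decoder.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

decoder_inputs = {
    "encoder_out": encoder_out,
    "encoder_out_mask": encoder_out_mask,
    "acoustic_embed": acoustic_embed,
    "token_num": token_num,
}

# Warm-up: the first runs on the CUDA execution provider include kernel selection
# and memory allocation, which would otherwise distort the measurements.
for _ in range(3):
    decoder_session.run(None, decoder_inputs)

latencies = []
for _ in range(20):
    start = time.time()
    decoder_session.run(None, decoder_inputs)
    latencies.append((time.time() - start) * 1000)
print(f"decoder: min={min(latencies):.1f} ms / mean={np.mean(latencies):.1f} ms / "
      f"max={max(latencies):.1f} ms")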




@shatealaboxiaowang

Is this issue fixed?

@Mddct Mddct added the enhancement New feature or request label Jun 7, 2024
@github-actions github-actions bot added the Stale label Aug 7, 2024