
[paraformer] When is ONNX GPU export supported? #2503

Closed

willnufe opened this issue Apr 26, 2024 · 8 comments

Labels: enhancement (New feature or request), Stale

@willnufe

0. [Question] [paraformer] When is ONNX GPU export supported?

1. Version: **wenet-v3.0.1**

2. I tried to export paraformer to ONNX for GPU:

  1. Based on the forward function below [wenet-main/examples/aishell/paraformer/wenet/paraformer/paraformer.py], I attempted the paraformer ONNX GPU export:
    def forward_paraformer(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        res = self._forward_paraformer(speech, speech_lengths)
        return res['decoder_out'], res['decoder_out_lens'], res['tp_alphas']
  2. However, dynamic_axes did not take effect: the exported model can only handle inputs whose shapes exactly match the speech and speech_lengths used at export time. In addition, when the source audio is about one minute long, both the ONNX export and the model loading in onnxruntime-gpu are very slow.
torch.onnx.export(
    model,
    (speech, speech_lengths),
    model_path,
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=["speech", "speech_lengths"],
    output_names=[
        "decoder_out",
        "decoder_out_lens",
        "tp_alphas"
    ],
    dynamic_axes={
        "speech": {
            0: "B",
            1: "T"
        },
        "speech_lengths": {
            0: "B"
        },
        "decoder_out": {
            0: "B",
            1: "T_OUT"
        },
        "decoder_out_lens": {
            0: "B"
        },
        "tp_alphas": {
            0: "B",
            1: "T_OUT1"
        },
    },
    verbose=True,
)
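
For reference, a sanity check on whether the dynamic axes actually took effect is to run the exported session on two inputs of different lengths. This is only a sketch: the file name paraformer_gpu.onnx, the feature dimension 560 after LFR, and the int32 length dtype are assumptions and have to match whatever was used at export time.

import numpy as np
import onnxruntime as ort

# Assumptions: exported file name, feature dim after LFR, and length dtype.
session = ort.InferenceSession(
    "paraformer_gpu.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

for num_frames in (100, 250):  # two different lengths to exercise the dynamic T axis
    speech = np.random.randn(1, num_frames, 560).astype(np.float32)
    speech_lengths = np.array([num_frames], dtype=np.int32)
    decoder_out, decoder_out_lens, tp_alphas = session.run(
        None, {"speech": speech, "speech_lengths": speech_lengths})
    print(num_frames, decoder_out.shape, decoder_out_lens, tp_alphas.shape)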
@Mddct
Collaborator

Mddct commented May 6, 2024

Take a look at the ONNX export of the CIF part.

@willnufe
Author

willnufe commented May 7, 2024

@Mddct

> Take a look at the ONNX export of the CIF part.

Yes, the problem is indeed mainly in the CIF part.
I made some attempts and can now successfully export an ONNX GPU model that supports dynamic input sizes, but a few issues remain.

  1. wenet-main/examples/aishell/paraformer/wenet/utils/mask.py
# Calling .item() here prevents the exported ONNX model from supporting dynamic
# dimensions; inspecting the network with netron shows this becomes a fixed constant.
# max_len = max_len if max_len > 0 else lengths.max().item()
max_len = lengths.max()
  2. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
class Cif(nn.Module):
    def forward():
        if target_length is None and self.tail_threshold > 0.0:

            # This part also seems problematic: it raises an int32 / int64 incompatibility error.

            # token_num_int = torch.max(token_num).type(torch.int32).item()
            token_num_int = torch.max(token_num).type(torch.int64)
            acoustic_embeds = acoustic_embeds[:, :token_num_int, :]
  3. wenet-main/examples/aishell/paraformer/wenet/paraformer/cif.py
    1. Attempts

      • The body of the CIF function here is a for loop. If it is exported directly (via tracing), the number of loop iterations is fixed to whatever it was at export time.
      • Adding @torch.jit.script does make dynamic dimensions work, but the resulting model is very slow (see the loop-export sketch after this list).
    2. Solution

    3. Remaining issues

      • Inference latency is unstable: for audio files padded to the same length (60 s), timed in a for loop, latency varies from about 150 ms to 2000 ms.
      • There is some loss in recognition accuracy.
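
For reference, here is a minimal sketch of how a data-dependent for loop can be kept dynamic in the ONNX graph: wrapping the loop in @torch.jit.script makes the exporter emit an ONNX Loop node instead of unrolling the loop with the trace-time number of iterations. The module below is only illustrative, not the wenet Cif implementation, and names like count_fires are made up for the example.

import torch
import torch.nn as nn

@torch.jit.script
def count_fires(alphas: torch.Tensor, threshold: float) -> torch.Tensor:
    # Data-dependent loop over the time axis; because this function is scripted,
    # the ONNX exporter keeps it as a Loop node rather than unrolling it.
    acc = torch.zeros(alphas.size(0), dtype=alphas.dtype, device=alphas.device)
    fires = torch.zeros(alphas.size(0), dtype=alphas.dtype, device=alphas.device)
    for t in range(alphas.size(1)):
        acc = acc + alphas[:, t]
        fired = (acc >= threshold).to(alphas.dtype)
        fires = fires + fired
        acc = acc - fired * threshold
    return fires

class LoopModule(nn.Module):
    def forward(self, alphas: torch.Tensor) -> torch.Tensor:
        return count_fires(alphas, 1.0)

alphas = torch.rand(1, 50)
torch.onnx.export(
    LoopModule(), (alphas,), "loop_demo.onnx",
    opset_version=13,
    input_names=["alphas"], output_names=["fires"],
    dynamic_axes={"alphas": {0: "B", 1: "T"}, "fires": {0: "B"}})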

@Mddct
Collaborator

Mddct commented May 7, 2024

For the first part, this function will be refactored; it also affects torch.compile and related features.

For the second part, ONNX does support exporting for loops. I will sort this out when I have time; you can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.

@Mddct
Collaborator

Mddct commented May 7, 2024

@whisper-yu #2515 Please help test this for the mask part 🙏

@willnufe
Author

willnufe commented May 7, 2024

> For the second part, ONNX does support exporting for loops. I will sort this out when I have time; you can search for "torch for loop to onnx", or align with the parallel CIF implementation and submit a PR.

Thanks! I will first try the ONNX for-loop export and see whether that solves it.

But could the unstable inference latency also be caused by the CIF part?

@Mddct
Collaborator

Mddct commented May 7, 2024

> Thanks! I will first try the ONNX for-loop export and see whether that solves it.
>
> But could the unstable inference latency also be caused by the CIF part?

It should be. The other components are all transformer-like, so their inference should be very stable.

@willnufe
Author

willnufe commented May 8, 2024

@Mddct

> It should be. The other components are all transformer-like, so their inference should be very stable.

I ran some tests, but the results do not quite match what I originally expected:

  • The slow part is actually the decoder (and it is sometimes fast, sometimes slow), while the encoder and predictor latencies are very stable (see the timing sketch after the export code below).
    • I do not have much experience here; I considered whether it could be a resource-contention problem, but that does not feel right either.
  • There is some accuracy degradation: on our own dataset the character error rate went from 30% (measured directly through the funasr API) to 36%. I suspect the CIF part is not aligned, since that is the part I mainly changed.

Test procedure

I exported the encoder, predictor, and decoder separately as ONNX GPU models, then timed each of them individually.

1. Test results (different test audio per run)

  • Latency unit: ms (milliseconds)
    [image: latency measurements]

2. Test code:

def infer_onnx(wav_path, model, tokenizer):

    # pre
    start_0 = time.time()
    wav, sr = torchaudio.load(wav_path)

    # padding
    padding_length = int(60 * sr - wav.shape[1])
    padding = torch.zeros(1, padding_length) + 0.00001
    wav = torch.cat([wav, padding], dim=1)

    data = wav.squeeze()
    data = [data]

    speech, speech_lengths = extract_fbank(data)
    # LFR is not folded into the encoder here because it contains operators that are not supported for export!!!
    lfr = LFR()
    speech, speech_lengths = lfr(speech, speech_lengths)
    end_0 = time.time()
    total_0 = int((end_0 - start_0) * 1000)


    # encoder
    start_1 = time.time()
    encoder_inputs = {
        "speech": to_numpy(speech),
        "speech_lengths": to_numpy(speech_lengths),
    }
    encoder_out, encoder_out_mask = encoder_session.run(None, encoder_inputs)
    end_1 = time.time()
    total_1 = int((end_1 - start_1) * 1000)


    # predictor
    start_2 = time.time()
    predictor_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
    }
    acoustic_embed, token_num, tp_alphas = predictor_session.run(None, predictor_inputs)    

    end_2 = time.time()
    total_2 = int((end_2 - start_2) * 1000)

    # decoder
    start_3 = time.time()
    decoder_inputs = {
        "encoder_out": encoder_out,
        "encoder_out_mask": encoder_out_mask,
        "acoustic_embed": acoustic_embed,
        "token_num": token_num,
    }
    decoder_out = decoder_session.run(None, decoder_inputs)
    decoder_out = decoder_out[0]

    end_3 = time.time()
    total_3 = int((end_3 - start_3) * 1000)

    # post
    start_4 = time.time()
    decoder_out = torch.tensor(decoder_out, dtype=torch.float32)
    decoder_out_lens = torch.tensor(token_num, dtype=torch.int32)
    tp_alphas = torch.tensor(tp_alphas, dtype=torch.float32)

    peaks = model.forward_cif_peaks(tp_alphas, decoder_out_lens)
    paraformer_greedy_result = paraformer_greedy_search(
        decoder_out, decoder_out_lens, peaks)
    
    results = {
        "paraformer_greedy_result": paraformer_greedy_result
    }

    for i in range(len(data)):
        for mode, hyps in results.items():
            tokens = hyps[i].tokens
            line = '{}'.format(tokenizer.detokenize(tokens)[0])

    end_4 = time.time()
    total_4 = int((end_4 - start_4) * 1000)

    print(f"[pre]-{total_0} ||[encoder]-{total_1} ||[predictor]-{total_2} ||[decoder]-{total_3} ||[post]-{total_4}")


    return line

3. Decoder ONNX export code

# forward part
class Paraformer(ASRModel):

    # DECODER
    @torch.jit.export
    def forward_decoder(
        self,
        encoder_out: torch.Tensor,
        encoder_out_mask: torch.Tensor,
        acoustic_embed: torch.Tensor,
        token_num: torch.Tensor
    ) -> torch.Tensor:
        
        # decoder
        decoder_out, _, _ = self.decoder(encoder_out, encoder_out_mask,
                                         acoustic_embed, token_num)
        decoder_out = decoder_out.log_softmax(dim=-1)

        return decoder_out  


##############################################################
# decoder ONNX model export part
if not os.path.exists(decoder_path):
    print("\n\n[export decoder]")
    model.forward = model.forward_decoder
    torch.onnx.export(
        model,
        (encoder_out, encoder_out_mask, acoustic_embed, token_num),
        decoder_path,
        export_params=True,
        opset_version=13,
        do_constant_folding=True,
        input_names=["encoder_out", "encoder_out_mask", "acoustic_embed", "token_num"],
        output_names=[
            "decoder_out",
        ],
        dynamic_axes={
            "encoder_out": {
                0: "B",
                1: "T_E"
            },
            "encoder_out_mask": {
                0: "B",
                2: 'T_E'
            },
            "acoustic_embed":{
                0: "B",
                1: "T_P"                    
            },
            "token_num":{
                0: "B"                        
            },


            "decoder_out":{
                0: "B",
                1: "T_P"
            },               
        },
        verbose=True,
    )  
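
For reference, a minimal timing sketch for the decoder session with warm-up runs on fixed inputs; the file name decoder.onnx is an assumption, and encoder_out / encoder_out_mask / acoustic_embed / token_num are meant to be the arrays produced by the encoder and predictor sessions in the test code above. If the latency still swings widely on identical inputs after warm-up, the variance is unlikely to come from shape-dependent re-allocation alone.

import time
import numpy as np
import onnxruntime as ort

# Assumption: "decoder.onnx" is the decoder exported above; the input arrays are
# the ones produced by the encoder / predictor sessions in the test code.
decoder_session = ort.InferenceSession(
    "decoder.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

decoder_inputs = {
    "encoder_out": encoder_out,
    "encoder_out_mask": encoder_out_mask,
    "acoustic_embed": acoustic_embed,
    "token_num": token_num,
}

# Warm-up: the first runs on the CUDA execution provider include kernel selection
# and memory allocation, which would otherwise distort the measurements.
for _ in range(3):
    decoder_session.run(None, decoder_inputs)

latencies = []
for _ in range(20):
    start = time.time()
    decoder_session.run(None, decoder_inputs)
    latencies.append((time.time() - start) * 1000)
print(f"decoder: min={min(latencies):.1f} ms / mean={np.mean(latencies):.1f} ms / "
      f"max={max(latencies):.1f} ms")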




@shatealaboxiaowang

Is this issue fixed?

@Mddct Mddct added the enhancement New feature or request label Jun 7, 2024
@github-actions github-actions bot added the Stale label Aug 7, 2024