Slow TensorRT Inference Speed on Jetson Orin NX #35
Comments
I have never seen TensorRT run slower than ONNX before.
Thanks for your prompt reply! Am I correct in understanding that, if nothing goes wrong during the conversion from the ONNX file to the TRT file, the acceleration should theoretically be achieved?
Yes. Will you try the TensorRT C++ version?
Since I'm unfamiliar with C++, I'm currently focusing on the Python version and using your work as a reference. If our previous discussion is correct, then the data loading and preprocessing in my script might be consuming most of the time. I will keep investigating to find the cause. If everything still looks right but the speedup doesn't appear, I will try the C++ version and let you know. Thank you again for your prompt reply!
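For reference, a small timing helper like this can separate preprocessing time from pure engine time (the names in the usage comment are placeholders for the functions in my script, not their real names):

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), print the elapsed time in milliseconds, and return its result."""
    t0 = time.perf_counter()
    result = fn(*args)
    t1 = time.perf_counter()
    print(f"{label}: {(t1 - t0) * 1000:.1f} ms")
    return result

# Usage sketch (placeholder names):
# tensor = timed("preprocess", preprocess, image)
# depth  = timed("TRT inference", run_engine, tensor)
```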
Was it solved?
Thank you for your excellent work! 😆 😆 😆
Recently, I have been trying to use TensorRT to accelerate Depth Anything on a Jetson Orin NX. However, I found that the inference speed of the converted TRT file is not significantly better than that of the ONNX file, and in some runs it is even slower. Specifically:
The library versions are as follows:
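For completeness, the relevant versions can be queried like this (assuming torch, onnxruntime, and tensorrt are all installed in the environment); the snippet only shows how to report them, not the exact numbers from my setup:

```python
import torch
import onnxruntime
import tensorrt

print("PyTorch:     ", torch.__version__)
print("CUDA (torch):", torch.version.cuda)
print("ONNX Runtime:", onnxruntime.__version__)
print("TensorRT:    ", tensorrt.__version__)
```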
The function to convert the .pth file to an ONNX file is as follows:
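A minimal sketch of such an export (assuming the Depth Anything model has already been loaded from the .pth checkpoint; the input size, opset version, and tensor names here are illustrative, not necessarily what I used):

```python
import torch

def export_onnx(model, onnx_path, input_size=(1, 3, 518, 518)):
    """Export an already-loaded PyTorch model to ONNX with a fixed input shape."""
    model.eval()
    dummy = torch.randn(*input_size)
    torch.onnx.export(
        model,
        dummy,
        onnx_path,
        opset_version=17,
        input_names=["image"],
        output_names=["depth"],
    )
```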
The function to convert the ONNX file to a TRT file is as follows:
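A sketch of the build step with the TensorRT 8.x Python API (enabling FP16 is normally where the Orin speedup comes from; the workspace size below is an illustrative value, not my actual setting):

```python
import tensorrt as trt

def build_engine(onnx_path, engine_path, fp16=True):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX model into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB, illustrative
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # FP16 is usually what delivers the speedup on Orin

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized)
```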
The function to perform inference using the TRT file is as follows:
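And a sketch of the inference path (TensorRT 8.x style with pycuda; it assumes a single float32 input and a single float32 output binding, and the shapes depend on the actual model). One point that matters for timing: the engine should be deserialized once and reused across frames, otherwise loading dominates the measurement:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

def load_engine(engine_path):
    """Deserialize the engine once; reuse it for every frame."""
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        return trt.Runtime(logger).deserialize_cuda_engine(f.read())

def infer(engine, input_array):
    context = engine.create_execution_context()
    out_shape = tuple(context.get_binding_shape(1))  # assumes binding 1 is the output
    output = np.empty(trt.volume(out_shape), dtype=np.float32)

    # Device buffers and an async stream for copy -> execute -> copy.
    d_input = cuda.mem_alloc(input_array.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    stream = cuda.Stream()

    cuda.memcpy_htod_async(d_input, np.ascontiguousarray(input_array, dtype=np.float32), stream)
    context.execute_async_v2([int(d_input), int(d_output)], stream.handle)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    return output.reshape(out_shape)
```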
The code runs without any issues, apart from some warnings during the ONNX conversion, but the final speed is still not satisfactory. Looking forward to your response! ❤️ ❤️ ❤️