This document is about NVIDIA TensorRT in general but will focus on NVIDIA Jetson devices (TX1, TX2, Nano, Xavier AGX/NX...).

Starting with version 3.1.0, we support full GPGPU acceleration for NVIDIA Jetson devices using NVIDIA TensorRT and TF-TRT.

## Getting started

As explained above, we use both NVIDIA TensorRT and TF-TRT.

NVIDIA TensorRT is natively supported by all Jetson devices once flashed with JetPack, while TF-TRT requires TensorFlow binaries built with TensorRT support. You don't need to worry about building TensorFlow with TensorRT support yourself; this repo contains all the required binaries.

## Requirements

We require CUDA 10.2, cuDNN 8.0 and TensorRT 7+. To make your life easier, just install JetPack 4.4.1. As of today (11/16/2020), version 4.4.1 is the latest one.

## Before trying to use the SDK on Jetson

Please note that this repo doesn't contain optimized TensorRT models, and you'll not be able to use the SDK unless you generate these models. More information about model optimization is available at https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html. Fortunately, we have made this task very easy by writing an optimizer using the TensorRT C++ API.
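
To give an idea of what such an optimizer does under the hood, here is a minimal sketch using the public TensorRT 7 C++ API. The file names (`model.onnx`, `optimized.plan`) are placeholders for illustration only; this is not the SDK's actual code, which ships as part of the prepare.sh step described below.

```cpp
// optimizer_sketch.cpp - illustrative only, not the SDK's real optimizer.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <cstdio>
#include <fstream>

class Logger final : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
};

int main() {
    Logger logger;
    auto* builder = nvinfer1::createInferBuilder(logger);
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto* network = builder->createNetworkV2(flags);
    auto* parser = nvonnxparser::createParser(*network, logger);
    // "model.onnx" is a placeholder; the SDK ships its own model files.
    parser->parseFromFile("model.onnx",
        static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(256U << 20); // 256 MB of scratch space
    if (builder->platformHasFastFp16()) {
        config->setFlag(nvinfer1::BuilderFlag::kFP16); // FP16 where supported
    }

    // Building the engine is the slow part (several minutes on Jetson).
    auto* engine = builder->buildEngineWithConfig(*network, *config);
    auto* plan = engine->serialize(); // the "plan" written to disk
    std::ofstream out("optimized.plan", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    // (cleanup with ->destroy() omitted for brevity)
    return 0;
}
```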

## Building optimized models

This process will write the optimized models (a.k.a. plans) to the local disk, which means we'll need write permission. We recommend running the next commands as root (#) instead of a normal user ($). To generate the optimized models:

- Navigate to the jetson binaries folder: `cd ultimateALPR-SDK/binaries/jetson/aarch64` or `cd ultimateALPR-SDK/binaries/jetson_tftrt/aarch64`
- Generate the optimized models: `sudo chmod +x ./prepare.sh && sudo ./prepare.sh`

This will build the models using the CUDA engine and serialize the optimized models into assets/models.tensorrt/optimized. Please note that the task will take several minutes, so you must be patient. The next time you run this task it will be faster, as only the newest models will be generated. You can therefore interrupt the process, and next time it will continue from where it left off.
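
At runtime the SDK only has to deserialize these plans instead of rebuilding them, which is why subsequent loads are fast. A minimal sketch of the loading side, again using the public TensorRT 7 C++ API (the plan path is a placeholder; the three-argument deserializeCudaEngine signature is the TensorRT 7 one, removed in TensorRT 8):

```cpp
// engine_load_sketch.cpp - illustrative only.
#include <NvInfer.h>
#include <fstream>
#include <vector>

nvinfer1::ICudaEngine* loadPlan(const char* path, nvinfer1::ILogger& logger) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    auto* runtime = nvinfer1::createInferRuntime(logger);
    // Deserializing a prebuilt plan takes milliseconds instead of minutes.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}
```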

For binaries/jetson_tftrt, the prepare.sh script will also download TensorFlow libraries, which means you'll need an internet connection. You'll only need to run the script once.

Models generated on a Jetson device with Compute Capability X and TensorRT version Y will only be usable on devices matching this configuration. For example, you'll not be able to use models generated on a Jetson TX2 (Compute Capability 6.2) on a Jetson Nano (Compute Capability 5.3).
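
If you're unsure which configuration a device has, you can query both values programmatically; a small sketch using the CUDA runtime API and TensorRT's version helper:

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h> // declares getInferLibVersion()
#include <cstdio>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0); // device 0 is the integrated GPU on Jetson
    std::printf("Compute Capability: %d.%d\n", prop.major, prop.minor);
    std::printf("TensorRT version: %d\n", getInferLibVersion()); // e.g. 7130 for 7.1.3
    return 0;
}
```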

If you navigate to the binaries you'll see that there are 2 'jetson' folders: binaries/jetson and binaries/jetson_tftrt.

| Feature | binaries/jetson | binaries/jetson_tftrt |
| --- | --- | --- |
| License plate and car detection | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| License Plate Country Identification (LPCI) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| Vehicle Color Recognition (VCR) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| Vehicle Make Model Recognition (VMMR) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| Vehicle Body Style Recognition (VBSR) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| Vehicle Direction Tracking (VDT) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| Vehicle Speed Estimation (VSE) | NVIDIA TensorRT GPGPU acceleration | NVIDIA TensorRT GPGPU acceleration |
| License Plate Recognition (LPR) | CPU | TF-TRT GPGPU acceleration (requires TensorFlow C++ libraries built with CUDA 10.2, cuDNN 8.0 and TensorRT 7+) |

To make it short: both versions use NVIDIA TensorRT GPGPU acceleration for detection and classification, while only binaries/jetson_tftrt uses GPGPU acceleration for License Plate Recognition (LPR) (a.k.a. OCR).

binaries/jetson is very fast when parallel mode is enabled, as we perform the detection and classification on the GPU and the recognition/OCR on the CPU. binaries/jetson_tftrt is even faster, as all operations (detection, classification, OCR...) are done on the GPU.
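
To make the reasoning concrete, here is a deliberately simplified producer/consumer sketch of what parallel mode means: a GPU detection stage and a CPU OCR stage overlapping across frames. This is an illustration only, not the SDK's actual implementation; detectOnGpu and ocrOnCpu are hypothetical stand-ins.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-ins for the real GPU/CPU stages.
struct Frame {};
struct Plate {};
Plate detectOnGpu(const Frame&) { return {}; } // TensorRT detection
void ocrOnCpu(const Plate&) {}                 // CPU recognition/OCR

int main() {
    std::queue<Plate> plates;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Consumer: OCR on CPU, running concurrently with detection.
    std::thread ocr([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !plates.empty() || done; });
            if (plates.empty()) return;
            Plate p = plates.front();
            plates.pop();
            lock.unlock();
            ocrOnCpu(p); // overlaps with detection of the next frame
        }
    });

    // Producer: GPU detection. While frame N is being OCR'd on the CPU,
    // frame N+1 is already being detected on the GPU.
    for (int i = 0; i < 100; ++i) {
        Plate p = detectOnGpu(Frame{});
        {
            std::lock_guard<std::mutex> lock(m);
            plates.push(p);
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    ocr.join();
    return 0;
}
```

The overlap between the two stages is where the extra frame rate comes from: neither the GPU nor the CPU sits idle waiting for the other.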

So far we have failed to convert the License Plate Recognition (LPR) model to NVIDIA TensorRT, which is why we're using TF-TRT, and TF-TRT comes with many issues: large binary size, high memory usage, slow load and initialization... We're working to use NVIDIA TensorRT for all models and completely remove TensorFlow.

## Pros and Cons

- binaries/jetson
  - Pros:
    - Low memory usage (~20% on Jetson Nano)
    - Fast load and initialization
  - Cons:
    - High CPU usage (>300% out of 400% on Jetson Nano)
    - Lower frame rate when there are license plates in the image, as License Plate Recognition (LPR) is NOT GPGPU accelerated
- binaries/jetson_tftrt
  - Pros:
    - Full GPGPU acceleration for all operations (detection, classification and OCR), hence a higher frame rate
  - Cons:
    - High memory usage (~50% on Jetson Nano). The TF-TRT binaries are >500MB, which doesn't help.
    - Slow load and initialization. For now we cannot generate the optimized models for the OCR part using TF-TRT; the models are built and optimized at runtime, before inference.

To check GPU and CPU usage: `/usr/bin/tegrastats`

## Recommendations

We recommend using binaries/jetson during development, as it loads very fast, and switching to binaries/jetson_tftrt for production. binaries/jetson_tftrt may be slow to load and initialize, but once that's done the frame rate is higher.

On the Jetson Xavier AGX, binaries/jetson may be faster than binaries/jetson_tftrt. Check issue #128 to see why.

If binaries/jetson is still too slow to load and initialize, then use binaries/linux/aarch64, which are very light binaries using TensorFlow Lite (less than 13MB total size).

## Benchmark

Here are some benchmark numbers to compare the speed. For more information about the positive rate, please check https://www.doubango.org/SDKs/anpr/docs/Benchmark.html. The benchmark application is open source and can be found at samples/c++/benchmark.

Before running the benchmark application:

- For the Jetson Nano, make sure you're using a Barrel Jack (5V-4A) power supply instead of the micro-USB port (5V-2A)
- Put the device in maximum performance mode: `sudo nvpmodel -m 0 && sudo jetson_clocks`

To run the benchmark application for binaries/jetson with a 0.2 positive rate for 100 loops:

```
cd ultimateALPR-SDK/binaries/jetson/aarch64
chmod +x benchmark
LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./benchmark \
    --positive ../../../assets/images/lic_us_1280x720.jpg \
    --negative ../../../assets/images/london_traffic.jpg \
    --assets ../../../assets \
    --charset latin \
    --loops 100 \
    --rate 0.2 \
    --parallel true
```

|  | 0.0 rate | 0.2 rate | 0.5 rate | 0.7 rate | 1.0 rate |
| --- | --- | --- | --- | --- | --- |
| binaries/jetson_tftrt<br/>(Xavier NX, JetPack 4.4.1) | 657 millis<br/>152.06 fps | 967 millis<br/>103.39 fps | 1280 millis<br/>78.06 fps | 1539 millis<br/>64.95 fps | 1849 millis<br/>54.07 fps |
| binaries/jetson<br/>(Xavier NX, JetPack 4.4.1) | 657 millis<br/>152.02 fps | 1169 millis<br/>85.47 fps | 2112 millis<br/>47.34 fps | 2703 millis<br/>36.98 fps | 3628 millis<br/>27.56 fps |
| binaries/linux/aarch64<br/>(Xavier NX, JetPack 4.4.1) | 7498 millis<br/>13.33 fps | 8281 millis<br/>12.07 fps | 9421 millis<br/>10.61 fps | 10161 millis<br/>9.84 fps | 11006 millis<br/>9.08 fps |
| binaries/jetson_tftrt<br/>(TX2, JetPack 4.4.1) | 1420 millis<br/>70.38 fps | 1653 millis<br/>60.47 fps | 1998 millis<br/>50.02 fps | 2273 millis<br/>43.97 fps | 2681 millis<br/>37.29 fps |
| binaries/jetson<br/>(TX2, JetPack 4.4.1) | 1428 millis<br/>70.01 fps | 1712 millis<br/>58.40 fps | 2165 millis<br/>46.17 fps | 2692 millis<br/>37.13 fps | 3673 millis<br/>27.22 fps |
| binaries/linux/aarch64<br/>(TX2, JetPack 4.4.1) | 4591 millis<br/>21.77 fps | 4722 millis<br/>21.17 fps | 5290 millis<br/>18.90 fps | 7154 millis<br/>13.97 fps | 10032 millis<br/>9.96 fps |
| binaries/jetson_tftrt<br/>(Nano, JetPack 4.4.1) | 3106 millis<br/>32.19 fps | 3292 millis<br/>30.37 fps | 3754 millis<br/>26.63 fps | 3967 millis<br/>25.20 fps | 4621 millis<br/>21.63 fps |
| binaries/jetson<br/>(Nano, JetPack 4.4.1) | 2920 millis<br/>34.24 fps | 3083 millis<br/>32.42 fps | 3340 millis<br/>29.93 fps | 3882 millis<br/>25.75 fps | 5102 millis<br/>19.59 fps |
| binaries/linux/aarch64<br/>(Nano, JetPack 4.4.1) | 4891 millis<br/>20.44 fps | 6950 millis<br/>14.38 fps | 9928 millis<br/>10.07 fps | 11892 millis<br/>8.40 fps | 14870 millis<br/>6.72 fps |

You can notice that binaries/jetson and binaries/jetson_tftrt have the same fps when the positive rate is 0.0 (no plate in the stream), but the gap widens as the rate increases (more plates in the stream). This can be explained by the fact that both use NVIDIA TensorRT to accelerate the license plate and car detection, but only binaries/jetson_tftrt uses GPGPU acceleration for License Plate Recognition (LPR) (thanks to TF-TRT).

binaries/linux/aarch64 contains generic Linux binaries for AArch64 (a.k.a. ARM64) devices. All operations are done on CPU. The performance boost between this CPU-only version and the Jetson-based ones may not seem impressive, but there is a good reason: binaries/linux/aarch64 uses INT8 inference while the Jetson-based versions use a mix of FP32 and FP16, which is more accurate. Providing INT8 models for Jetson devices is on our roadmap, with no ETA.
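
For reference, switching a TensorRT build from FP16 to INT8 is mostly a matter of builder flags plus a calibrator. A hedged sketch under those assumptions (implementing the calibrator itself, e.g. by deriving from nvinfer1::IInt8EntropyCalibrator2, is left out):

```cpp
#include <NvInfer.h>

// Picks the fastest precision the device supports. The calibrator is
// only needed for INT8: it feeds sample images so TensorRT can choose
// per-tensor quantization scales.
void choosePrecision(nvinfer1::IBuilder& builder,
                     nvinfer1::IBuilderConfig& config,
                     nvinfer1::IInt8Calibrator* calibrator) {
    if (builder.platformHasFastInt8() && calibrator) {
        config.setFlag(nvinfer1::BuilderFlag::kINT8);
        config.setInt8Calibrator(calibrator);
    } else if (builder.platformHasFastFp16()) {
        config.setFlag(nvinfer1::BuilderFlag::kFP16); // the current Jetson path
    }
}
```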

## Jetson Nano versus Raspberry Pi 4

On average the SDK is 3 times faster on the Jetson Nano compared to the Raspberry Pi 4. This may not seem impressive, but there is a good reason: binaries/raspbian/armv7l uses INT8 inference while the Jetson-based binaries (binaries/jetson and binaries/jetson_tftrt) use a mix of FP32 and FP16, which is more accurate. Providing INT8 models for Jetson devices is on our roadmap, with no ETA.

## Jetson Xavier NX versus Jetson TX2

The Jetson Xavier NX and Jetson TX2 are offered at the same price ($399), but the NX has 4.6 times more compute power than the TX2 for FP16: 6 TFLOPS versus 1.3 TFLOPS.

We highly recommend using Xavier NX instead of TX2.

## Pre-processing operations

Please note that even when you're using binaries/jetson_tftrt, some pre-processing operations are performed on the CPU, and this is why the CPU usage is at around 1/5th. You don't need to worry about these operations: they are massively multi-threaded and entirely written in assembler with SIMD NEON acceleration. These functions are open source.
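
As an illustration of the kind of NEON-accelerated pre-processing involved, here is a generic RGB-to-grayscale conversion written with C intrinsics rather than hand-written assembler; this is a hypothetical example, not the SDK's actual code. It processes 8 pixels per iteration:

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Integer approximation of gray = 0.299*R + 0.587*G + 0.114*B,
// using weights 77/150/29 that sum to 256 so the shift by 8 normalizes.
void rgb_to_gray_neon(const uint8_t* rgb, uint8_t* gray, size_t pixels) {
    size_t i = 0;
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(rgb + i * 3); // de-interleave R, G, B
        uint16x8_t acc = vmull_u8(px.val[0], vdup_n_u8(77));
        acc = vmlal_u8(acc, px.val[1], vdup_n_u8(150));
        acc = vmlal_u8(acc, px.val[2], vdup_n_u8(29));
        vst1_u8(gray + i, vshrn_n_u16(acc, 8)); // narrow back to 8-bit
    }
    for (; i < pixels; ++i) { // scalar tail for leftover pixels
        const uint8_t* p = rgb + i * 3;
        gray[i] = (uint8_t)((77 * p[0] + 150 * p[1] + 29 * p[2]) >> 8);
    }
}
```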

## Coming next

Version 3.1.0 is the first release to support NVIDIA Jetson, so there is room for optimization. Adding support for full INT8 inference could improve the speed by up to 700%. We're also planning to move the NMS layer from the GPU to the CPU and rewrite the code in assembler with NEON SIMD.

## Known issues and possible fixes

### Failed to open file

You may receive a [UltAlprSdkTRT] Failed to open file error after running the ./prepare.sh script if we fail to write to the local disk. We recommend running the script as root (#) instead of a normal user ($).

### Slow load and initialization

When you're using binaries/jetson_tftrt, the OCR models are built using CUDA engines at runtime, before running the inference. Building the models is very slow and not suitable during development. We recommend using binaries/jetson during development, as it loads very fast, and switching to binaries/jetson_tftrt for production. binaries/jetson_tftrt may be slow to load and initialize, but once that's done the frame rate is higher.

### High memory usage

Disabling parallel mode alone can decrease memory usage by 50%. Of course, disabling parallel mode will also slow down the frame rate by up to 60%.
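
You can measure the trade-off yourself: the benchmark application shown above accepts `--parallel false`, so you can compare memory usage (e.g. via `/usr/bin/tegrastats`) and frame rate with and without parallel mode.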

binaries/jetson_tftrt is the faster version, but it depends on TF-TRT, which is very large (>500MB). Try binaries/jetson instead, which is very small (less than 13MB total size).

### High CPU usage

binaries/jetson uses the CPU for the OCR part. Use binaries/jetson_tftrt for full GPGPU acceleration to significantly reduce CPU usage.

## Technical questions

Please check our discussion group or Twitter account.