TensorRT can fuse layers, merging operations that run in sequence or share the same input into a single kernel, which reduces GPU memory reads/writes and kernel-launch overhead.
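The idea behind fusion can be sketched in plain NumPy (an illustration of the concept, not TensorRT's actual implementation): a linear layer, bias add, and ReLU folded into one pass produce the same result as running them as three separate steps, but without writing intermediate tensors to memory.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

# Unfused: three separate "layers", each producing an intermediate result
h = W @ x                      # linear
h = h + b                      # bias add
y_unfused = np.maximum(h, 0)   # ReLU

# "Fused": one function, one pass over the data, no stored intermediates
def fused_linear_bias_relu(W, b, x):
    return np.maximum(W @ x + b, 0)

y_fused = fused_linear_bias_relu(W, b, x)
print(np.allclose(y_fused, y_unfused))  # the fused version is numerically identical
```

On a GPU the saving comes from launching one kernel instead of three and keeping the data in registers between the steps.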
TensorRT converts operations to optimized kernels for the specified precision (FP32, FP16, or INT8) for improved latency, throughput, and efficiency.
Reduced precision can cost a small amount of accuracy, but TensorRT's calibration determines where lower precision can be applied safely.
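The precision trade-off can be seen with a quick NumPy experiment (an illustration only; TensorRT's calibration is far more sophisticated): round-tripping FP32 values through FP16 introduces a small, bounded rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(10_000).astype(np.float32)  # values in [0, 1)

# Round-trip through half precision, as an FP16 engine effectively does
x_fp16 = x.astype(np.float16).astype(np.float32)

max_err = float(np.max(np.abs(x - x_fp16)))
print(max_err)  # small but nonzero: FP16 keeps only ~3 decimal digits
```

For many networks this level of error is negligible relative to the speedup, which is why reduced precision is usually worth it.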
TensorRT automatically selects the best-performing kernels for the target GPU platform, algorithm, and model (kernel auto-tuning).
TensorRT allocates memory for each tensor only for the duration of its use, avoiding repeated allocations, which reduces memory consumption and improves reuse efficiency.
Its scalable design processes multiple input streams in parallel, optimized for the underlying GPU.
Because inference runs faster and uses less memory, you can deploy a more complex (and typically better-performing) model in place of a trimmed-down one.
You can check the official TensorRT website for details.
1. Get the trained model.
2. Convert the model to UFF format (e.g., a TensorFlow .pb model).
3. Build the TensorRT engine.
4. Run inference with TensorRT.

(If your model is trained with TensorFlow, you need to convert it to UFF format.)
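The build-and-run steps above look roughly like the following with the UFF-era (pre-8.x) TensorRT Python API. This is a sketch, not a complete program: it assumes a GPU machine with TensorRT installed, and the file name "model.uff", the tensor names "Input"/"NMS", and the input shape are placeholders you must replace for your own network.

```python
# Sketch: build a TensorRT engine from a UFF file, then run inference.
# Assumes TensorRT (UFF-era, pre-8.x) on a GPU machine; names and shapes
# below are assumptions, not values from any particular model.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(uff_path="model.uff"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.UffParser()
    parser.register_input("Input", (3, 300, 300))  # placeholder name/shape
    parser.register_output("NMS")                  # placeholder output node
    parser.parse(uff_path, network)
    builder.max_batch_size = 1
    builder.max_workspace_size = 1 << 28           # 256 MiB scratch space
    return builder.build_cuda_engine(network)

engine = build_engine()
# Inference then runs through an execution context, with input/output
# buffers allocated on the device, e.g.:
#   context = engine.create_execution_context()
#   context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
```

Building the engine is a one-time, offline step; the serialized engine can be saved and reloaded for deployment.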
However, the UFF conversion can be difficult when the model contains unsupported operations, because you then have to write a custom plugin. An alternative is to convert the .pb model to ONNX format and use the ONNX parser API instead.
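For the ONNX route, one commonly used tool is tf2onnx. A sketch of the conversion command follows; the input/output tensor names are placeholders that must match your graph's actual node names.

```shell
# Convert a frozen TensorFlow graph to ONNX (requires: pip install tf2onnx).
# "input:0" and "detections:0" are placeholder tensor names.
python -m tf2onnx.convert \
    --graphdef frozen_inference_graph.pb \
    --output model.onnx \
    --inputs input:0 \
    --outputs detections:0
```

The resulting model.onnx can then be loaded with TensorRT's ONNX parser instead of the UFF parser.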
convert-to-uff frozen_inference_graph.pb -O NMS -p config.py