Speed up ONNX models with TensorRT

ONNX models

ONNX is a great initiative to standardize the structure and storage of deep neural networks.

Almost all frameworks export to ONNX in one way or another. Implementations still vary, though they are expected to converge over time. ONNX models are of great interest because:

They can easily be moved between tools together with their weights
They can be optimized and converted to run on a variety of CPUs and GPUs

Both interoperability and hardware access are key to replacing or enhancing modern software stacks with deep neural networks.

Learn how to import an ONNX model into #TensorRT, apply optimizations, and generate a high-performance runtime engine for the datacenter environment through this tutorial from @nvidia. https://t.co/HibMhhQpgn
— ONNX (@onnxai) December 26, 2018

Natively #PyTorch trained EfficientNet-B0, MobileNetV3, MnasNet-A1, MnasNet-B1, FBNet-C. Top-1 accuracies at or better than paper spec or original impl. All cleanly exportable to ONNX and TensorRT to try and compare on your embedded devices. https://t.co/NMvRUrBYFp
— Ross Wightman (@wightmanr) June 13, 2019

It is not always obvious how to optimize an ONNX model for production on GPUs.

ONNX models on NVidia GPUs

DeepDetect supports image classification models in ONNX format on NVidia GPUs. It automatically takes the ONNX model and compiles it into TensorRT format for inference.

This is very useful since it does not require writing code of any sort. It can all be done with two calls to the DD Server REST API.

Under the hood there are two steps:

The ONNX model is passed to a parser that compiles it into NVidia TensorRT format. This format is optimized for NVidia GPU internals.
A heuristic inference engine then empirically selects the best parameters for the compiled TensorRT model. These parameters depend on the GPU, the batch size (i.e. the number of images processed in parallel) and the available memory.
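
The second step can be pictured as a small autotuning loop: run several candidate implementations of the same operation on representative input, time them, and keep the fastest. The sketch below is a toy illustration of that idea in plain Python, not TensorRT code. The function names (`pick_fastest`, `sum_with_loop`, `sum_with_builtin`) are made up for the example, and a list of integers stands in for a batch of images.

```python
import time


def sum_with_loop(values):
    """Candidate 1: explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total


def sum_with_builtin(values):
    """Candidate 2: built-in sum, same result as candidate 1."""
    return sum(values)


def pick_fastest(candidates, sample, repeats=5):
    """Time every candidate on the sample and return (fastest_name, timings).

    The best time out of `repeats` runs is kept for each candidate.
    """
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings


# Hypothetical candidates; both compute the same value in different ways.
candidates = {"loop": sum_with_loop, "builtin": sum_with_builtin}
sample = list(range(128 * 1000))  # stand-in for a batch of inputs

chosen, timings = pick_fastest(candidates, sample)
print(f"selected implementation: {chosen}")
```

Because the choice is measured rather than predicted, the result can differ from one machine to another, which is consistent with the selected parameters depending on the GPU, the batch size and the available memory.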

ONNX image classification inference on GPU

DeepDetect makes this step easy. The final model uses float16 (fp16) for even faster inference.

Service creation via the REST API comes first:

curl -X PUT http://localhost:8080/services/testonnx -d '
{
    "description": "image classification",
    "mllib": "tensorrt",
    "model": {
        "init": "https://deepdetect.com/models/init/desktop/images/classification/resnet_50_onnx.tar.gz",
        "repository": "/dest/path/to/model/"
    },
    "parameters": {
        "input": {
            "connector": "image",
            "height": 224,
            "rgb": true,
            "scale": 0.0039,
            "width": 224
        },
        "mllib": {
            "gpuid": 0,
            "maxBatchSize": 128,
            "datatype": "fp16",
            "maxWorkspaceSize": 6096,
            "nclasses": 1000
        }
    },
    "type": "supervised"
}
'
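
For readers who prefer scripting the call, the same request can be prepared from Python with the standard library. This is a minimal sketch: the helper name `build_service_request` is hypothetical, the payload simply mirrors the curl example above, and the request is only built, not sent, so it can be inspected without a running server.

```python
import json
import urllib.request

DD_URL = "http://localhost:8080"  # same host and port as the curl example


def build_service_request(name, repository):
    """Build (but do not send) the PUT request that creates the service."""
    payload = {
        "description": "image classification",
        "mllib": "tensorrt",
        "model": {
            "init": "https://deepdetect.com/models/init/desktop/images/classification/resnet_50_onnx.tar.gz",
            "repository": repository,
        },
        "parameters": {
            "input": {
                "connector": "image",
                "height": 224,
                "width": 224,
                "rgb": True,
                # About 1/255; presumably rescales 8-bit pixel values
                # to roughly the [0, 1] range (an assumption, not stated
                # in the text above).
                "scale": 0.0039,
            },
            "mllib": {
                "gpuid": 0,
                "maxBatchSize": 128,
                "datatype": "fp16",
                "maxWorkspaceSize": 6096,
                "nclasses": 1000,
            },
        },
        "type": "supervised",
    }
    return urllib.request.Request(
        f"{DD_URL}/services/{name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_service_request("testonnx", "/dest/path/to/model/")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out here because it needs a running DeepDetect server.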

A second call passes an image to the model and triggers the TensorRT automated conversion:

curl -X POST http://localhost:8080/predict -d '
{
    "data": [
        "/path/to/image.jpg"
    ],
    "parameters": {
        "input": {
            "height": 224,
            "width": 224
        },
        "output": {
            "best": 1
        }
    },
    "service": "testonnx"
}
'

Subsequent calls use the optimized model. Processing a single image should take less than 30 ms on a desktop GPU.
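
One way to check this on your own hardware is to time the first call separately from the following ones, since the first call also pays for the TensorRT conversion. The sketch below rests on stated assumptions: `time_calls` and `FakeService` are hypothetical names, and `FakeService` merely simulates a slow first call with `time.sleep`. In practice `predict` would wrap the POST to `/predict` shown above.

```python
import time
from statistics import median


def time_calls(predict, n_calls=5):
    """Call `predict` n_calls times.

    Returns (first_call_ms, median_ms_of_the_remaining_calls).
    """
    if n_calls < 2:
        raise ValueError("n_calls must be at least 2")
    durations_ms = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms[0], median(durations_ms[1:])


class FakeService:
    """Stand-in for a real service: slow on the first call, fast afterwards."""

    def __init__(self):
        self.compiled = False

    def predict(self):
        if not self.compiled:
            time.sleep(0.05)  # simulated one-off conversion cost
            self.compiled = True
        time.sleep(0.001)  # simulated inference


first_ms, steady_ms = time_calls(FakeService().predict)
print(f"first call: {first_ms:.1f} ms, later calls (median): {steady_ms:.1f} ms")
```

Reporting the median of the later calls, rather than the mean of all calls, keeps the one-off conversion time from distorting the steady-state figure.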

18 December 2020
