Speed up ONNX models with TensorRT

18 December 2020

ONNX models

ONNX is a great initiative to standardize the structure and storage of deep neural networks.

Almost all deep learning frameworks can export to ONNX in one way or another. Implementations still vary, though they are expected to converge eventually. ONNX models are of great interest since:

  • They can easily be shared and moved around along with their weights
  • They can be optimized for and converted to a variety of CPUs and GPUs

Both interoperability and hardware access are key to replacing or enhancing modern software stacks with deep neural networks.

It is not always obvious how to optimize an ONNX model for production on GPUs.

ONNX models on NVidia GPUs

DeepDetect supports image classification models in ONNX format on NVidia GPUs: it automatically takes the ONNX model and compiles it into TensorRT format for inference.

This is very useful since it does not require writing code of any sort. It can all be done with two calls to the DeepDetect server REST API.

Under the hood there are two steps:

  1. The ONNX model is passed to a parser that compiles it into NVidia TensorRT format. This format is optimized for NVidia GPU internals.
  2. A heuristic inference engine then empirically selects the best parameters for the compiled TensorRT model. These parameters depend on the GPU, the batch size (i.e. the number of images processed in parallel) and the available memory.

ONNX image classification inference on GPU

DeepDetect makes this step easy. The final model uses float16 (fp16) for even faster inference.

  • Service creation via the REST API comes first

    curl -X PUT http://localhost:8080/services/testonnx -d '{
        "description": "image classification",
        "mllib": "tensorrt",
        "model": {
            "init": "",
            "repository": "/dest/path/to/model/"
        },
        "parameters": {
            "input": {
                "connector": "image",
                "height": 224,
                "rgb": true,
                "scale": 0.0039,
                "width": 224
            },
            "mllib": {
                "gpuid": 0,
                "maxBatchSize": 128,
                "datatype": "fp16",
                "maxWorkspaceSize": 6096,
                "nclasses": 1000
            }
        },
        "type": "supervised"
    }'
  • A second call passes an image to the model and triggers the automated TensorRT conversion

    curl -X POST http://localhost:8080/predict -d '{
        "data": ["/path/to/image.jpg"],
        "parameters": {
            "input": {
                "height": 224,
                "width": 224
            },
            "output": {
                "best": 1
            }
        },
        "service": "testonnx"
    }'
  • Subsequent calls use the optimized model directly; processing a single image should take under 30 ms on a desktop GPU.
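For scripting, the payloads above are plain JSON. A small stdlib-only check (nothing here is DeepDetect-specific) shows the service-creation structure, and why the scale value is 0.0039: it is approximately 1/255, mapping 8-bit pixel values into [0, 1]:

```python
# Stdlib-only sanity check of the service-creation payload used above.
import json

payload = {
    "description": "image classification",
    "mllib": "tensorrt",
    "model": {"init": "", "repository": "/dest/path/to/model/"},
    "parameters": {
        "input": {"connector": "image", "height": 224, "rgb": True,
                  "scale": 0.0039, "width": 224},
        "mllib": {"gpuid": 0, "maxBatchSize": 128, "datatype": "fp16",
                  "maxWorkspaceSize": 6096, "nclasses": 1000},
    },
    "type": "supervised",
}

# The body posted to /services/testonnx must round-trip as valid JSON
body = json.dumps(payload)
assert json.loads(body)["parameters"]["mllib"]["datatype"] == "fp16"

# scale 0.0039 is ~1/255: it maps 8-bit pixel values into [0, 1]
assert abs(payload["parameters"]["input"]["scale"] * 255 - 1.0) < 0.01
```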