Training an object detector

Object detection is the task of finding objects in an image and labeling them.

The output of an object detector is a list of detected objects, each with:

  • Coordinates of the bounding box that encloses the object. A bounding box is a rectangle described by two points: its top-left corner and its lower-right corner.

  • Estimated label for the object, e.g. cat

Data format

Object location text file format (required for every image):

<label> <xmin> <ymin> <xmax> <ymax>

Example of object location text file for the image below

Example image

file bbox_img_3333.txt

1 3086 1296 3623 1607
1 2896 1340 3205 1539
1 2519 1326 2694 1427
1 2330 1197 2580 1392
1 1781 1306 1885 1390
1 2013 1285 2057 1325
1 2108 1252 2175 1333
1 2161 1292 2278 1348
1 252 1266 627 1454
1 620 1285 799 1376
  • 1 indicates class number 1, here a car
  • Coordinates are in pixels, relative to the original image size
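
The format above is simple enough to read back and sanity-check before training. The following is a minimal sketch, not part of the DD platform: the `Box` type and the `parse_bbox_text` function are hypothetical names introduced for illustration. It assumes one object per line, whitespace-separated integer fields, and a top-left corner that lies strictly before the lower-right corner.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One annotated object, in pixel coordinates of the original image."""
    label: int
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def parse_bbox_text(text: str) -> List[Box]:
    """Parse '<label> <xmin> <ymin> <xmax> <ymax>' lines into Box tuples.

    Blank lines are skipped, so an empty file yields an empty list.
    """
    boxes = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"line {lineno}: expected 5 fields, got {len(fields)}")
        box = Box(*(int(f) for f in fields))
        if box.xmin >= box.xmax or box.ymin >= box.ymax:
            raise ValueError(f"line {lineno}: top-left must precede lower-right")
        boxes.append(box)
    return boxes


# Two lines taken from the example file above.
sample = "1 3086 1296 3623 1607\n1 2896 1340 3205 1539\n"
boxes = parse_bbox_text(sample)
print(boxes[0])
```

Running a check like this over every location file catches swapped corners and missing fields early, before they surface as training errors.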

Object detection main image list format:

/path/to/image.jpg /path/to/bbox_file_image.txt

Object detection main image list example from /opt/platform/examples/cars/train.txt:

/opt/platform/examples/cars/imgs//youtube_frames/toronto-main-street-000147.jpg /opt/platform/examples/cars/bbox//toronto-main-street-000147.txt
/opt/platform/examples/cars/imgs//youtube_frames/crazy-000022.jpg /opt/platform/examples/cars/bbox//crazy-000022.txt
/opt/platform/examples/cars/imgs//youtube_frames/mass6-000363.jpg /opt/platform/examples/cars/bbox//mass6-000363.txt
/opt/platform/examples/cars/imgs//normal_rgb_images/tme17/Right/010475-R.jpg /opt/platform/examples/cars/bbox//010475-R.txt
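
In the example above, each object location file shares its base name with the image it describes. Assuming your dataset follows the same naming convention, a list line can be derived from the image path alone. The helper `list_line` below is a hypothetical name, not a platform function; it also rejects paths containing spaces, since a space is the separator.

```python
import posixpath


def list_line(image_path: str, bbox_dir: str) -> str:
    """Return '<image path> <bbox path>' for one image.

    Assumes the bbox file is named after the image, with a .txt extension.
    """
    stem = posixpath.splitext(posixpath.basename(image_path))[0]
    bbox_path = posixpath.join(bbox_dir, stem + ".txt")
    if " " in image_path or " " in bbox_path:
        raise ValueError("paths must not contain spaces")
    return f"{image_path} {bbox_path}"


line = list_line(
    "/opt/platform/examples/cars/imgs/youtube_frames/crazy-000022.jpg",
    "/opt/platform/examples/cars/bbox",
)
print(line)
```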

We suggest organizing the dataset files as follows:


The DD platform has the following requirements for training from images for object detection:

  • All data must be in image format; most encodings are supported (e.g. png, jpg, …)
  • Every image has a text file describing the class and location of the objects it contains. See the Data format section above. If an image has no bounding boxes, create an empty text file.

If you receive an exception during the forward or backward pass through the network, and it is not caused by memory or other problems, check that n_classes is set to your expected number of classes plus 1.

  • A main text file lists every image path alongside its object location file, using a space as the separator. See the Data format section above for the format and an example.

  • You need to prepare both a train.txt and test.txt file for training and testing purposes.
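
The requirements above can be sketched as a small preparation step. This is an illustration under stated assumptions, not a platform utility: `write_splits` is a hypothetical name, the 80/20 split is an arbitrary default, and a missing object location file is taken to mean the image contains no objects.

```python
import random
from pathlib import Path
from typing import List, Tuple


def write_splits(
    pairs: List[Tuple[str, str]],
    out_dir: str,
    test_fraction: float = 0.2,  # assumed ratio, not prescribed by the platform
    seed: int = 0,
) -> Tuple[int, int]:
    """Write train.txt and test.txt from (image path, bbox path) pairs.

    A missing bbox file is created empty ("no objects in this image").
    Returns the number of lines written to train.txt and test.txt.
    """
    for _, bbox_path in pairs:
        bbox = Path(bbox_path)
        if not bbox.exists():
            bbox.parent.mkdir(parents=True, exist_ok=True)
            bbox.touch()

    # Shuffle with a fixed seed so the split is reproducible.
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction)) if len(shuffled) > 1 else 0
    splits = {"test.txt": shuffled[:n_test], "train.txt": shuffled[n_test:]}

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        lines = [f"{img} {bbox}" for img, bbox in rows]
        (out / name).write_text("\n".join(lines) + ("\n" if lines else ""))
    return len(splits["train.txt"]), len(splits["test.txt"])
```

Creating empty files silently is convenient but can hide forgotten annotations, so you may prefer to log or reject missing files instead.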

The DD platform comes with a custom Jupyter UI that lets you check your object detection dataset before training:

Object detection data check in DD platform Jupyter UI

Training an object detector

Using the DD platform, from a JupyterLab notebook, start from the code below.

Object detection notebook snippet:

img_obj_detect = Detection(
  training_repo="/opt/platform/examples/cars/train.txt",
  testing_repo="/opt/platform/examples/cars/test.txt",
  # ... remaining parameters, described in the list below
)

This prepares for training an object detector with the following parameters:

  • cars is the example job name
  • training_repo specifies the location of the data
  • template specifies an SSD-300 architecture that is fast and has good accuracy. See the recommended models section.

  • img_width and img_height specify the input size of the image; see the recommended models section to adapt them to the other available architectures.

  • db_width and db_height specify the image size from which data augmentation is applied during training. Zooming and distortions typically yield more accurate and robust models. A good rule of thumb is to use roughly twice the architecture input size (e.g. 300x300 -> 512x512 and 512x512 -> 1024x1024).

  • mirror activates mirroring of inputs as data augmentation for both the input image and the bounding box

  • rotate activates rotation of inputs as data augmentation for both the input image and the bounding box (e.g. useful for satellite images, …)

  • finetune automatically prepares the network architecture for finetuning

  • weights specifies the pre-trained model weights to start training from

  • solver_type specifies the optimizer; see the solver_type API documentation for the many options

  • base_lr specifies the learning rate. For finetuning object detection models, 1e-4 works well.

  • gpuid specifies which GPU to use, starting with number 0
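
Mirroring has to transform the bounding boxes together with the pixels, otherwise the labels no longer match the image. As a minimal sketch of the idea, assuming a horizontal flip and coordinates measured as edge positions between 0 and the image width: `mirror_box` is a hypothetical helper, not the platform's implementation, and the width of 4000 pixels is made up for the example.

```python
from typing import Tuple

BoxXYXY = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax)


def mirror_box(box: BoxXYXY, image_width: int) -> BoxXYXY:
    """Flip a box horizontally inside an image of the given width.

    The left and right edges swap roles, so xmin stays smaller than xmax.
    """
    xmin, ymin, xmax, ymax = box
    return (image_width - xmax, ymin, image_width - xmin, ymax)


# A box from the example file, flipped in a hypothetical 4000-pixel-wide image.
flipped = mirror_box((252, 1266, 627, 1454), image_width=4000)
print(flipped)
```

Applying the flip twice returns the original box, which is a quick way to test any such transform.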

The platform has many neural network architectures and pre-trained models built in for object detection. These range from state-of-the-art architectures such as SSD, SSD with resnet tips, and RefineDet, to low-memory Squeezenet-SSD and Mobilenet-SSD.

Below is a list of recommended models for object detection, from which to choose the best fit for your task.

Model | Image size | Recommendation | Pre-Trained (/opt/platform/models/pretrained)
ssd_300 | 300x300 | Very Fast / Good accuracy / embedded & desktops | ssd_300/VGG_rotate_generic_detect_v2_SSD_rotate_300x300_iter_115000.caffemodel
ssd_300_res_128 | 300x300 | Fast / Very good accuracy / desktops | ssd_300_res_128/VGG_fix_pretrain_ilsvrc_res_pred_128_generic_detect_v2_SSD_fix_pretrain_ilsvrc_res_pred_128_300x300_iter_184784.caffemodel
ssd_512 | 512x512 | Fast / Very good accuracy / desktops | ssd_512/VGG_fix_512_generic_detect_v2_SSD_fix_512_512x512_iter_180000.caffemodel
refinedet_512 | 512x512 | Fast / Excellent accuracy / desktops | refinedet_512/VOC0712_refinedet_vgg16_512x512_iter_120000.caffemodel
squeezenet_ssd | 300x300 | Extremely Fast / Good accuracy / embedded | squeezenet_ssd/SqueezeNet_generic_detect_v2_SqueezeNetSSD_300x300_iter_200000.caffemodel
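
When switching templates, img_width, img_height, db_width and db_height need to change together. One way to keep them consistent is a small lookup, sketched below. The input sizes are copied from the table above and the augmentation sizes from the rule of thumb given earlier (300 -> 512, 512 -> 1024); the dictionaries and the `size_params` helper are illustrative names, not part of the platform.

```python
# Input sizes copied from the recommended models table.
INPUT_SIZE = {
    "ssd_300": 300,
    "ssd_300_res_128": 300,
    "ssd_512": 512,
    "refinedet_512": 512,
    "squeezenet_ssd": 300,
}

# Augmentation sizes from the rule of thumb: 300 -> 512 and 512 -> 1024.
DB_SIZE = {300: 512, 512: 1024}


def size_params(template: str) -> dict:
    """Return matching img_* and db_* values for a recommended template."""
    size = INPUT_SIZE[template]
    db = DB_SIZE[size]
    return {"img_width": size, "img_height": size, "db_width": db, "db_height": db}


print(size_params("ssd_300"))
```

The db sizes are only a suggested starting point; other values may suit your data better.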

Download the pretrained weights file for these models:

For a full list of available templates and models, see