Training an object detector

Object detection is the task of finding objects in an image and labeling them.

The output of an object detector is a list of detected objects, with, for every object:

  • Coordinates of the bounding box that encloses the object. A bounding box is described by two points: the top-left corner and the lower-right corner of a rectangle.

  • Estimated label for the object, e.g. cat
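For illustration, the output described above could be modeled with a few lines of Python. This is a sketch only; the class and field names are my own, not a platform API:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One detection: a class label plus its bounding box.

    The box is stored as its top-left (xmin, ymin) and
    lower-right (xmax, ymax) corners, in pixels.
    """
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int

# A detector's output is a list of such objects, e.g.:
detections = [DetectedObject("cat", 3086, 1296, 3623, 1607)]
```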

Data format

Object location text file format (required for every image):

<label> <xmin> <ymin> <xmax> <ymax>

Example of object location text file for the image below

Example image

file bbox_img_3333.txt

1 3086 1296 3623 1607
1 2896 1340 3205 1539
1 2519 1326 2694 1427
1 2330 1197 2580 1392
1 1781 1306 1885 1390
1 2013 1285 2057 1325
1 2108 1252 2175 1333
1 2161 1292 2278 1348
1 252 1266 627 1454
1 620 1285 799 1376
  • 1 indicates class number 1, here a car
  • Coordinates are in pixels with respect to the original image size
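A file in this format can be parsed with a few lines of Python. This is a sketch; `parse_bbox_line` is a hypothetical helper, not part of the platform:

```python
def parse_bbox_line(line):
    """Parse one '<label> <xmin> <ymin> <xmax> <ymax>' line
    into (class_number, (xmin, ymin, xmax, ymax))."""
    label, xmin, ymin, xmax, ymax = line.split()
    return int(label), (int(xmin), int(ymin), int(xmax), int(ymax))

# First line of bbox_img_3333.txt above:
cls, box = parse_bbox_line("1 3086 1296 3623 1607")
```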

Object detection main image list format:

/path/to/image.jpg /path/to/bbox_file_image.txt

Object detection main image list example from /opt/platform/examples/cars/train.txt:

/opt/platform/examples/cars/imgs//youtube_frames/toronto-main-street-000147.jpg /opt/platform/examples/cars/bbox//toronto-main-street-000147.txt
/opt/platform/examples/cars/imgs//youtube_frames/crazy-000022.jpg /opt/platform/examples/cars/bbox//crazy-000022.txt
/opt/platform/examples/cars/imgs//youtube_frames/mass6-000363.jpg /opt/platform/examples/cars/bbox//mass6-000363.txt
/opt/platform/examples/cars/imgs//normal_rgb_images/tme17/Right/010475-R.jpg /opt/platform/examples/cars/bbox//010475-R.txt

We suggest organizing the dataset files as follows:
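One possible layout, inferred from the example paths on this page (illustrative only, not mandated by the platform):

```
/opt/platform/examples/cars/
├── train.txt            # main image list for training
├── test.txt             # main image list for testing
├── imgs/                # images, possibly in subdirectories
│   └── youtube_frames/
│       └── toronto-main-street-000147.jpg
└── bbox/                # one object location .txt per image
    └── toronto-main-street-000147.txt
```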


The DD platform has the following requirements for training from images for object detection:

  • All data must be in image format; most encodings are supported (e.g. png, jpg, …)
  • For every image there is a text file describing the class and location of each object in the image. See the format on the right. If an image has no bounding boxes, create an empty text file.
  • A main text file lists all image paths and their object location file counterparts, using a space as separator. See on the right for the data format and an example.

  • You need to prepare both a train.txt and a test.txt file, for training and testing purposes.

The DD platform comes with a custom Jupyter UI that allows testing your object detection dataset prior to training:

Object detection data check in DD platform Jupyter UI

Training an object detector

Using the DD platform, from a JupyterLab notebook, start from the code on the right.

Object detection notebook snippet:

img_obj_detect = Detection(
    training_repo="/opt/platform/examples/cars/train.txt",
    testing_repo="/opt/platform/examples/cars/test.txt",
    # … remaining parameters, described below
)

This prepares for training an object detector with the following parameters:

  • cars is the example job name
  • training_repo specifies the location of the data
  • template specifies an SSD-300 architecture that is fast and has good accuracy. See the recommended models section.

  • img_width and img_height specify the input size of the image, see the recommended models section to adapt to other architectures available.

  • db_width and db_height specify the image input size from which data augmentation is applied during training. Typically, zooming and distortions yield more accurate and robust models. A good rule of thumb is to use roughly twice the architecture input size (e.g. 300x300 -> 512x512 and 512x512 -> 1024x1024).

  • mirror activates mirroring of inputs as data augmentation for both the input image and the bounding box

  • rotate activates rotation of inputs as data augmentation for both the input image and the bounding box (e.g. useful for satellite images, …)

  • finetune automatically prepares the network architecture for finetuning

  • weights specifies the pre-trained model weights to start training from

  • solver_type specifies the optimizer; see the solver_type documentation for the many available options

  • base_lr specifies the learning rate. For finetuning object detection models, 1e-4 works well.

  • gpuid specifies which GPU to use, starting with number 0
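Putting the parameters above together, a full configuration might look like the sketch below. It is expressed as a plain dict because the complete `Detection` signature is not shown here; the keyword names follow the descriptions above, the `template` and `weights` values are taken from the recommended models table (SSD-300), and the `solver_type` value is an assumption:

```python
# Illustrative parameter set only, not a verified Detection() call.
detection_params = dict(
    training_repo="/opt/platform/examples/cars/train.txt",
    testing_repo="/opt/platform/examples/cars/test.txt",
    template="ssd_300",              # SSD-300: very fast, good accuracy
    img_width=300, img_height=300,   # SSD-300 input size
    db_width=512, db_height=512,     # roughly 2x input size, per rule of thumb
    mirror=True,                     # mirror both image and bounding boxes
    rotate=False,                    # enable e.g. for satellite images
    finetune=True,                   # prepare the network for finetuning
    weights="ssd_300/VGG_rotate_generic_detect_v2_SSD_rotate_300x300_iter_115000.caffemodel",
    solver_type="SGD",               # assumption: one of the many optimizers
    base_lr=1e-4,                    # works well for finetuning detectors
    gpuid=0,                         # first GPU
)
```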

The platform has many neural network architectures and pre-trained models built in for object detection, ranging from state-of-the-art architectures such as SSD, SSD with ResNet tips, and RefineDet, to low-memory SqueezeNet-SSD and MobileNet-SSD.

Below is a list of recommended models for object detection from which to choose the best one for your task.

| Model | Template | Image size | Pre-Trained (/opt/platform/models/pretrained) | Recommendation |
|---|---|---|---|---|
| SSD-300 | ssd_300 | 300x300 | ssd_300/VGG_rotate_generic_detect_v2_SSD_rotate_300x300_iter_115000.caffemodel | Very Fast / Good accuracy / embedded & desktops |
| SSD-300-res-128 | ssd_300_res_128 | 300x300 | ssd_300_res_128/VGG_fix_pretrain_ilsvrc_res_pred_128_generic_detect_v2_SSD_fix_pretrain_ilsvrc_res_pred_128_300x300_iter_184784.caffemodel | Fast / Very good accuracy / desktops |
| SSD-512 | ssd_512 | 512x512 | ssd_512/VGG_fix_512_generic_detect_v2_SSD_fix_512_512x512_iter_180000.caffemodel | Fast / Very good accuracy / desktops |
| RefineDet-512 | refinedet_512 | 512x512 | refinedet_512/VOC0712_refinedet_vgg16_512x512_iter_120000.caffemodel | Fast / Excellent accuracy / desktops |
| SqueezeNet-SSD | squeezenet_ssd | 300x300 | squeezenet_ssd/SqueezeNet_generic_detect_v2_SqueezeNetSSD_300x300_iter_200000.caffemodel | Extremely Fast / Good accuracy / embedded |

For a full list of available templates and models, see