Training jobs & Monitoring

This section walks step by step through the generic instructions for launching and monitoring training jobs.

Launching a training job

First, set up your training job and verify that your data is correctly set up, using the DD platform custom Jupyter tooling.

To launch a training job, use the ‘Run training’ button:

Run training button

Training job setup

Every training job appears in the UI as a badge on the ‘Training’ page:

Training badge

For some training jobs, an internal setup, e.g. pre-processing the full dataset, can take a few minutes. In that case, the metrics may not appear immediately, and the badge may look like this for a moment:

Training badge setup

Training job monitoring

Training jobs can last from minutes to several days. For this reason, the DD platform provides several tools for monitoring running jobs:

  • The ‘Training’ section of the UI lists all the jobs that are currently training:

All training jobs

  • The ‘Monitor’ button on the training badge shows metrics and details of the run:

Monitor training

  • The DD platform custom Jupyter notebook displays the progress, status and remaining time of the training job:

Progression in Jupyter

  • The DD platform custom Jupyter notebook allows fine-grained monitoring of the job calls, via the Logs tab:

Logs tab

  • For even finer-grained information, the Widget logs tab can be refreshed when needed. It captures all calls between Jupyter and the Deep Learning server:

Widget logs tab
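Because training jobs can run for days, it can be convenient to poll the job status from a script rather than watching the UI. The sketch below is only an illustration of that idea, not the platform's own tooling: `wait_for_job`, the `fetch_status` callable, the `"status"` field and the terminal state names are all hypothetical, and you would need to replace `fetch_status` with whatever call your setup uses to query the Deep Learning server.

```python
import time
from typing import Callable, Dict, List

# Hypothetical terminal states; the real platform may use different names.
TERMINAL_STATES = {"finished", "error"}


def wait_for_job(
    fetch_status: Callable[[], Dict],
    poll_interval: float = 30.0,
    max_polls: int = 1000,
) -> List[Dict]:
    """Poll fetch_status until the job reports a terminal state.

    fetch_status is any callable returning a dict with a "status" key
    (an assumed shape, not a documented format). Returns every report
    seen, in order, so that metrics can be inspected afterwards.
    """
    history: List[Dict] = []
    for _ in range(max_polls):
        report = fetch_status()
        history.append(report)
        if report.get("status") in TERMINAL_STATES:
            break
        # Wait before asking again, to avoid flooding the server.
        time.sleep(poll_interval)
    return history


if __name__ == "__main__":
    # Simulated status reports standing in for a real server.
    fake_reports = iter([
        {"status": "running", "iteration": 100},
        {"status": "running", "iteration": 200},
        {"status": "finished", "iteration": 300},
    ])
    for report in wait_for_job(lambda: next(fake_reports), poll_interval=0.0):
        print(report)
```

Passing the status query in as a callable keeps the loop independent of any particular client or endpoint, and `max_polls` bounds the wait if the job never reaches a terminal state.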

Stopping a training job

To stop a training job, use the ‘Delete Service’ button:

Monitor training