Training jobs & Monitoring
This section walks step by step through generic instructions for launching and monitoring training jobs.
Launching a training job
First, set up your training job and verify that your data is correctly set up, using the DD platform custom Jupyter tooling.
To launch a training job, use the ‘Run training’ button:
Training job setup
Every training job appears in the UI as a badge on the ‘Training’ page.
For some training jobs, an internal setup, e.g. pre-processing the full dataset, can take a few minutes. In that case, the metrics may not appear immediately, and the badge may look like this for a moment:
Training job monitoring
Training jobs can last from minutes to several days. For this reason, the DD platform provides a few tools for monitoring running jobs:
- The ‘Training’ section of the UI reports on all jobs that are currently training:
- The ‘Monitor’ button on the training badge shows metrics and details of the run:
- The DD platform custom Jupyter notebook displays the progress, status and remaining time of the training job:
- The DD platform custom Jupyter notebook allows fine-grained monitoring of the job calls, via the ‘Logs’ tab:
- For even finer-grained information, the ‘Widget logs’ tab can be refreshed when needed. It captures all calls between Jupyter and the Deep Learning server:
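The monitoring tools above are all UI-based. For a long-running job it can also be handy to poll the Deep Learning server from a script. The following is a minimal sketch, not the platform's own tooling: it assumes the server exposes a DeepDetect-style `GET /train?service=...&job=...` status call, and that the JSON reply carries the job state under `head.status` and the metrics under `body.measure`. The server address, service name and job number are placeholders, and the helper names (`status_url`, `summarize`, `poll`) are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request


def status_url(server, service, job):
    """Build the (assumed) training-status URL for a given service and job."""
    query = urllib.parse.urlencode({"service": service, "job": job})
    return f"{server.rstrip('/')}/train?{query}"


def summarize(reply):
    """Pull the job status and latest metrics out of a decoded JSON reply.

    The field names (head.status, body.measure) are assumptions about the
    reply layout; missing fields fall back to safe defaults.
    """
    head = reply.get("head", {})
    measure = reply.get("body", {}).get("measure", {})
    return {"status": head.get("status", "unknown"), "measure": measure}


def poll(server, service, job, interval=30.0, fetch=None):
    """Print a summary every `interval` seconds until the job stops running.

    `fetch` takes a URL and returns the decoded JSON reply; it can be swapped
    out to test the loop without a live server.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    while True:
        summary = summarize(fetch(status_url(server, service, job)))
        print(summary)
        if summary["status"] != "running":
            return summary
        time.sleep(interval)
```

A call such as `poll("http://localhost:8080", "my_service", 1)` would then print one summary per interval and return once the reported status is no longer `running`. Check the actual calls shown in the ‘Widget logs’ tab to confirm the endpoint and reply fields before relying on this.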
Stopping a training job
To stop a training job, use the ‘Delete Service’ button:
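For scripted cleanup, the same action can be sketched as an HTTP call. This is an assumption rather than documented platform behavior: it supposes the Deep Learning server accepts a DeepDetect-style `DELETE /services/<name>` request. The server address and service name are placeholders, and `delete_service_request` is a hypothetical helper that only builds the request and leaves sending it to the caller.

```python
import urllib.parse
import urllib.request


def delete_service_request(server, service):
    """Build (but do not send) a DELETE request for the named service.

    The /services/<name> path is an assumption about the server API.
    """
    name = urllib.parse.quote(service, safe="")
    url = f"{server.rstrip('/')}/services/{name}"
    return urllib.request.Request(url, method="DELETE")


# Sending is left to the caller, for example:
# urllib.request.urlopen(delete_service_request("http://localhost:8080", "my_service"))
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and method first, which matters for an action that cannot be undone from the script.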