Failures & Debug
Checking error status
If your training job has failed, follow the steps below, in order, to restart it properly:
- First, confirm that training has failed by checking the Jupyter job status. An error would look like this:
Debugging
Many errors come from a wrongly initialized model directory. This happens especially after stopping a training job and restarting it. To fix this issue, try the following:
- From your DD platform Jupyter training job notebook, click on Hard clear.
- Click on Delete service.
If the error says ‘CudaSuccess error’, the GPU used for training does not have enough memory. Make sure to:
- Check the occupancy of the GPU from the DD platform UI; someone else may be using it.
- Lower your batch_size and test_batch_size. If your batch_size was 32, set batch_size to 16 and set iter_size to 2: this is equivalent to a batch size of 32, unrolled into two passes.
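The batch_size and iter_size trade-off above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper name `shrink_batch` is hypothetical and not part of the DD platform API. It assumes the effective batch size is batch_size multiplied by iter_size, as in the 32 = 16 x 2 example.

```python
from typing import Tuple


def shrink_batch(batch_size: int, iter_size: int = 1, factor: int = 2) -> Tuple[int, int]:
    """Hypothetical helper: divide batch_size by `factor` and multiply
    iter_size by the same factor, so batch_size * iter_size is unchanged."""
    if factor < 1 or batch_size % factor != 0:
        raise ValueError("batch_size must be divisible by factor")
    return batch_size // factor, iter_size * factor


if __name__ == "__main__":
    # Start from a batch size of 32 that does not fit in GPU memory.
    batch_size, iter_size = shrink_batch(32)
    # Effective batch size stays the same: 16 * 2 == 32.
    print(batch_size, iter_size, batch_size * iter_size)
```

If the halved batch still does not fit, the same helper can be applied again (for instance 16 and 2 become 8 and 4), then the resulting values set in the training job parameters.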
If the DD platform Jupyter status says error, follow these steps to uncover the error code and message:
- Go to the Logs tab and look for error messages.
- Go to the Widgets.log tab, click Refresh and look for error messages.
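When the logs are long, it can help to copy their text and filter it. The following is a minimal sketch that assumes the log content has been pasted into a Python string; the function name `find_error_lines`, the keyword list, and the sample log lines are all made up for illustration and do not reflect the actual DD platform log format.

```python
import re
from typing import List

# Assumed keywords; adjust them to what the actual logs contain.
ERROR_PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def find_error_lines(log_text: str) -> List[str]:
    """Return the stripped lines of log_text that match one of the keywords."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if ERROR_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample, only to show how the filter behaves.
    sample = "step one finished\nsome ERROR happened here\nstep two finished\n"
    for line in find_error_lines(sample):
        print(line)
```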