Failures & Debug

Checking error status

If your training job has failed, follow the steps below in order to restart it properly:

  • First, confirm that training has failed by checking the Jupyter job status. An error looks like this:

Job error

Debugging

  • Many errors come from a wrongly initialized model directory. This happens especially after stopping a training job and then restarting it. To fix this issue, try the following:

    • From your DD platform Jupyter training job notebook, click on Hard clear

    • Click on Delete service

  • The error says ‘CudaSuccess error’: this means the GPU used for training does not have enough memory. Make sure to:

    • Check the occupancy of the GPU from the DD platform UI: someone else may already be using it.

    • Lower your batch_size and test_batch_size. For example, if your batch_size was 32, set batch_size to 16 and iter_size to 2: this is equivalent to a batch size of 32, unrolled into two passes of 16.

  • If the DD platform Jupyter status says error, follow these steps to uncover the error code and message:

    • Go to the Logs tab and look for error messages

    • Go to the Widgets.log tab, click Refresh and look for error messages
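
For the GPU memory case above, the batch_size and iter_size adjustment can be sketched in Python. This only illustrates the arithmetic and is not part of the DD platform API: the helper name shrink_batch and its factor argument are hypothetical, and the sketch assumes that the effective batch size is batch_size multiplied by iter_size, as in the 32 = 16 x 2 example.

```python
def shrink_batch(batch_size: int, iter_size: int = 1, factor: int = 2) -> dict:
    """Hypothetical helper: divide batch_size by `factor` and multiply
    iter_size by the same factor, so that batch_size * iter_size
    (the assumed effective batch size) stays unchanged."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if batch_size % factor != 0:
        raise ValueError("batch_size must be divisible by factor")
    return {
        "batch_size": batch_size // factor,
        "iter_size": iter_size * factor,
    }


# Example from the text: a batch size of 32 becomes 16 with iter_size 2.
params = shrink_batch(32)
effective = params["batch_size"] * params["iter_size"]
print(params, effective)  # {'batch_size': 16, 'iter_size': 2} 32
```

If the job still runs out of memory, the same step can be applied again (for example 8 with iter_size 4). test_batch_size is lowered separately, since it does not affect the training batch.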

Related