Training from CSV data

The DD platform can train from CSV (and SVM) format.

Data format

Any CSV file with:

Text fields
Numeric fields
Categorical fields (i.e. a finite set of values, string or numbers, e.g. cities, postal codes, etc…)

You can specify both a training and a testing CSV or letting the DD platform shuffle and splitting the test set with the tsplit parameter.

The DD platform comes with a custom Jupyter UI that allows you to review your data:

CSV data review in DD platform Jupyter UI

See https://deepdetect.com/tutorials/csv-training/ for some more details.

Training from CSV

DD automatically parses and manages the CSV file and data, including the handling of categorical variables and the normalization of input data.

Using the DD platform, from a JupyterLab notebook, start from the code on the right.

csv_train_job = CSV(
  'forest',
  host='deepdetect_training',
  port=8080,
  training_repo="/opt/platform/examples/covert/train.csv",
  model_repo="/opt/platform/models/training/JeanDupont/covert",
  csv_label='Cover_Type',
  csv_id='Id',
  csv_separator=',',
  tsplit=0.2,
  template='mlp',
  layers='[150,150,150]',
  activation="prelu",
  nclasses=7,
  scale= True,
  iterations=10000,
  base_lr=0.001,
  solver_type="AMSGRAD"
)
csv_train_job

This prepares a training job for a 3 layers neural network (MLP) with 150 neurons in every layer.

forest is the example job name

Job names must be unique, replace `forest` with your own

Using job names that reflect your experiments is good practice, e.g. `forest_mlp_150_iter10k_amsgrad`

training_repo specifies the location of the data

Using your own data, set `training_repo` to `/opt/platform/data/JeanDupont/your_dataset/train.csv`, and replace `JeanDupont` with your name

csv_id specifies, when it exists, which of the CSV columns is the identifier of the samples
csv_label specifies which column holds the label of the samples
tsplit specifies the part of the training set used for testing (0.2 for 20%)
templates specifies an MLP, i.e a simple neural network
layers specifies 3 layers of 150 hidden neurons each
prelu specifies PReLU activations
nclasses specifies the number of classes of the problem
solver_type specifies the optimizer, see https://deepdetect.com/api/#launch-a-training-job and solver_type for the many options
base_lr specifies the learning rate. For finetuning object detection models, 1e-4 works well.
gpuid specifies which GPU to use, starting with number 0

Training from CSV data

Data format

Training from CSV

Related