Training a model from a dataset in CSV format
This tutorial walks you through training and using a machine learning neural network model to estimate the forest cover type of patches of land. It makes use of the well-known ‘Cover Type’ dataset, as presented in the Kaggle competition https://www.kaggle.com/c/forest-cover-type-prediction.
In summary, a CSV file contains numerical data about patches of forest land, and we will build a model that estimates the cover type of each patch, out of 7 categories (e.g. spruce/fir, aspen, …). See https://www.kaggle.com/c/forest-cover-type-prediction/data for an explanation of the data themselves.
Getting the dataset
Let us create a dedicated repository:
mkdir models
mkdir models/covert
The data can be obtained either from Kaggle or from http://juban.free.fr/dd/examples/all/forest_type/train.csv.tar.bz2
cd models/covert
wget http://juban.free.fr/dd/examples/all/forest_type/train.csv.tar.bz2
tar xvjf train.csv.tar.bz2
You can take a look at the raw data:
head -n 5 train.csv
The field ‘Id’ holds the identifier of each training example, and ‘Cover_Type’ holds the reference label, i.e. the forest cover type, an integer between 1 and 7. There are 15120 training examples in total.
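You can also check the number of examples; the count below assumes the first line of the CSV is the header:
wc -l train.csv
This should report 15121 lines, i.e. the 15120 training examples plus the header.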
Creating the machine learning service
The first step with DeepDetect is to start the server:
./dede
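If you want to check that the server is up before going further, you can query its info resource (assuming the server listens on the default port 8080, as in the calls below):
curl -X GET "http://localhost:8080/info"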
and create a machine learning service that uses a multi-layer perceptron with three hidden layers of 150 neurons each and prelu activations:
curl -X PUT "http://localhost:8080/services/covert" -d "{\"mllib\":\"caffe\",\"description\":\"forest classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"csv\"},\"mllib\":{\"template\":\"mlp\",\"nclasses\":7,\"layers\":[150,150,150],\"activation\":\"prelu\"}},\"model\":{\"templates\":\"../templates/caffe/\",\"repository\":\"models/covert\"}}"
yields:
{"status":{"code":201,"msg":"Created"}}
Training and testing the service
Let us now train a statistical model in the form of the neural network defined above. Below is the full API call for launching an asynchronous training job on the GPU. Take a look at it; before proceeding with the call, let us review it in detail.
curl -X POST "http://localhost:8080/train" -d "{\"service\":\"covert\",\"async\":true,\"parameters\":{\"mllib\":{\"gpu\":true,\"solver\":{\"iterations\":1000,\"test_interval\":100},\"net\":{\"batch_size\":512}},\"input\":{\"label_offset\":-1,\"label\":\"Cover_Type\",\"id\":\"Id\",\"separator\":\",\",\"shuffle\":true,\"test_split\":0.1,\"scale\":true},\"output\":{\"measure\":[\"acc\",\"mcll\",\"f1\"]}},\"data\":[\"models/covert/train.csv\"]}"
First and foremost, we are using our newly created service to train a model. This means that our service will be busy for some time, and we cannot use it for anything else but reviewing the training call status and progress. Other services, if any, would of course remain available. In more detail:
* async allows starting a non-blocking (i.e. asynchronous) call
* gpu tells the server that we would like to use the GPU. Note that in the absence of a GPU, the server automatically falls back to the CPU, without warning
* iterations is the number of training iterations after which the training terminates automatically. Until termination it is possible to get the status and progress of the call, as demonstrated below
* label_offset tells the CSV input connector that the label identifiers run from 1 to 7 instead of 0 to 6. This is required here in order not to miss a class
* label identifies the reference label column of the CSV dataset
* id is the column identifier of the samples
* test_split tells the input connector to keep 90% of the dataset for training and 10% for assessing the quality of the model being built
* shuffle tells the input connector to shuffle both the training and testing sets, which is especially useful for cross validation
* scale tells the input connector to scale all data within [0,1] in order to get similar sensitivity across all dimensions. This usually helps the optimization procedure that underlies learning a neural net
* measure lists the assessment metrics of the model being built: acc is for accuracy, mcll for multi-class log loss and f1 for F1-score
* data holds the dataset file
For more details on the training phase options and parameters, see the API.
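For readability, here is the same training call with its JSON payload spread over multiple lines; it is strictly equivalent to the one-line call above:
curl -X POST "http://localhost:8080/train" -d '
{
  "service": "covert",
  "async": true,
  "parameters": {
    "mllib": {
      "gpu": true,
      "solver": {"iterations": 1000, "test_interval": 100},
      "net": {"batch_size": 512}
    },
    "input": {
      "label_offset": -1,
      "label": "Cover_Type",
      "id": "Id",
      "separator": ",",
      "shuffle": true,
      "test_split": 0.1,
      "scale": true
    },
    "output": {"measure": ["acc", "mcll", "f1"]}
  },
  "data": ["models/covert/train.csv"]
}'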
Let us now run the call above; the immediate answer is:
{"status":{"code":201,"msg":"Created"},"head":{"method":"/train","job":1,"status":"running"}}
indicating that the call was successful and the training is now running.
You can get the status of the call anytime with another call:
curl -X GET "http://localhost:8080/train?service=covert&job=1"
{"status":{"code":200,"msg":"OK"},"head":{"method":"/train","job":1,"status":"finished","time":61.0},"body":{"measure":{"train_loss":0.6788941025733948,"mcll":0.6393973624892094,"recall":0.7269925057270527,"iteration":999.0,"precision":0.7266408723876882,"f1":0.7268166465273875,"accp":0.7275132275132276,"acc":0.7275132275132276}}
Here the final quality of the model can be read as 72.75% accuracy on the testing portion of the dataset. In order to train a much better model, you can increase the number of iterations and the batch size, as well as play with the number of layers and their size. Typically, training the perceptron above for 10000 iterations with batch_size 5210 would yield an accuracy between 82% and 84%.
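As a sketch, such a longer run only changes iterations and batch_size in the call above; the figures below are the ones quoted in the previous paragraph and may need adjusting to your hardware:
curl -X POST "http://localhost:8080/train" -d "{\"service\":\"covert\",\"async\":true,\"parameters\":{\"mllib\":{\"gpu\":true,\"solver\":{\"iterations\":10000,\"test_interval\":100},\"net\":{\"batch_size\":5210}},\"input\":{\"label_offset\":-1,\"label\":\"Cover_Type\",\"id\":\"Id\",\"separator\":\",\",\"shuffle\":true,\"test_split\":0.1,\"scale\":true},\"output\":{\"measure\":[\"acc\",\"mcll\",\"f1\"]}},\"data\":[\"models/covert/train.csv\"]}"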
The status call can be repeated as needed until the status indicates that the training is finished, after which the job is deleted.
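Should you need to interrupt a training job before it finishes, it can be deleted in a similar way (a sketch using the same service and job number as above):
curl -X DELETE "http://localhost:8080/train?service=covert&job=1"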
The trained model is now available on disk in the models/covert repository. For now we show below how to use the current service and model to make predictions from new data. Note, however, that if you turn the server off or delete the service without wiping out the files, you will be able to use the trained model from another service.
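As a sketch of that scenario, not needed for the rest of this tutorial, the service could be removed (by default this should leave the files in models/covert in place) and a new service could later be created over the same repository:
curl -X DELETE "http://localhost:8080/services/covert"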
Prediction for new data
The service is ready for the predict resource of the API to be used.
Prediction from file
The test data file can be obtained either from Kaggle or from http://juban.free.fr/dd/examples/all/forest_type/test.csv.tar.bz2
cd models/covert
wget http://juban.free.fr/dd/examples/all/forest_type/test.csv.tar.bz2
tar xvjf test.csv.tar.bz2
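You can check the size of the extracted test set; the count below assumes the first line of the file is the header:
wc -l test.csv
This should report 565893 lines, i.e. the 565892 test samples plus the header.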
The full test set has 565892 samples, so let us lower this to 10 samples (plus the header line) so that we can inspect the results more easily:
head -n 11 test.csv > test10.csv
and make a predict call:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"id\":\"Id\",\"separator\":\",\",\"scale\":true}},\"data\":[\"models/covert/test10.csv\"]}"
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":16.0,"service":"covert"},"body":{"predictions":[{"uri":"15121","loss":0.0,"classes":{"prob":0.9999997615814209,"cat":"6"}},{"uri":"15122","loss":0.0,"classes":{"prob":0.9962882995605469,"cat":"5"}},{"uri":"15130","loss":0.0,"classes":{"prob":0.9999340772628784,"cat":"1"}},{"uri":"15123","loss":0.0,"classes":{"prob":1.0,"cat":"3"}},{"uri":"15124","loss":0.0,"classes":{"prob":1.0,"cat":"3"}},{"uri":"15128","loss":0.0,"classes":{"prob":1.0,"cat":"1"}},{"uri":"15125","loss":0.0,"classes":{"prob":0.9999998807907105,"cat":"3"}},{"uri":"15126","loss":0.0,"classes":{"prob":0.7535045146942139,"cat":"3"}},{"uri":"15129","loss":0.0,"classes":{"prob":0.9999986886978149,"cat":"1"}},{"uri":"15127","loss":0.0,"classes":{"prob":1.0,"cat":"1"}}]}}
In the results above:
* uri is the Id of the sample in the test set
* prob is the probability of the predicted class, i.e. the class with the highest probability
So for instance, sample 15121 was predicted as being of forest cover type 6 with probability 0.99. Do not forget that we did use a label_offset when training the service. So 6 here corresponds to class 7 on page https://www.kaggle.com/c/forest-cover-type-prediction/data, which is a Krummholz cover type.
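If the single-line JSON response is difficult to read, it can be piped through a formatter, assuming Python is available on the client machine:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"id\":\"Id\",\"separator\":\",\",\"scale\":true}},\"data\":[\"models/covert/test10.csv\"]}" | python -m json.tool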
Prediction from in-memory data
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"connector\":\"csv\",\"scale\":true,\"min_vals\":[1863,0,0,0,-146,0,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],\"max_vals\":[3849,360,52,1343,554,6890,254,254,248,6993,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}},\"data\":[\"2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\"]}"
Note that, in contrast with the file-based prediction above, the in-memory data string has no header line and no Id column: it holds the raw feature values only. Also, since the connector has no file from which to derive the scaling bounds, the per-column min_vals and max_vals used for scaling are passed explicitly along with scale.