Training a model from a dataset in CSV format
This tutorial walks you through training and using a machine learning neural network model to estimate the forest cover type of patches of land. It makes use of the well-known ‘Cover Type’ dataset, as presented in the Kaggle competition https://www.kaggle.com/c/forest-cover-type-prediction.
In summary, a CSV file contains numerical data about patches of forest land, and we will build a model that estimates the cover type of each patch, out of 7 categories (e.g. spruce/fir, aspen, …). See https://www.kaggle.com/c/forest-cover-type-prediction/data for an explanation of the data themselves.
Getting the dataset
Let us create a dedicated repository:
mkdir models
mkdir models/covert
The data can be obtained either from Kaggle or from http://juban.free.fr/dd/examples/all/forest_type/train.csv.tar.bz2
cd models/covert
wget http://juban.free.fr/dd/examples/all/forest_type/train.csv.tar.bz2
tar xvjf train.csv.tar.bz2
You can take a look at the raw data:
head -n 5 train.csv
The field ‘Id’ holds the identifier of each training example, and ‘Cover_Type’ holds the reference label, i.e. the forest cover type, an integer between 1 and 7. There are 15120 training examples in total.
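You can also check the number of examples; the count below assumes the first line of the CSV is the header:
wc -l train.csv
This should report 15121 lines, i.e. the 15120 training examples plus the header.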
Creating the machine learning service
The first step with DeepDetect is to start the server:
./dede
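If you want to check that the server is up before going further, you can query its info resource (assuming the server listens on the default port 8080, as in the calls below):
curl -X GET "http://localhost:8080/info"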
and create a machine learning service that uses a multi-layer perceptron with three hidden layers of 150 neurons each and prelu activations:
curl -X PUT "http://localhost:8080/services/covert" -d "{\"mllib\":\"caffe\",\"description\":\"forest classification service\",\"type\":\"supervised\",\"parameters\":{\"input\":{\"connector\":\"csv\"},\"mllib\":{\"template\":\"mlp\",\"nclasses\":7,\"layers\":[150,150,150],\"activation\":\"prelu\"}},\"model\":{\"templates\":\"../templates/caffe/\",\"repository\":\"models/covert\"}}"
yields:
{"status":{"code":201,"msg":"Created"}}
Training and testing the service
Let us now train a statistical model in the form of the neural network defined above. Below is the full API call for launching an asynchronous training job on the GPU. Take a look at it; before proceeding with the call, let us review it in detail.
curl -X POST "http://localhost:8080/train" -d "{\"service\":\"covert\",\"async\":true,\"parameters\":{\"mllib\":{\"gpu\":true,\"solver\":{\"iterations\":1000,\"test_interval\":100},\"net\":{\"batch_size\":512}},\"input\":{\"label_offset\":-1,\"label\":\"Cover_Type\",\"id\":\"Id\",\"separator\":\",\",\"shuffle\":true,\"test_split\":0.1,\"scale\":true},\"output\":{\"measure\":[\"acc\",\"mcll\",\"f1\"]}},\"data\":[\"models/covert/train.csv\"]}"
First and foremost, we are using our newly created service to train a model. This means that our service will be busy for some time, and we cannot use it for anything else but reviewing the training call status and progress. Other services, if any, would of course remain available. In more detail:
* async allows starting a non-blocking (i.e. asynchronous) call
* gpu tells the server that we would like to use the GPU. Note that in the absence of a GPU, the server automatically falls back to the CPU, without warning
* iterations is the number of training iterations after which the training terminates automatically. Until termination it is possible to get the status and progress of the call, as demonstrated below
* label_offset tells the CSV input connector that the label identifiers run from 1 to 7 instead of 0 to 6. This is required here in order not to miss a class
* label identifies the reference label column of the CSV dataset
* id is the column identifier of the samples
* test_split tells the input connector to keep 90% of the dataset for training and 10% for assessing the quality of the model being built
* shuffle tells the input connector to shuffle both the training and testing sets, which is especially useful for cross validation
* scale tells the input connector to scale all data within [0,1] in order to get similar sensitivity across all dimensions. This usually helps the optimization procedure that underlies learning a neural net
* measure lists the assessment metrics of the model being built: acc is for accuracy, mcll for multi-class log loss and f1 for F1-score
* data holds the dataset file
For more details on the training phase options and parameters, see the API.
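For readability, here is the same training call with its JSON payload spread over multiple lines; it is strictly equivalent to the one-line call above:
curl -X POST "http://localhost:8080/train" -d '
{
  "service": "covert",
  "async": true,
  "parameters": {
    "mllib": {
      "gpu": true,
      "solver": {"iterations": 1000, "test_interval": 100},
      "net": {"batch_size": 512}
    },
    "input": {
      "label_offset": -1,
      "label": "Cover_Type",
      "id": "Id",
      "separator": ",",
      "shuffle": true,
      "test_split": 0.1,
      "scale": true
    },
    "output": {"measure": ["acc", "mcll", "f1"]}
  },
  "data": ["models/covert/train.csv"]
}'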
Let us now run the call above; the immediate answer is:
{"status":{"code":201,"msg":"Created"},"head":{"method":"/train","job":1,"status":"running"}}
indicating that the call was successful and the training is now running.
You can get the status of the call anytime with another call:
curl -X GET "http://localhost:8080/train?service=covert&job=1"
{"status":{"code":200,"msg":"OK"},"head":{"method":"/train","job":1,"status":"finished","time":61.0},"body":{"measure":{"train_loss":0.6788941025733948,"mcll":0.6393973624892094,"recall":0.7269925057270527,"iteration":999.0,"precision":0.7266408723876882,"f1":0.7268166465273875,"accp":0.7275132275132276,"acc":0.7275132275132276}}
Here the final quality of the model can be read as 72.75% accuracy on the testing portion of the dataset. In order to train a much better model, you can increase the number of iterations and the batch size, as well as play with the number of layers and their size. Typically, training the perceptron above for 10000 iterations with batch_size 5210 would yield an accuracy between 82% and 84%.
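As a sketch, such a longer run only changes iterations and batch_size in the call above; the figures below are the ones quoted in the previous paragraph and may need adjusting to your hardware:
curl -X POST "http://localhost:8080/train" -d "{\"service\":\"covert\",\"async\":true,\"parameters\":{\"mllib\":{\"gpu\":true,\"solver\":{\"iterations\":10000,\"test_interval\":100},\"net\":{\"batch_size\":5210}},\"input\":{\"label_offset\":-1,\"label\":\"Cover_Type\",\"id\":\"Id\",\"separator\":\",\",\"shuffle\":true,\"test_split\":0.1,\"scale\":true},\"output\":{\"measure\":[\"acc\",\"mcll\",\"f1\"]}},\"data\":[\"models/covert/train.csv\"]}"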
The status call can be repeated as needed until the status indicates that the training is finished, after which the job is deleted.
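Should you need to interrupt a training job before it finishes, it can be deleted in a similar way (a sketch using the same service and job number as above):
curl -X DELETE "http://localhost:8080/train?service=covert&job=1"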
The trained model is now available on disk in the models/covert repository. For now we show below how to use the current service and model to make predictions from new data. Note, however, that if you turn the server off or delete the service without wiping out the files, you will be able to use the trained model from another service.
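As a sketch of that scenario, not needed for the rest of this tutorial, the service could be removed (by default this should leave the files in models/covert in place) and a new service could later be created over the same repository:
curl -X DELETE "http://localhost:8080/services/covert"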
Prediction for new data
The service is ready for the predict resource of the API to be used.
Prediction from file
The test data file can be obtained either from Kaggle or from http://juban.free.fr/dd/examples/all/forest_type/test.csv.tar.bz2
cd models/covert
wget http://juban.free.fr/dd/examples/all/forest_type/test.csv.tar.bz2
tar xvjf test.csv.tar.bz2
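You can check the size of the extracted test set; the count below assumes the first line of the file is the header:
wc -l test.csv
This should report 565893 lines, i.e. the 565892 test samples plus the header.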
The full test set has 565892 samples, so let us lower this to 10 samples (plus the header line) so that we can inspect the results more easily:
head -n 11 test.csv > test10.csv
and make a predict call:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"id\":\"Id\",\"separator\":\",\",\"scale\":true}},\"data\":[\"models/covert/test10.csv\"]}"
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":16.0,"service":"covert"},"body":{"predictions":[{"uri":"15121","loss":0.0,"classes":{"prob":0.9999997615814209,"cat":"6"}},{"uri":"15122","loss":0.0,"classes":{"prob":0.9962882995605469,"cat":"5"}},{"uri":"15130","loss":0.0,"classes":{"prob":0.9999340772628784,"cat":"1"}},{"uri":"15123","loss":0.0,"classes":{"prob":1.0,"cat":"3"}},{"uri":"15124","loss":0.0,"classes":{"prob":1.0,"cat":"3"}},{"uri":"15128","loss":0.0,"classes":{"prob":1.0,"cat":"1"}},{"uri":"15125","loss":0.0,"classes":{"prob":0.9999998807907105,"cat":"3"}},{"uri":"15126","loss":0.0,"classes":{"prob":0.7535045146942139,"cat":"3"}},{"uri":"15129","loss":0.0,"classes":{"prob":0.9999986886978149,"cat":"1"}},{"uri":"15127","loss":0.0,"classes":{"prob":1.0,"cat":"1"}}]}}
In the results above:
* uri is the Id of the sample in the test set
* prob is the probability of the predicted class, i.e. the class with the highest probability
So for instance, sample 15121 was predicted as being of forest cover type 6 with probability 0.99. Do not forget that we did use a label_offset when training the service. So 6 here corresponds to class 7 on page https://www.kaggle.com/c/forest-cover-type-prediction/data, which is a Krummholz cover type.
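If the single-line JSON response is difficult to read, it can be piped through a formatter, assuming Python is available on the client machine:
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"id\":\"Id\",\"separator\":\",\",\"scale\":true}},\"data\":[\"models/covert/test10.csv\"]}" | python -m json.tool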
Prediction from in-memory data
curl -X POST "http://localhost:8080/predict" -d "{\"service\":\"covert\",\"parameters\":{\"input\":{\"connector\":\"csv\",\"scale\":true,\"min_vals\":[1863,0,0,0,-146,0,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],\"max_vals\":[3849,360,52,1343,554,6890,254,254,248,6993,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}},\"data\":[\"2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0\"]}"
Note that, in contrast with the file-based prediction above, the in-memory data string has no header line and no Id column: it holds the raw feature values only. Also, since the connector has no file from which to derive the scaling bounds, the per-column min_vals and max_vals used for scaling are passed explicitly along with scale.