Training XGBoost from CSV

This tutorial shows how to train gradient boosted trees over a dataset in CSV format. Within the DeepDetect server, gradient boosted trees, an ensemble of decision trees, are a very powerful and often faster alternative to deep neural networks.

Typically:

  • they are easier to train, and often yield excellent results without much tuning
  • they capture patterns that can differ from those commonly captured by deep nets
  • they are in many cases less sensitive to missing and very noisy data

Note: gradient boosted trees via XGBoost train on CPU only (GPU support is experimental). Training is very efficient, makes use of all available cores, and supports very large datasets.

Summary

This tutorial walks you through training and using an XGBoost model to estimate the forest cover type of land patches. It makes use of the well-known ‘Cover Type’ dataset, as presented in the Kaggle competition https://www.kaggle.com/c/forest-cover-type-prediction. It is identical to the tutorial on training neural networks on this same dataset, but shows how to use boosted trees instead.

In summary, a CSV file contains numerical data about patches of forest land, and we will build a model that estimates the cover type of each patch among 7 categories (e.g. spruce/fir, aspen, …). See https://www.kaggle.com/c/forest-cover-type-prediction/data for an explanation of the data themselves.

Getting the dataset

Let us create a dedicated repository:


mkdir models
mkdir models/covert

The data can be obtained either from Kaggle or from http://www.deepdetect.com/dd/examples/all/forest_type/train.csv.tar.bz2


cd models/covert
wget http://www.deepdetect.com/dd/examples/all/forest_type/train.csv.tar.bz2
tar xvjf train.csv.tar.bz2

You can take a look at the raw data:


head -n 5 train.csv

The ‘Id’ field holds the identifier of every training example, and ‘Cover_Type’ holds the reference label, i.e. the forest cover type, from 1 to 7. There are a total of 15120 training examples.
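
You can double-check the number of samples by counting the lines of the file; the extra line is the CSV header:


wc -l train.csv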

Creating the machine learning service

The first step with DeepDetect is to start the server, via Docker:


docker run -d -p 8080:8080 -v /path/to/models/:/opt/models/ jolibrain/deepdetect_cpu
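
You can check that the server is up by querying its info resource:


curl -X GET "http://localhost:8080/info"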

Then create a machine learning service that uses boosted trees:


curl -X PUT "http://localhost:8080/services/covert" -d '{
       "mllib":"xgboost",
       "description":"forest classification service",
       "type":"supervised",
       "parameters":{
         "input":{
           "connector":"csv"
         },
         "mllib":{
           "nclasses":7
         }
       },
       "model":{
         "repository":"/opt/models/covert"
       }
     }'
yields:

{
  "status":{
    "code":201,
    "msg":"Created"
  }
}
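
At any point you can review the service configuration and status with a GET on the service resource:


curl -X GET "http://localhost:8080/services/covert"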

Training and testing the service

Let us now train a statistical model in the form of the gradient boosted trees defined above. Below is the full API call for launching an asynchronous training job on the CPU. Take a look at it; we review its details below before proceeding with the call.


curl -X POST "http://localhost:8080/train" -d '{
       "service":"covert",
       "async":true,
       "parameters":{
         "mllib":{
           "iterations":100,
           "test_interval":10,
           "objective":"multi:softprob"
         },
         "input":{
           "label_offset":-1,
           "label":"Cover_Type",
           "id":"Id",
           "separator":",",
           "shuffle":true,
           "test_split":0.1
         },
         "output":{
           "measure":["acc","mcll","f1"]
         }
       },
       "data":["/opt/models/covert/train.csv"]
     }'

First and foremost, we are using our newly created service to train a model. This means that our service will be busy for some time, and we cannot use it for anything else but reviewing the status and progress of the training call. Other services, if any, would of course remain available. In more detail:

  • async allows starting a non-blocking (i.e. asynchronous) call
  • iterations is the number of training iterations after which the training will terminate automatically. Until termination it is possible to get the status and progress of the call, as we will demonstrate below
  • label_offset tells the CSV input connector that the label identifiers run from 1 to 7 instead of 0 to 6. This is required here so that no class is missed
  • label identifies the reference label column from the CSV dataset
  • id is the column identifier of the samples
  • test_split tells the input connector to keep 90% of the training set for training and 10% for assessing the quality of the model being built
  • shuffle tells the input connector to shuffle both the training and testing sets; this is especially useful for cross-validation
  • measure lists the assessment metrics of the model being built: acc is accuracy, mcll is multi-class log loss and f1 is the F1-score
  • data holds the dataset file

For more details on the training phase options and parameters, see the API.
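
Depending on your DeepDetect version, the mllib object may also accept standard XGBoost hyperparameters such as eta (the learning rate) and max_depth. The sketch below is illustrative only; the two extra values are assumptions, not tuned settings:


curl -X POST "http://localhost:8080/train" -d '{
       "service":"covert",
       "async":true,
       "parameters":{
         "mllib":{
           "iterations":100,
           "test_interval":10,
           "objective":"multi:softprob",
           "eta":0.3,
           "max_depth":8
         },
         "input":{
           "label_offset":-1,
           "label":"Cover_Type",
           "id":"Id",
           "separator":",",
           "shuffle":true,
           "test_split":0.1
         },
         "output":{
           "measure":["acc","mcll","f1"]
         }
       },
       "data":["/opt/models/covert/train.csv"]
     }'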

Let us now run the call above; the immediate answer is:


{
  "status":{
    "code":201,
    "msg":"Created"
  },
  "head":{
    "method":"/train",
    "job":1,
    "status":"running"
  }
}
indicating that the call was successful and the training is now running.

You can get the status of the call anytime with another call:


curl -X GET "http://localhost:8080/train?service=covert&job=1"

{
  "status":{
    "code":200,
    "msg":"OK"
  },
  "head":{
    "method":"/train",
    "job":1,
    "status":"finished",
    "time":61.0
  },
  "body":{
    "measure":{
      "train_loss":0.6788941025733948,
      "mcll":0.6393973624892094,
      "recall":0.7269925057270527,
      "iteration":20.0,
      "precision":0.7266408723876882,
      "f1":0.7268166465273875,
      "accp":0.7275132275132276,
      "acc":0.7275132275132276
    }
  }
}

Below is what the GET /train call yields once the training has finished:


curl -X GET "http://localhost:8080/train?service=covert&job=1"

{
  "status":{
    "code":201,
    "msg":"Created"
  },
  "body":{
    "model":{
      "repository":"/path/to/projects/deepdetect/models/covert"
    },
    "parameters":{
      "input":{}
    },
    "measure":{
      "mcll":0.37747302382271999,
      "recall":0.848371708766895,
      "iteration":99.0,
      "precision":0.8512590836363044,
      "f1":0.8498129436296431,
      "accp":0.8531746031746031,
      "acc":0.8531746031746031
    }
  },
  "head":{
    "method":"/train",
    "time":5.0
  }
}

The final quality of the model can be read as 85.3% accuracy on the testing portion of the dataset.

The status call can be repeated as needed until the status indicates that the training is finished, after which the job is deleted.
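
Should you need to interrupt a training job before it terminates, the same resource accepts a DELETE call:


curl -X DELETE "http://localhost:8080/train?service=covert&job=1"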

The trained model is now available on disk in the models/covert repository. If you turn the server off or delete the service without wiping out the files, you will be able to use the trained model from another service.
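
As a sketch, assuming the model files are still in /opt/models/covert, a new service (covert2 is a hypothetical name) would pick up the trained model with the same creation call as before:


curl -X PUT "http://localhost:8080/services/covert2" -d '{
       "mllib":"xgboost",
       "description":"forest classification service",
       "type":"supervised",
       "parameters":{
         "input":{
           "connector":"csv"
         },
         "mllib":{
           "nclasses":7
         }
       },
       "model":{
         "repository":"/opt/models/covert"
       }
     }'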

For now, however, we show below how to use the current service and model to make predictions from new data.

Prediction for new data

The service is ready for the predict resource of the API to be used.

Prediction from file

The test data file can be obtained either from Kaggle or from http://www.deepdetect.com/dd/examples/all/forest_type/test.csv.tar.bz2


cd models/covert
wget http://www.deepdetect.com/dd/examples/all/forest_type/test.csv.tar.bz2
tar xvjf test.csv.tar.bz2

The full test set has 565892 samples, so let us reduce it to 10 samples (plus the header line) so we can inspect the results more easily:


head -n 11 test.csv > test10.csv

and make a predict call, passing the id and separator input parameters:


curl -X POST "http://localhost:8080/predict" -d '{
       "service":"covert",
       "parameters":{
         "input":{
           "id":"Id",
           "separator":","
         }
       },
       "data":["/opt/models/covert/test10.csv"]
     }'

{
  "status":{
    "code":200,
    "msg":"OK"
  },
  "head":{
    "method":"/predict",
    "service":"covert",
    "time":6.0
  },
  "body":{
    "predictions":[
      {
        "uri":"15121",
        "classes":{
          "last":true,
          "prob":0.7691145539283752,
          "cat":"4"
        }
      },
      {
        "uri":"15122",
        "classes":{
          "last":true,
          "prob":0.5773618221282959,
          "cat":"4"
        }
      },
      {
        "uri":"15123",
        "classes":{
          "last":true,
          "prob":0.5504216551780701,
          "cat":"0"
        }
      },
      {
        "uri":"15124",
        "classes":{
          "last":true,
          "prob":0.6419171094894409,
          "cat":"0"
        }
      },
      {
        "uri":"15125",
        "classes":{
          "last":true,
          "prob":0.6680205464363098,
          "cat":"0"
        }
      },
      {
        "uri":"15126",
        "classes":{
          "last":true,
          "prob":0.6034393906593323,
          "cat":"0"
        }
      },
      {
        "uri":"15127",
        "classes":{
          "last":true,
          "prob":0.47125744819641116,
          "cat":"4"
        }
      },
      {
        "uri":"15128",
        "classes":{
          "last":true,
          "prob":0.45334020256996157,
          "cat":"4"
        }
      },
      {
        "uri":"15129",
        "classes":{
          "last":true,
          "prob":0.6540985107421875,
          "cat":"0"
        }
      },
      {
        "uri":"15130",
        "classes":{
          "last":true,
          "prob":0.5556557178497315,
          "cat":"0"
        }
      }
    ]
  }
}

In the results above:

  • uri is the Id of the sample in the test set
  • cat is the predicted class
  • prob is the probability of the predicted class (the class with the highest probability)

So for instance, sample 15121 was predicted as being of forest cover type 4 with probability 0.769. Do not forget that we used a label_offset when training the service: predicted categories run from 0 to 6, while the classes on https://www.kaggle.com/c/forest-cover-type-prediction/data run from 1 to 7. So cat 0 here corresponds to class 1, a Spruce/Fir cover type, and cat 4 corresponds to class 5, Aspen.

Prediction from in-memory data


curl -X POST "http://localhost:8080/predict" -d '{
       "service":"covert",
       "parameters":{
         "input":{
           "connector":"csv"
         }
       },
       "data":["2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0"]
     }'

{
  "status":{
    "code":200,
    "msg":"OK"
  },
  "head":{
    "method":"/predict",
    "service":"covert",
    "time":0.0
  },
  "body":{
    "predictions":{
      "uri":"1",
      "classes":{
        "last":true,
        "prob":0.9539504647254944,
        "cat":"2"
      }
    }
  }
}

Importantly, note that in the call above there is no mention of the Id field, and that the Id value has been stripped from the data row. This is because in training mode datasets often hold an id per training sample, whereas when predicting it is less common.
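
Note that the data array should be able to hold several rows at once, one CSV string per sample, returning one prediction per row. The sketch below simply sends the same row twice for illustration:


curl -X POST "http://localhost:8080/predict" -d '{
       "service":"covert",
       "parameters":{
         "input":{
           "connector":"csv"
         }
       },
       "data":["2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0","2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0"]
     }'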
