Training a model from text

This tutorial walks you through training and using a machine learning neural network model to classify newsgroup posts into twenty different categories. It uses the 20 Newsgroups dataset, a classic dataset in machine learning, often used for educational purposes.

In summary, the dataset is a repository containing 20 directories of text files, one per newsgroup, each file being a single post.

Getting the dataset

Let us create a dedicated repository:


mkdir models
mkdir models/n20

The data can be obtained from http://www.deepdetect.com/dd/examples/all/n20/news20.tar.bz2:


cd models/n20
wget http://www.deepdetect.com/dd/examples/all/n20/news20.tar.bz2
tar xvjf news20.tar.bz2

You can take a look at the raw data:


less news20/sci_crypt/000000616.eml

There are around 20000 files in the dataset.
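
To double-check the layout, you can list the category directories and count the files; a quick sanity check with standard shell tools:


ls news20
find news20 -type f | wc -l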

Creating the machine learning service

The first step with DeepDetect is to start the server, here assuming a Docker container:


docker run -d -p 8080:8080 -v /path/to/models:/opt/models/ jolibrain/deepdetect_cpu
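
Once the container is up, a quick sanity check is to query the server's information resource, which reports the server version and the list of existing services:


curl -X GET "http://localhost:8080/info"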

Next, create a machine learning service that uses a multi-layer perceptron (MLP) with two hidden layers of 200 neurons each and ReLU activations:


curl -X PUT "http://localhost:8080/services/n20" -d '{
       "mllib":"caffe",
       "description":"newsgroup classification service",
       "type":"supervised",
       "parameters":{
         "input":{
           "connector":"txt"
         },
         "mllib":{
           "template":"mlp",
           "nclasses":20,
           "layers":[200,200],
           "activation":"relu"
         }
       },
       "model":{
         "templates":"../templates/caffe/",
         "repository":"/opt/models/n20"
       }
     }'
yields:

{
  "status":{
    "code":201,
    "msg":"Created"
  }
}
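
You can verify that the service is up by querying it back; the response echoes the service description and configuration:


curl -X GET "http://localhost:8080/services/n20"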

Training and testing the service

Let us now train a statistical model in the form of the neural network defined above. Below is the full API call for launching an asynchronous training job on the GPU (with automatic fallback to the CPU if no GPU is present). Take a look at it and, before proceeding, review its main options detailed below. We train on 80% of the dataset and test on the remaining 20%.


curl -X POST "http://localhost:8080/train" -d '{
       "service":"n20",
       "async":true,
       "parameters":{
         "mllib":{
           "gpu":true,
           "solver":{
             "iterations":2000,
             "test_interval":200,
             "base_lr":0.05
           },
           "net":{
             "batch_size":300
           }
         },
         "input":{
           "shuffle":true,
           "test_split":0.2,
           "min_count":10,
           "min_word_length":5,
           "count":false
         },
         "output":{
           "measure":["mcll","f1"]
         }
       },
       "data":["models/n20/news20"]
     }'

First and foremost, we are using our newly created service to train a model. This means that the service will be busy for some time, and we cannot use it for anything other than reviewing the training call's status and progress. Other services, if any, would of course remain available. In more detail:

  • async starts a non-blocking (i.e. asynchronous) call
  • gpu tells the server to use the GPU. Importantly, note that in the absence of a GPU, the server automatically falls back on the CPU, without warning
  • iterations is the number of training iterations after which training terminates automatically. Until termination it is possible to get the status and progress of the call, as demonstrated below
  • min_count filters out the words that occur fewer than the specified number of times in the dataset
  • min_word_length filters out the words shorter than the specified number of characters
  • count determines whether each word feature holds the word's occurrence count, or only a binary 0/1 presence indicator
  • measure lists the assessment metrics of the model being built, mcll for multi-class log loss and f1 for F1-score
  • data holds the location of the dataset repository

For more details on the training phase options and parameters, see the API.

Let us now run the call above; the immediate answer is:


{
  "status":{
    "code":201,
    "msg":"Created"
  },
  "head":{
    "method":"/train",
    "job":1,
    "status":"running"
  }
}
indicating that the call was successful and the training is now running.
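
Should you need to interrupt the training before it completes, the same /train resource accepts a DELETE call on the service and job:


curl -X DELETE "http://localhost:8080/train?service=n20&job=1"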

You can get the status of the call anytime with another call:


curl -X GET "http://localhost:8080/train?service=n20&job=1"
yields:

{
  "status":{
    "msg": "OK",
    "code": 200
  },
  "body":{
    "parameters":{
      "mllib":{
        "batch_size": 359
      }
    },
    "measure":{
      "f1": 0.8919178423728972,
      "train_loss": 0.0016851313412189484,
      "mcll": 0.5737156999301365,
      "recall": 0.8926410552973584,
      "iteration": 1999.0,
      "precision": 0.8911958003860988,
      "accp": 0.8936339522546419
    }
  },
  "head":{
    "status": "finished",
    "job": 1,
    "method": "/train",
    "time": 541.0
  }
}
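
Since training runs asynchronously, you can also poll the status until the job reports finished. Below is a minimal shell sketch of such a loop; the 10-second interval is an arbitrary choice:


while true; do
  status=$(curl -s -X GET "http://localhost:8080/train?service=n20&job=1")
  echo "$status"
  # stop polling once the job status reads "finished"
  echo "$status" | grep -q '"finished"' && break
  sleep 10
done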

Using the service

You can get predictions on text files and raw text very easily:


curl -X POST 'http://localhost:8080/predict' -d '{
     "service":"n20",
     "parameters":{
       "mllib":{
         "gpu":true
       }
     },
     "data":["my computer runs linux"]
     }'
yields:

{
  "status":{
    "code":200,
    "msg":"OK"
  },
  "head":{
    "method":"/predict",
    "time":226.0,
    "service":"n20"
  },
  "body":{
    "predictions":{
      "uri":"0",
      "classes":{
        "last":true,
        "prob":0.3948741555213928,
        "cat":"comp_graphics"
      }
    }
  }
}
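
By default only the top class is returned. To get the top-N categories with their probabilities, pass the best output parameter; for instance, for the three most likely classes:


curl -X POST 'http://localhost:8080/predict' -d '{
     "service":"n20",
     "parameters":{
       "output":{
         "best":3
       }
     },
     "data":["my computer runs linux"]
     }'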

Restarting and using the service

If the DeepDetect server is restarted, its services need to be re-created. The call below re-creates the n20 service on top of the already trained model:

curl -X PUT "http://localhost:8080/services/n20" -d '{
       "mllib":"caffe",
       "description":"newsgroup classification service",
       "type":"supervised",
       "parameters":{
         "input":{
           "connector":"txt"
         },
         "mllib":{
           "nclasses":20
         }
       },
       "model":{
         "repository":"/opt/models/n20"
       }
     }'

The call above no longer specifies the template parameter, since the model architecture has already been defined and trained.
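
You can then query the service exactly as before, on raw text or, as noted above, on text files; a sketch, assuming the dataset path is visible from inside the container:


curl -X POST 'http://localhost:8080/predict' -d '{
     "service":"n20",
     "data":["/opt/models/n20/news20/sci_crypt/000000616.eml"]
     }'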
