A Set of Character-Based Models for Sentiment Analysis, Ad Blocking and other tasks

Below is a range of character-based deep convolutional neural networks that are free, even for commercial use in your applications. These models have been trained on various corpora, for tasks ranging from sentiment analysis in many languages to advertising link classification from just reading a URL. They should accommodate a range of applications. Training your own models is easy too, and can open up even more avenues. Tips and tricks for training are included at the bottom of this page.

These new models read text at the character level, and have very nice properties:

  • They don’t break textual content into words, and thus do not require any specific parsing
  • Character-based models find patterns within the character streams and thus do not suffer from having a limited vocabulary: they are especially attractive for user-generated content with typos and new vocabulary, as well as for a set of novel tasks such as replacing regex rules for ad blocking purposes
  • Models are very compact, e.g. 9.1MB for a typical sentiment or URL classification task trained over several million samples
  • Models yield excellent results, beating bag-of-words (BOW) models on large datasets
  • Models can be finetuned from a task A with large corpus to a more targeted task B with smaller corpus

On the other hand, these convolutional models:

  • Often take longer to train than BOW models
  • Require large datasets, from several thousand up to millions of samples

In addition to the directly usable models, this page contains information and results from experiments on character-based models.

Character-based Models

Character-based models supported by DeepDetect are of the kind recently introduced by the papers Character-level Convolutional Networks for Text Classification and Text Understanding from Scratch. These are great papers by Zhang, Zhao and LeCun that share all the practical details required to reproduce and benefit from the research work, which deserves to be emphasized!

These new models are a paradigm change for a series of tasks. For this reason DeepDetect now supports training and using such character-based models directly from the API and in a fully Open Source manner. All instructions on how to train and use the models are below.

Parameters

There are basically three important parameters to the character-based nets:

  • sequence length: this is the fixed size of the character stream that represents every training (and testing/prediction) sample. Below we vary it from 50 to above 1000, depending on the data. The sequence length directly affects the memory and computation required to train a model. Once the length has been specified, the model reads characters backwards from the end of the submitted piece of text, and any character beyond the length is discarded.
  • alphabet: this is the set of characters the model knows; any character outside of it is treated as whitespace (and encoded as a vector of zeros). The size of the alphabet has little impact on the memory and computation requirements.
  • convolutions: the shape and structure of the network can vary. DeepDetect supports convolutional network templates, and the reference network shape here is that of the original papers, "layers":["1CR256", "1CR256", "4CR256", "1024", "1024"], which means two 256-feature convolutions with ReLU and pooling, followed by four convolutions then ReLU and pooling, and finally two fully connected layers of size 1024. See the neural network templates API for more details.
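
To make the sequence length and alphabet parameters more concrete, here is a minimal sketch of character quantization as described above. It is not DeepDetect's actual implementation: the function name quantize is hypothetical, and the one-hot encoding, the absence of lowercasing and the zero-padding of short texts are assumptions of this sketch.

```python
def quantize(text, alphabet, sequence):
    """Turn `text` into `sequence` rows, one row per character.

    Assumptions of this sketch: only the last `sequence` characters are
    kept, they are read backwards, each known character becomes a one-hot
    row over `alphabet`, and unknown characters or padding become
    all-zero rows.
    """
    if sequence <= 0:
        raise ValueError("sequence must be a positive integer")
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Keep only the tail of the text, then read it backwards.
    tail = text[-sequence:][::-1]
    rows = []
    for ch in tail:
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    # Pad short texts with zero rows up to the fixed length.
    while len(rows) < sequence:
        rows.append([0] * len(alphabet))
    return rows


alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?'"
matrix = quantize("that s good", alphabet, sequence=140)
# One row per position in the sequence, one column per alphabet character.
print(len(matrix), len(matrix[0]))
```

Whatever the length of the input text, the result always has the same shape, which is why the sequence length drives memory usage while unknown characters simply leave their row empty.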

Experiments

Literature

Two of the examples reported in the research papers above are replicated with DeepDetect, using the AGNews and DBPedia datasets. The datasets have been arranged in repositories and are available for download below so that others can replicate and learn from them:

  • AGNews

    • number of classes: 4
    • alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?:’\“/\|_@#$%^&*~`+-=<>()[]{}
    • dataset size: 128K
    • accuracy: 85%
    • dataset download: agnews_data.tar.bz2
  • DBPedia

    • number of classes: 14
    • alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?:’\“/\|_@#$%^&*~`+-=<>()[]{}
    • dataset size: 630K
    • accuracy: 97%
    • dataset download: dbpedia_data.tar.bz2

Results above are from random test splits of about the same size as the test sets reported in the paper. For the record, on the aclimdb dataset, a slightly simpler network reaches around 86% accuracy.

Customers

On two customer datasets subject to user typos and free-form writing, we obtain similar or better performance than with BOW or AdaBoost + random forests, with the added benefit of being able to query any word.

No more unknown words

Here is an example testing good and gooooooood with the sent_en_char English sentiment model available further down this page:

curl -X POST 'http://localhost:8080/predict' -d '{"service":"sent_en","parameters":{"mllib":{"gpu":true}},"data":["that s good"]}'
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":105.0,"service":"sent_en"},"body":{"predictions":{"uri":"0","classes":{"prob":0.8447127342224121,"last":true,"cat":"positive"}}}}

Sentence that s good is positive, gooood.

curl -X POST 'http://localhost:8080/predict' -d '{"service":"sent_en","parameters":{"mllib":{"gpu":true}},"data":["that s gooooooood"]}'
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":103.0,"service":"sent_en"},"body":{"predictions":{"uri":"0","classes":{"prob":0.8044580221176148,"last":true,"cat":"positive"}}}}

Sentence that s gooooooood is also positive, that’s better!
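
The difference with a word-based model can be illustrated with a small sketch. The word vocabulary below is hypothetical and only stands in for what a BOW model would have seen at training time; the alphabet is the one of the English sentiment model.

```python
# Hypothetical vocabulary of a word-based (BOW) model.
word_vocabulary = {"that", "s", "good", "bad"}
# Alphabet of the English sentiment model listed on this page.
alphabet = set("abcdefghijklmnopqrstuvwxyz0123456789,;.!?'")


def unknown_words(text, vocabulary):
    """Words a BOW model could not look up in its vocabulary."""
    return [word for word in text.split() if word not in vocabulary]


def unknown_characters(text, alphabet):
    """Non-blank characters that fall outside the model's alphabet."""
    return sorted({ch for ch in text if ch not in alphabet and not ch.isspace()})


sentence = "that s gooooooood"
print(unknown_words(sentence, word_vocabulary))
print(unknown_characters(sentence, alphabet))
```

The stretched word is missing from the word vocabulary, while every one of its characters belongs to the alphabet, so the character-based model still has something to read.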

Novel task

Using the url_ads_mini model given below, URLs can be tested for ad-blocking purposes:

Testing www.deepdetect.com

curl -X POST 'http://localhost:8080/predict' -d '{"service":"urlads","parameters":{"mllib":{"gpu":true}},"data":["www.deepdetect.com"]}'
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":143.0,"service":"urlads"},"body":{"predictions":{"uri":"0","classes":{"prob":0.9565762281417847,"last":true,"cat":"other"}}}}

Predicted category is other, we’re safe.

curl -X POST 'http://localhost:8080/predict' -d '{"service":"urlads","parameters":{"mllib":{"gpu":true}},"data":["ads2.opensubtitles.org/1/www/delivery/ck.php?oaparams=2__bannerid=1__zoneid=3__cb=df11239894__oadest=http%3A%2F%2Fwww.opensubtitles.org%2Faddons%2Fa.php%3Fweblang%3Den%26file%3DWeeds%20Bash"]}'
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":105.0,"service":"urlads"},"body":{"predictions":{"uri":"0","classes":{"prob":0.5866389870643616,"last":true,"cat":"ads"}}}}

OK, the predicted category is ads (here the model in fact only used the last 150 characters to make its decision).

Notes on the provided models

  • The models are very good for building and testing an application pipeline that includes one or more deep neural networks. However, they should not be considered suitable for many high-accuracy production tasks: most models are rough, in the sense that their average accuracy can be low, typically for some sentiment tasks.

  • The models are free, even for commercial use. The training sets can unfortunately not be shared publicly. If you need datasets or help building models on your own datasets, contact us.

  • These models are primarily intended for use with DeepDetect, which relies on Caffe, but they can be converted. They can be used with Caffe alone, but you will need to build your own feeding and text quantization pipeline. If you’re using TensorFlow, see how to convert Caffe models to TensorFlow, and the similar conversion for Torch.

  • Not finding what you need, or in need of assistance? Let us know or report difficulties: our pipeline is automated, and some models can be built easily.

What applications are these models good for?

These models are good for text classification and URL qualification, for instance. They are especially useful for building and testing an application pipeline. Typically:

  1. Build up an application that uses one or more deep models
  2. Test the application on your production data
  3. The application can then be made more accurate by either finetuning the deep model or building a new more accurate one. You can do that or ask us for assistance, as needed.

As an example application, see how easy it is to build an image search engine with ElasticSearch; the same can be done with these text classifiers.

Requirements

The models work best on GPU but run fine on multi-core CPUs as well. DeepDetect is supported on Ubuntu 14.04 LTS but builds on other Linux flavors.

Model Usage

Below are instructions for setting up a classification service for a given model, from the command line and from the Python client. Importantly:

  • The number of classes nclasses needs to be specified at service creation. This is model-dependent, and the number of classes can be obtained from the list below or from the model.json file included in the model tarball.

  • The service name is for you to set; the examples below use the sent_en_char model and the service is named sent_en.

  • A batch of multiple text entries can be passed to the server at once for classification.

Steps for setting up a model service:

  1. Select & download a model tarball
  2. Uncompress it in the repository of your choice, e.g. /home/me/models/sent_en_char
  3. Build and run dede; this should take about 5 minutes on Ubuntu 14.04 LTS
  4. Use the code samples below to build your classification pipeline
  5. See the API for more details on the various parameters and options

Shell

Service creation:

curl -X PUT 'http://localhost:8080/services/sent_en' -d '{"mllib":"caffe","description":"English sentiment classification","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"alphabet":"abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\''","sequence":140},"mllib":{"nclasses":2}},"model":{"repository":"/home/me/models/sent_en_char"}}'
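
Note that the '\'' sequence in the command above is only shell quoting for the single quote that ends the alphabet. As a sketch, the same request body can be assembled in Python and serialized with the standard json module, which avoids the escaping altogether. The helper name service_creation_body is hypothetical; the fields and values are the ones of the curl command, and sending the body (a PUT on /services/sent_en) is left out.

```python
import json


def service_creation_body(repository, alphabet, sequence, nclasses, description=""):
    """Build the JSON body of a service creation call for a character-based model."""
    return {
        "mllib": "caffe",
        "description": description,
        "type": "supervised",
        "parameters": {
            "input": {
                "connector": "txt",
                "characters": True,
                "alphabet": alphabet,
                "sequence": sequence,
            },
            "mllib": {"nclasses": nclasses},
        },
        "model": {"repository": repository},
    }


body = service_creation_body(
    repository="/home/me/models/sent_en_char",
    alphabet="abcdefghijklmnopqrstuvwxyz0123456789,;.!?'",
    sequence=140,
    nclasses=2,
    description="English sentiment classification",
)
# The serialized string is what would be sent as the body of the PUT request.
payload = json.dumps(body)
print(payload)
```

The alphabet, sequence and nclasses values must match those of the model being loaded, as given in the model list on this page.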

Classification of a piece of text (a file can also be provided):

curl -X POST 'http://localhost:8080/predict' -d '{"service":"sent_en","parameters":{"mllib":{"gpu":true}},"data":["Chilling in the West Indies"]}'

Python

Service creation:

from dd_client import DD

model_repo = '/home/me/models/sent_en_char'
nclasses = 2

# setting up DD client
host = '127.0.0.1'
sname = 'sent_en'
description = 'English sentiment classification'
mllib = 'caffe'
dd = DD(host)
dd.set_return_format(dd.RETURN_PYTHON)

# creating ML service
model = {'repository':model_repo}
parameters_input = {'connector':'txt','characters':True,'sequence':140,'alphabet':"abcdefghijklmnopqrstuvwxyz0123456789,;.!?'"}
parameters_mllib = {'nclasses':nclasses}
parameters_output = {}
dd.put_service(sname,model,description,mllib,
               parameters_input,parameters_mllib,parameters_output)

Classifying a single piece of text:

parameters_input = {}
parameters_mllib = {}
parameters_output = {}
data = ['Chilling in the West Indies']
classif = dd.post_predict(sname,data,parameters_input,parameters_mllib,parameters_output)
print(classif)

Results

The output naturally comes in JSON form:

{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":104.0,"service":"sent_en"},"body":{"predictions":{"uri":"0","classes":{"prob":0.702814519405365,"last":true,"cat":"positive"}}}}

DeepDetect supports turning the JSON output into any custom format through output templates. See the example on how to push results into ElasticSearch without glue code.
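
When the raw JSON is consumed directly by an application, the predicted category and its probability can be extracted as in the sketch below. The function name top_classes is hypothetical. The response used is the one shown above; handling predictions and classes that come back as lists (for instance for batched requests) is an assumption of this sketch rather than something shown on this page.

```python
import json


def top_classes(response_text):
    """Return (uri, category, probability) for each prediction in a /predict response."""
    response = json.loads(response_text)
    if response["status"]["code"] != 200:
        raise RuntimeError("prediction failed: %s" % response["status"])
    predictions = response["body"]["predictions"]
    # Assumption: a single prediction may come as an object, several as a list.
    if isinstance(predictions, dict):
        predictions = [predictions]
    results = []
    for prediction in predictions:
        classes = prediction["classes"]
        if isinstance(classes, dict):
            classes = [classes]
        best = max(classes, key=lambda c: c["prob"])
        results.append((prediction["uri"], best["cat"], best["prob"]))
    return results


response_text = '{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","time":104.0,"service":"sent_en"},"body":{"predictions":{"uri":"0","classes":{"prob":0.702814519405365,"last":true,"cat":"positive"}}}}'
for uri, category, probability in top_classes(response_text):
    print(uri, category, probability)
```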

Character-Based Text Classification Models

Below are ready-to-use models for a variety of tasks, including URL classification, which is a novel task we introduce here.

List of Character-Based Text Classification Models

  • English Sentiment

    • number of classes: 2
    • alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?’
    • sequence: 140
    • dataset size: 5.5M
    • download: sent_en_char.tar.bz2
  • French Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz|~«°´»¿àáâãçèéêëíîïñóôöùúûüğا—‘’“”•…€
    • sequence: 140
    • dataset size: 630K
    • download: sent_fr_char.tar.bz2
  • Arabic Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`»ו،؛؟ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيًٌٍَُِّْٓ٠١٢٣٤٥٦٧٨٩٪ٰٱٺپچڕڤکڪگڳںھۅۆۈۉۊیﭑﭠﮪﮬﯙﯾﷲﺀﺁﺂﺃﺄﺇﺈﺍﺎﺏﺐﺑﺒﺓﺔﺕﺖﺗﺘﺟﺠﺣﺤﺧﺨﺩﺪﺫﺬﺭﺮﺰﺳﺴﺷﺸﺻﺼﺿﻃﻄﻊﻋﻌﻏﻐﻓﻔﻗﻘﻚﻛﻜﻝﻞﻟﻠﻡﻢﻣﻤﻥﻦﻧﻨﻩﻪﻫﻬﻭﻮﻰﻱﻲﻳﻴﻵﻷﻻﻼ
    • sequence: 70
    • dataset size: 425K
    • download: sent_ar_char.tar.bz2
  • Japanese Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`{|}〜ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわをん゛゜ゝァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセソタダチッツテデトドナニネノハバパヒビピフブプヘベペホボポマミムメモャヤュョラリルレロワン・ーヽヾㄟ一丈三上下不世中久乗予事二京人今仕付代以仲休会位体何作使保信俺像僕優元兄先入全公内写出分切初制前力加勉動化半卒原参友取受口可合同名君告味呼四回国土地垢報場声売変外多夜夢大天太夫女好始嫌嬉子字学安定実家寒寝対小少局屋山川帰年幸度式引弱張強当彡待後心必忘応怖思急性恐悪情想意愛感態我戦所手抜持描放教数敵文料新方日早明春昨昼時普曜書最月有服朝期本来東校格業楽様機次欲歌止死残毎気水泣活流消無然爆物現理生用田申男画界番疲痛発白的目直相真眠着知確神私空立笑粉素終結絡絵絶緒美考者聞腹自良色花苦茶萌落葉血行表要見覚解言訳話誕語読誰調買赤起越足身車輩辛込近返送通連週遅遊過道達違遠部配野金長開間関限難雪電震青面音頑頭題顔願風飛食飯飲験高髪魔黄黒!、・ァィゥェォャュョッーアイウエオカキクケコサシスタチテトナニノハヒフヘホマミムヤラリルロワン゙゚
    • sequence: 50
    • dataset size: 140K
    • download: sent_ja_char.tar.bz2
  • German Sentiment

    • number of classes: 2
    • alphabet: !\“%&’*+,-./0123456789:;<=>?@^_abcdefghijklmnopqrstuvwxyz~´ßáãäçéöüğışนอา—“”…€
    • sequence: 50
    • dataset size: 80K
    • download: sent_de_char.tar.bz2
  • Spanish Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¦¨ª«¬®°´·º»¿×àáâãäçèéêìíðñòóôõöùúüğıńňş–—―‘’“”•…€
    • sequence: 70
    • dataset size: 3M
    • download: sent_es_char.tar.bz2
  • Russian Sentiment

    • number of classes: 2
    • alphabet: !\“#%&’*+,-./0123456789:;<=>?@^_abcdefghijklmnopqrstuvwxyz«»абвгдежзийклмнопрстуфхцчшщъыьэюяёєі“”
    • sequence: 50
    • dataset size: 95K
    • download: sent_ru_char.tar.bz2
  • Thai Sentiment

    • number of classes: 2
    • alphabet: !\“#&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz|~´ωกขคฆงจฉชซญฐณดตถทธนบปผฝพฟภมยรฤลวศษสหฬอฮะัาำิีึืุูเแโใไๆ็่้๊๋์ํ๑“”•
    • sequence: 50
    • dataset size: 84K
    • download: sent_th_char.tar.bz2
  • Italian Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz|~¡°´¿àáãçèéêìíñòóôöùúüğış—“”
    • sequence: 90
    • dataset size: 185K
    • download: sent_it_char.tar.bz2
  • Turkish Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz{|}~ª´ßáâãçéíîñóöûüğışəɨα‎—‘’“”
    • sequence: 50
    • dataset size: 470K
    • download: sent_tk_char.tar.bz2
  • Portuguese Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz|~¡ª°´º¿àáâãçèéêíñòóôõöúüğışا—‘’“”•…€
    • sequence: 110
    • dataset size: 190K
    • download: sent_pt_char.tar.bz2
  • Czech Sentiment

    • number of classes: 2
    • alphabet: !\“’*+,-./0123456789:;=?@_`abcdefghijklmnopqrstuvwxyz´áãçéêíóôúüýčďěıňřşšůž
    • sequence: 50
    • dataset size: 30K
    • download: sent_cs_char.tar.bz2
  • Finnish Sentiment

    • number of classes: 2
    • alphabet: !\“$&’*+,-./0123456789:;<=>?@^_`abcdefghijklmnopqrstuvwxyz|~ª´áãäçéêíñóöüğışɨ—“”
    • sequence: 50
    • dataset size: 65K
    • download: sent_fi_char.tar.bz2
  • Indonesian Sentiment

    • number of classes: 2
    • alphabet: !\“#$%&’*+,-./0123456789:;<=>?@[]^_`abcdefghijklmnopqrstuvwxyz{|}~¡£¤¥§¨©ª«¬®¯°±²³´µ·¸º»½¿ßàáâãäåçèéêëìíîïðñóôõöùúûüýþÿāăąđēĕęĝğġģĥħĩīĭıĵķĸľňʼnŋōőśŝşšŧũūŭůųŵŷƌƍƙƚƞơƥƨƪƭƴƿǎǐǘǚǝǟǥǧǩǰǻȃȋɐɑɓɔə
    • sequence: 140
    • dataset size: 4.2M
    • download: sent_id_char.tar.bz2
  • Korean Sentiment

    • number of classes: 2
    • alphabet: “!\“#‘()*+,-./0123456789:;<=>?@[]^_abcdefghijklmnopqrstuvwxyz}~·“”가각간갈감갑강같개거걱건걸것게겠겨격결경계고곤공과관괜교구국군굿궁귀그근글금 급기긴길김까꺼께꼭꾸꿈끝나난날남내냐냥너널넘네녀녁년념녕노놀놓누눈느는늘능늦니닌님다단달담답당대더덕던데도독돌동돼되된될됩두드든들듯등디따때떻또똑뜻라락란람랑래랜러런럼렇레려력렸로록론료루르른를름리린릴림립마막만많말맙맛망맞매머먹멋메멘며면명모목몬몰몸못무문물뭐미민바박반받발밤밥방배백버번벌범베벤변별보복본볼봄봇봐봤부분불브블비빠빨쁘쁜사산살상새색생서선설성세셔션셨소속손송수순쉬슈스슨슬습승시식신실싫심십싶싸써쓰씨아안않알았앞애야약양어언얼엄업없었에엔여역연열였영예오온올와완왔왜외요용우운울웃워원월위윗유으은을음응의이인일임입잇있자작잖잘잠장재쟁저적전절점정제져졌조존좀종좋죄죠주준줄중줘즈즐즘증지직진질집짜짱째쪽찌찍차착찬찮참창찾책처천첨청체쳐초최추축출취치친침카커케코콘퀴크키타탁태터테텐토통투트틀티팅파판팔팬퍼페편포표풀품프픈플피필하학한할함합항해했행헤현형호혼홍화확환활회후휴흐희히힘”
    • sequence: 50
    • dataset size: 20K
    • download: sent_ko_char.tar.bz2
  • English Movie Reviews

    • number of classes: 2
    • alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?:’\“/\|_@#$%^&*~`+-=<>()[]{}
    • sequence: 1014
    • dataset size: 43K
    • download: movr.tar.bz2
  • URL ads

    • number of classes: 2
    • alphabet: abcdefghijklmnopqrstuvwxyz0123456789,;.!?’
    • sequence: 150
    • dataset size: 4M
    • download: url_ads_mini.tar.bz2
    • tips: strip or encode the http://, https:// and other tcp:// prefixes of URLs passed to dede; otherwise the server tries to resolve the URL instead of passing it to the model.
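
As a sketch of the tip above, the scheme prefix can be removed on the client side before a URL is sent for classification. The helper name strip_scheme and the regular expression are illustrative choices, not part of DeepDetect.

```python
import re

# Matches a leading scheme such as http://, https:// or tcp://
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def strip_scheme(url):
    """Remove a leading scheme so that the server does not try to resolve the URL."""
    return _SCHEME.sub("", url, count=1)


urls = ["http://www.deepdetect.com", "www.deepdetect.com"]
data = [strip_scheme(url) for url in urls]
# `data` can then be used as the "data" field of the /predict call.
print(data)
```

Only a prefix at the very start of the string is removed, so an encoded URL embedded in a query string (such as http%3A%2F%2F in the ads example earlier on this page) is left untouched.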

Character-Based Model Training

Below is a sample script used for training the models above. Along with the full API, it allows training a wide range of models with little effort.

The textual input dataset can come in two forms:

  • a repository with a sub-repository per class, in which each training sample comes as a file;
  • a repository with a sub-repository per class, in which one or more files contain one sample per line. In this particular case, the parameter sentences:true needs to be passed to the input connector parameters at training time.
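
The second layout can be produced with a few lines of Python. This is a minimal sketch: the function name write_dataset, the file name samples.txt and the sample texts are hypothetical, and it assumes that the sub-repository names serve as class labels.

```python
import os
import tempfile


def write_dataset(root, samples_by_class, filename="samples.txt"):
    """Write one sub-repository per class, with one sample per line in a single file."""
    for label, samples in samples_by_class.items():
        class_dir = os.path.join(root, label)
        os.makedirs(class_dir, exist_ok=True)
        with open(os.path.join(class_dir, filename), "w", encoding="utf-8") as out:
            for sample in samples:
                # Collapse whitespace so that a sample never spans several lines.
                out.write(" ".join(sample.split()) + "\n")


root = tempfile.mkdtemp()
write_dataset(root, {
    "positive": ["that s good", "that s\ngooooooood"],
    "negative": ["not my cup of tea"],
})
print(sorted(os.listdir(root)))
```

The resulting root directory is what would be passed as the training repository, together with sentences:true in the input connector parameters.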
# -*- coding: utf-8 -*-

import sys, os, time, argparse
from dd_client import DD

parser = argparse.ArgumentParser(description='Text model training tool')
parser.add_argument('--model-repo',help='location of the model')
parser.add_argument('--training-repo',help='location of the training files')
parser.add_argument('--sname',help='service name')
parser.add_argument('--tsplit',help='training split between 0 and < 1',type=float,default=0.01)
parser.add_argument('--base-lr',help='initial learning rate',default=0.01,type=float)
parser.add_argument('--sequence',help='sequence length for character level models',default=140,type=int)
parser.add_argument('--iterations',help='number of iterations',default=50000,type=int)
parser.add_argument('--test-interval',help='test interval',default=1000,type=int)
parser.add_argument('--destroy',help='whether to destroy model',action='store_true')
parser.add_argument('--resume',help='whether to resume training',action='store_true')
parser.add_argument('--nclasses',help='number of classes',type=int,default=2)
args = parser.parse_args()

nclasses = args.nclasses
stepsize = 15000
host = 'localhost'
port = 8080
description = 'character-based classifier'
mllib = 'caffe'
dd = DD(host,port)
dd.set_return_format(dd.RETURN_PYTHON)

# creating ML service
model_repo = args.model_repo
sequence = args.sequence
template = 'convnet'
layers = ['1CR256','1CR256','4CR256','1024','1024']
model = {'templates':'../templates/caffe/','repository':model_repo}
parameters_input = {'connector':'txt','sentences':True,'characters':True,'sequence':sequence}

## use the line below to specify the alphabet
#parameters_input['alphabet'] = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?'"#\"/\\|_@#$%^&*~`+-=<>"

parameters_mllib = {'template':template,'nclasses':nclasses,'layers':layers,'db':True,'dropout':0.5}
parameters_output = {}
dd.put_service(args.sname,model,description,mllib,
               parameters_input,parameters_mllib,parameters_output)

# training
train_data = [args.training_repo]
parameters_input = {'test_split':args.tsplit,'shuffle':True,'db':True,'characters':True,'sequence':sequence}

## comment out the line below if training from one document per file (as opposed to one or more files with one sample per line)
parameters_input['sentences'] = True
parameters_mllib = {'gpu':True,'resume':args.resume,'net':{'batch_size':128,'test_batch_size':128},'solver':{'test_interval':args.test_interval,'test_initialization':True,'base_lr':args.base_lr,'solver_type':'SGD','iterations':args.iterations,'iter_size':1,'lr_policy':'step','stepsize':stepsize,'gamma':0.5,'snapshot':args.iterations,'weight_decay':0.00001}}
parameters_output = {'measure':['mcll','f1']}
if nclasses == 2:
    parameters_output['measure'].append('auc')
# the asynchronous flag is passed positionally, since `async` is a reserved word in Python 3.7+
dd.post_train(args.sname,train_data,parameters_input,parameters_mllib,parameters_output,True)
time.sleep(1)
train_status = ''
while True:
    train_status = dd.get_train(args.sname,job=1,timeout=10)
    if train_status['head']['status'] == 'running':
        print(train_status['body']['measure'])
    else:
        print(train_status)
        break

# deleting the model, comment it out or remove the clear= parameter to keep the model.
clear = ''
if args.destroy:
    clear = 'lib'
dd.delete_service(args.sname,clear=clear)

Example

To reproduce the original paper’s results on the AGNews corpus:

python train_simple.py --model-repo /path/to/models/agnews/ --training-repo /path/to/agnews_data/ --sname agnews --tsplit 0.1 --base-lr 0.01 --sequence 1014 --iterations 150000 --test-interval 1000 --nclasses 4

This should take a few hours. The script reports the training loss and various accuracy measures every 10 seconds.
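
The training script uses Caffe's step learning rate policy, which multiplies the learning rate by gamma every stepsize iterations. The sketch below reproduces that schedule with the values of the script and of the command above (base_lr 0.01, gamma 0.5, stepsize 15000, 150000 iterations); the function name step_learning_rate is hypothetical.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.5, stepsize=15000):
    """Learning rate under the 'step' policy: base_lr * gamma^floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


# Learning rate at the start of each step of the AGNews run.
for iteration in range(0, 150000, 15000):
    print(iteration, step_learning_rate(iteration))
```

With these settings the learning rate is halved nine times over the course of the run, which gives an idea of how small the updates are by the final iterations.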

Tips & Tricks

  • ADAM doesn’t work as well as SGD with stepwise reduction of the learning rate
  • Using sequences that are too long with respect to the mean length of the texts or sentences hampers learning, due to the accumulation of too many padding zeros in character quantization
  • Short sequences (e.g. ~50) need to be accommodated with fewer stacked convolutions, otherwise the output is too narrow for the fully connected layers to capture relevant information
  • More data is needed than with BOW. Typically, character-based CNNs overfit very quickly on the classic 20 newsgroups benchmark dataset
  • weight_decay plays a role: it has been observed that setting it too high hampers convergence
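
To check the tip on sequence length before training, one can estimate how much of each sequence would be padding for a candidate length. This is a rough sketch: the function name padding_fraction is hypothetical, and it measures text length in characters as a stand-in for what the quantization would actually see.

```python
def padding_fraction(texts, sequence):
    """Share of character slots left as padding when every text is cut or padded to `sequence`."""
    if sequence <= 0 or not texts:
        raise ValueError("need a positive sequence length and at least one text")
    used = sum(min(len(text), sequence) for text in texts)
    return 1.0 - used / float(len(texts) * sequence)


samples = ["that s good", "Chilling in the West Indies", "that s gooooooood"]
for sequence in (50, 140, 1014):
    print(sequence, round(padding_fraction(samples, sequence), 3))
```

A fraction close to 1 means that most of the input is padding, which is the situation the second tip warns about; in that case a shorter sequence is worth trying.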
