Know your Intent: State of the Art results in Intent Classification for Text

F1 Score comparison of various platforms with our method on three datasets

This blog post shows the state-of-the-art results in Intent Classification obtained on the three corpora:

  1. Ask Ubuntu Corpus
  2. Web Application Corpus
  3. Chatbot Corpus

The notebook with the code and results is available here:

If you already have a basic understanding of the Intent classification for text, check out the original paper:

Subword Semantic Hashing for Intent Classification on Small Datasets

This post is divided into two parts:

  1. We use a count-based vectorized hashing technique, which is enough to beat the previous state-of-the-art results in the intent classification task.
  2. We look into training hash-embedding-based language models to further improve the results.

Let’s start with Part 1.

THE CHALLENGE

We worked with three datasets for the intent classification task. Two of the corpora were extracted from StackExchange and the third from a Telegram chatbot. The datasets look like this:

  1. AskUbuntu Corpus: 5 Intents, 162 samples
  2. Web Application Corpus: 8 Intents, 100 samples
  3. Chatbot Corpus: 2 Intents, 206 samples

More information about the datasets is available here:

THE BENCHMARKS

We checked the previous benchmarks on these datasets. The initial results were reported in the paper ‘Evaluating Natural Language Understanding Services for Conversational Question Answering Systems’. The results looked pretty good.

Soon after, Botfuel released another benchmark that surpassed the previous results and became the new reference for all three datasets.

Benchmarking results for all three datasets as released by Botfuel
Botfuel/benchmark-nlp-2018

Also, in January 2018, Snips ran an additional series of experiments to evaluate their Natural Language Understanding (NLU) solution against RASA NLU. Below are the results obtained.

source

In short, the state of the benchmarks before our experiments is summarized in the following graph.

source

Since no technical description is available for any of these bot platforms, it is reasonable to assume that they use some featurizer and a classifier to classify the intents. But we cannot say for sure which platform uses what.

For the featurizer, Word2vec seems to be a good default choice for embeddings, and fastText seemed to be a good alternative. Since the datasets are very small, we decided not to use deep-learning-based embeddings. The data is also free text (sentences typed by people on the internet), which raises the chance of spelling mistakes and out-of-vocabulary words. Finally, we wanted to keep training and inference time low. So we decided to use semantic hashing for our case.

If you want to replicate the results shown above, here is the evaluation script: https://github.com/sebischair/NLU-Evaluation-Scripts

THE DATASET SPLIT

The first task before evaluation was to split the dataset into train, validation and test sets. We used stratified k-fold splits to perform k-fold cross-validation on the data.
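
To make the idea of a stratified split concrete, here is a minimal, dependency-free sketch. It is an illustration only, not the code from our notebook: the function name `stratified_k_fold` and the toy labels are assumptions. Libraries such as scikit-learn offer the same thing through `StratifiedKFold`.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds."""
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the indices of every class round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves as the test set once; the rest is the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx

# Toy, imbalanced label list (hypothetical data).
labels = ["a"] * 6 + ["b"] * 3
splits = list(stratified_k_fold(labels, k=3))
```

Every test fold in this toy example keeps the 2:1 class ratio of the full label list, which is the point of stratifying on small, imbalanced data.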

However, the benchmark provided by Botfuel used a different train-test split. To keep our results comparable, we used the same split as Botfuel. Our final results and plots use that split.

Train Test Split as used in other benchmarks

We did not modify the test datasets in any way, nor did we look at them in advance to tune our approach towards the test set.

The dataset used by Botfuel and Recast for the benchmarking can be found here:

Botfuel/benchmark-nlp-2018

THE PRE-PROCESSING STEP

We used the standard preprocessing pipeline: tokenization, lemmatization and stemming. However, lemmatization did not improve the results, so we removed it from the pipeline.

Looking at the training data distribution, we found that it was highly imbalanced. Techniques like SMOTE and ADASYN are good for data balancing, but in our case the dataset was imbalanced for only one of the intents. Undersampling was not a good idea because the training dataset was very small. Oversampling did not work either, because one specific class had very few samples (only 2). So we used some augmentation techniques to increase the data size.

THE AUGMENTATION

We started with a simple data augmentation technique. We took the samples from the class that had fewer samples in the training split. On these samples, we performed a dictionary-based synonym replacement of nouns and verbs to generate new sentences.

Original sentence : What can I do to improve my running skills?
Augmented data generated :
* What can I do to advance my running skills?
* What can I do to better my running skills?
* What can I do to correct my running know-how?
* What can I do to correct my running proficiency?

It did help in our case. There was a gain in accuracy of 1.4 – 2 percent.
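
A rough sketch of dictionary-based synonym replacement could look like the following. The `SYNONYMS` dictionary and the function name are hypothetical. For simplicity the sketch replaces one word per generated sentence and skips part-of-speech tagging by listing only the nouns and verbs of interest in the dictionary.

```python
# Hypothetical toy synonym dictionary; a real one would come from a thesaurus.
SYNONYMS = {
    "improve": ["advance", "better", "correct"],
    "skills": ["know-how", "proficiency"],
}

def augment_with_synonyms(sentence, synonyms):
    """Return new sentences, each with exactly one word replaced by a synonym."""
    tokens = sentence.split()
    augmented = []
    for i, token in enumerate(tokens):
        # Separate trailing punctuation so that "skills?" matches "skills".
        core = token.rstrip("?.!,")
        tail = token[len(core):]
        for synonym in synonyms.get(core.lower(), []):
            new_tokens = tokens[:i] + [synonym + tail] + tokens[i + 1:]
            augmented.append(" ".join(new_tokens))
    return augmented

variants = augment_with_synonyms(
    "What can I do to improve my running skills?", SYNONYMS
)
```

Replacing several words at once, as in some of the augmented sentences listed above, would multiply the number of variants quickly, so how many to keep is a design choice.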

Since this is a chatbot dataset, where spelling mistakes are more likely, we need to take them into account too. One simple way to correct a spelling mistake is to compute the Levenshtein distance and map the misspelled word to its nearest neighbour in a dictionary.

A dictionary with vocabulary specific to the dataset can be used instead of a general English dictionary for domain-specific spelling correction. However, our dataset is very small, so there is a high chance of encountering a word that is missing from a dataset-specific dictionary. Hence we did not create one.
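
For reference, the Levenshtein-based correction described above (the option we did not pursue) can be sketched in a few lines. The `VOCABULARY` list is a hypothetical stand-in for a real dictionary, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete and substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def correct(word, vocabulary):
    """Map a word to its nearest vocabulary entry; known words stay unchanged."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

# Hypothetical dictionary for illustration only.
VOCABULARY = ["install", "upgrade", "ubuntu", "password"]
fixed = correct("pasword", VOCABULARY)
```

The weakness mentioned above shows up directly here: any word missing from `VOCABULARY` is forced onto its closest entry, even when it was spelled correctly.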

However, we tried a different approach. We took each key on the keyboard and mapped the keys nearest to it. For example, for the key ‘s’ we mapped the nearby keys ‘a’, ‘w’, ‘d’ and ‘z’. Typos are very likely caused by pressing a key next to the intended one. We created a list of all the keys and their distances to nearby keys. We also limited the nearby keys to the immediately surrounding ones.

We then set a mistake probability P. Any character C is swapped with another character C’ that is within distance d of C and has an error probability greater than P. Both d and P were predefined.
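
A minimal sketch of this keyboard-based typo augmentation is shown below. Several things are assumptions: only the neighbours of ‘s’ come from the text, the other map entries are illustrative, the distance d is baked into the map as "adjacent keys only", and each character draws a random score that is compared against the threshold P, which is one possible reading of the rule above.

```python
import random

# Neighbor map: the entry for "s" is the one given in the text; the other
# entries are illustrative assumptions for a QWERTY layout.
KEY_NEIGHBORS = {
    "s": ["a", "w", "d", "z"],
    "a": ["q", "s", "z"],
    "e": ["w", "r", "d"],
    "i": ["u", "o", "k"],
    "n": ["b", "m", "h"],
}

def add_typos(text, threshold, rng):
    """Swap a character for a neighboring key when its random draw exceeds `threshold`."""
    noisy = []
    for char in text:
        neighbors = KEY_NEIGHBORS.get(char.lower())
        if neighbors and rng.random() > threshold:
            noisy.append(rng.choice(neighbors))  # simulate hitting an adjacent key
        else:
            noisy.append(char)  # keep the character unchanged
    return "".join(noisy)

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
noisy_sentence = add_typos("how can i reset my password", threshold=0.8, rng=rng)
```

A higher threshold produces fewer typos. Characters without an entry in the map (spaces, digits, unmapped letters) are always left untouched.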

Cartesian points for keys from others

The above mentioned augmentation technique boosted our accuracy by 1.5–2 percent. (Enough to achieve the state-of-the-art results).

THE FEATURIZER

For the featurizer, we decided to use Semantic Hashing. To do this, we came up with two ideas:

1. Using a hash function, build a vocabulary and keep a count associated with each entry. The advantage is that it is easy to train, takes less time, and works well for smaller datasets. One big disadvantage is that the method fails when the collision rate is very high.

2. Training a language model using hash vectors in an unsupervised fashion and then using it for our use case.

In this post, we will only talk about the first approach; the second will be discussed in detail in Part II of this blog.

Original idea of semantic Hashing

Given an input text T,

“I have a flying disk”

tokenize it to create a list of tokens t_i. The output of the tokenization should look like:

[“I”, “have”, “a”, “flying”, “disk”]

Pass each token into a pre-hashing function H(t_i) to generate sub-tokens t^j_i, where j is the index of the sub-tokens. E.g.:

H(have) = [#ha, hav, ave, ve#]

H(t_i) first appends a # at the beginning and at the end of a token, and then extracts trigrams from it. These trigrams are the sub-tokens t^j_i.
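
The pre-hashing function H described above is short enough to write down directly. This is a sketch with illustrative function names, and whitespace tokenization is a simplifying assumption.

```python
def pre_hash(token):
    """Wrap the token in '#' markers and return its character trigrams."""
    padded = f"#{token}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def semhash_tokens(text):
    """Tokenize on whitespace and collect the sub-tokens of every token."""
    return [sub for token in text.split() for sub in pre_hash(token)]

have_sub_tokens = pre_hash("have")
sentence_sub_tokens = semhash_tokens("I have a flying disk")
```

Here `pre_hash("have")` yields `#ha`, `hav`, `ave`, `ve#`, matching the example above, and a one-letter token such as “a” yields the single trigram `#a#`.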

H(t_i) can then be applied to the entire corpus to generate sub-tokens. These sub-tokens are then used to create a Vector Space Model (VSM). This VSM should be used to extract features for a given input text. In other words, this acts as a hashing function for an input text sequence T.
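
To show how the sub-tokens become features, here is a small count-based vector space model in plain Python. The helper names and the toy corpus are assumptions for illustration, and a library vectorizer could fill the same role.

```python
from collections import Counter

def sub_tokens(text):
    """Split on whitespace, pad each token with '#', and emit character trigrams."""
    result = []
    for token in text.lower().split():
        padded = f"#{token}#"
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

def build_vocabulary(corpus):
    """Assign a column index to every sub-token seen in the training corpus."""
    vocabulary = {}
    for text in corpus:
        for sub in sub_tokens(text):
            vocabulary.setdefault(sub, len(vocabulary))
    return vocabulary

def vectorize(text, vocabulary):
    """Count-based feature vector; sub-tokens outside the vocabulary are ignored."""
    counts = Counter(sub_tokens(text))
    vector = [0] * len(vocabulary)
    for sub, count in counts.items():
        if sub in vocabulary:
            vector[vocabulary[sub]] = count
    return vector

corpus = ["I have a flying disk", "how to install ubuntu"]  # toy examples
vocabulary = build_vocabulary(corpus)
features = vectorize("I have a flyng disk", vocabulary)  # note the typo "flyng"
```

Even though “flyng” never appears in the toy corpus, it still shares the trigrams `#fl`, `fly` and `ng#` with “flying”, so the misspelled sentence lands close to the original one in feature space. This robustness to typos is the reason for working at the sub-word level.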

Some other work has been done in the past to overcome the shortcomings of embedding-based approaches by training hash embeddings. The following diagram shows a couple of ways to train hash embeddings.

Source
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

THE CLASSIFIER

We benchmarked the results across multiple classifiers to evaluate the results of using semantic hashing as featurizer. For that, we used all the classifiers provided by Scikit-learn.

For most classifiers, we used the default parameters but we did make some changes in the parameters of some classifiers like defining a prior value based on the class distribution for Naive Bayes, a grid search for the number of hidden layers in MLP and so on. All the information is available in the Notebook.
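
As an example of such a parameter change, class priors for Naive Bayes can be derived from the label distribution of the training set. The sketch below only computes the priors. The label names are made up, and the exact settings we used are the ones in the Notebook, not this snippet.

```python
from collections import Counter

def class_priors(labels):
    """Relative frequency of each class, keyed by label in sorted order."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Toy, imbalanced label list (hypothetical intent names).
labels = ["intent_a"] * 3 + ["intent_b"] * 1
priors = class_priors(labels)
```

In scikit-learn, a list of such values in class order can be passed to a Naive Bayes classifier through its `class_prior` parameter.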

We used the same training and test datasets for all the classifiers and plotted the results of the same.

We used a grid search on the KNN, MLP and SGD classifiers to find the best hyperparameters.

THE PLOT

While training the various classifiers, we plotted their accuracies along with their train and test times on a single graph to see how they perform with SemHash. Here are some of the resulting plots:

Accuracy plots on all three datasets
Accuracy and Train-Test time plots for AskUbuntu Corpus
Accuracy and Train-Test time plots for Chatbot Corpus
Accuracy and Train-Test time plots for WebApplication Corpus

THE RESULTS

Results were evaluated using precision, recall, F1 score and accuracy. For benchmarking and comparison, we stuck with the F1 score as it was used in the previous benchmarks.

Classifiers like Random Forest and SGD Classifier were consistent with good results across all the datasets while the F1 score went up and down a bit for the others.

Result Comparison of all platforms

NOTE: The results in the graph are taken from the Recast benchmarks. The simple average of Watson over the 3 corpora, (0.97+0.92+0.83)/3 ≈ 0.9067, does not match the result of 92 mentioned in the average section. A weighted average, (0.97*100+0.92*53+0.83*30)/183 ≈ 0.9326, does not equal the reported 92 either.

So, we are totally uncertain about the averaging method used above. In our case, we simply did an average over the 3 datasets.
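
The two averages from the note above are easy to re-check. The scores and weights below are the ones quoted in the note, and the variable names are only for illustration.

```python
scores = [0.97, 0.92, 0.83]   # Watson F1 scores on the three corpora, as quoted above
weights = [100, 53, 30]       # weights used in the weighted average above

# Plain mean over the three datasets (the method we used for our own results).
simple_average = sum(scores) / len(scores)

# Mean weighted by the per-dataset weights.
weighted_average = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

Rounded to two decimals these give 0.91 and 0.93, so neither reproduces the reported 0.92.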

F1 Score comparison of different platforms with our method

THE PRINCESS CAKE (Klassisk Prinsesstårta)

Time to take a bite.




The results were achieved during a HackNight at Lulea Technical University, in Prof. Marcus Liwicki's lab, along with Fotini, Pedro, Ayushman, Amit, Vinay and Gustav. A big thanks to all of them for their contributions, individually and as a team.

The A-Team



