This blog post shows the state-of-the-art results in Intent Classification obtained on the three corpora:
The notebook with the code and results is available here:
If you already have a basic understanding of intent classification for text, check out the original paper: Subword Semantic Hashing for Intent Classification on Small Datasets
This post is divided into two parts:
- We use a count-based vectorized hashing technique, which is enough to beat the previous state-of-the-art results on the intent classification task.
- We look into training hash-embedding-based language models to further improve the results.
Let’s start with Part 1.
We used three datasets for the intent classification task. Two of the corpora were extracted from StackExchange and the third from a Telegram chatbot. The datasets look like the following:
- AskUbuntu Corpus: 5 Intents, 162 samples
- Web Application Corpus: 8 Intents, 100 samples
- Chatbot Corpus: 2 Intents, 206 samples
More information about the datasets is available here:
We checked the previous benchmarks on these datasets. The initial results were reported in the paper ‘Evaluating Natural Language Understanding Services for Conversational Question Answering Systems’. The results looked pretty good.
Soon after, Botfuel released another benchmark that surpassed the previous results and became the new reference for all three datasets.
In short, the state of the benchmarks before our experiments is plotted in the following graph.
Since no technical description is available for any of these bot platforms, it is reasonable to assume that they use some featurizer and a classifier to classify the intents. But it cannot be said for sure what exactly each platform uses.
For the featurizer, Word2vec seems to be a good default choice for embeddings, and fastText seemed to be a good alternative. Since the datasets are very small, we decided not to use deep learning based embeddings. Also, the data is free text (sentences commonly typed by people on the internet), which raises the chance of spelling mistakes and out-of-vocabulary words. Finally, we wanted to keep the training and inference time low. So we decided to use semantic hashing for our case.
If you want to replicate the results shown above, here is the evaluation script: https://github.com/sebischair/NLU-Evaluation-Scripts
THE DATASET SPLIT
The first task before evaluation was to split the dataset into train, test and validation sets. We used stratified k-fold splits to perform k-fold cross-validation on the data.
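As a rough illustration of a stratified k-fold split, here is a minimal sketch using scikit-learn's `StratifiedKFold`. The toy utterances, the intent names and the choice of 4 folds are assumptions made for the example; they are not the settings used in our notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for utterances and their intent labels (hypothetical data).
texts = np.array([f"utterance {i}" for i in range(12)])
labels = np.array(["intent_a"] * 8 + ["intent_b"] * 4)

# Stratification keeps the 2:1 class ratio inside every fold.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = list(skf.split(texts, labels))

for fold_no, (train_idx, test_idx) in enumerate(folds):
    test_labels = labels[test_idx]
    print(fold_no, len(train_idx), len(test_idx), sorted(test_labels))
```

Each test fold here contains two `intent_a` samples and one `intent_b` sample, so a rare intent is never missing from a fold.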
However, the benchmark provided by Botfuel used a different train-test split. So, to keep the results comparable, we used the same split as Botfuel. Our final results and plots use that split.
We did not modify the test datasets in any way, nor did we look at them in advance to tune our approach towards the test set.
The dataset used by Botfuel and Recast for the benchmarking can be found here: Botfuel/benchmark-nlp-2018
THE PRE-PROCESSING STEP
We used the standard preprocessing pipeline: tokenization, lemmatization and stemming. However, lemmatization did not improve the results, so we removed it from the pipeline.
Looking at the training data distribution, we found that it was highly imbalanced. Techniques like SMOTE and ADASYN are good for data balancing, but in our case the dataset was imbalanced for only one of the intents. Undersampling was not a good idea because the training dataset was very small. Oversampling did not work either, because one specific class had very few samples (only 2). So we used some augmentation techniques to increase the data size.
We started with a simple data augmentation technique. We took the samples from the class that had fewer samples in the training split. On these samples, we performed a dictionary-based synonym replacement of nouns and verbs to generate new sentences.
Original sentence : What can I do to improve my running skills?
Augmented data generated :
* What can I do to advance my running skills?
* What can I do to better my running skills?
* What can I do to correct my running know-how?
* What can I do to correct my running proficiency?
It did help in our case. There was a gain in accuracy of 1.4 – 2 percent.
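The synonym replacement can be sketched roughly as follows. This is a simplified illustration, not the notebook code: the tiny `SYNONYMS` dictionary and the function name are hypothetical, only one word is swapped per generated sentence, and there is no part-of-speech tagging to restrict the swap to nouns and verbs.

```python
# Hypothetical hand-written synonym dictionary, for illustration only.
SYNONYMS = {
    "improve": ["advance", "better", "correct"],
    "skills": ["know-how", "proficiency"],
}

def augment_with_synonyms(sentence, synonyms=SYNONYMS):
    """Return new sentences, each with exactly one word swapped for a synonym."""
    tokens = sentence.split()
    augmented = []
    for i, token in enumerate(tokens):
        # Split off trailing punctuation so that "skills?" matches "skills".
        core = token.rstrip("?.!,")
        tail = token[len(core):]
        for synonym in synonyms.get(core.lower(), []):
            new_tokens = tokens[:i] + [synonym + tail] + tokens[i + 1:]
            augmented.append(" ".join(new_tokens))
    return augmented

new_sentences = augment_with_synonyms("What can I do to improve my running skills?")
for s in new_sentences:
    print(s)
```

A real dictionary (for example a thesaurus resource) would replace the hand-written one, and swapping several words at once would produce sentences like the last two in the list above.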
Since this is a chatbot dataset and spelling mistakes are likely, we need to take them into account too. One simple way to correct spelling mistakes is to compute the Levenshtein distance and map a misspelled word to its nearest neighbour in a dictionary.
A dictionary with vocabulary specific to the dataset can be used instead of a general English word dictionary for domain-specific spelling correction. However, our dataset is very small, so there is a high chance that a word falls outside the dataset-specific dictionary; hence we did not create a domain-specific dictionary.
However, we tried a different approach. We took each key on the keyboard and mapped it to the keys around it. For example, we mapped the key ‘s’ to the nearby keys ‘a’, ‘w’, ‘d’ and ‘z’. Typos are very likely caused by pressing a key next to the intended one. We created a list of all the keys and their nearby distances. We also restricted the nearby keys to those one key away.
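For reference, Levenshtein-based correction can be sketched in a few lines of plain Python. The vocabulary below is a made-up example and the helper names are ours; this is only meant to show the idea of mapping a word to its nearest dictionary entry.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Return the word itself if known, else its nearest vocabulary entry."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda v: levenshtein(word, v))

vocabulary = ["ubuntu", "update", "install"]  # hypothetical dictionary
print(correct("ubntu", vocabulary))
```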
We then set a Mistake Probability P. For any character C, swap it with another character C’ that is within d distance from C and has an error probability greater than P. We predefined d and P.
The above mentioned augmentation technique boosted our accuracy by 1.5–2 percent. (Enough to achieve the state-of-the-art results).
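One possible reading of the keyboard-based augmentation is sketched below. The neighbour map is partial and, apart from the ‘s’ entry, hypothetical; the function name, the probability value and the use of a seeded random generator are assumptions for the example. Each character that has known neighbours is replaced by one of them with probability `mistake_prob`.

```python
import random

# Partial neighbour map; only the 's' entry comes from the text above,
# the other entries are illustrative guesses for a QWERTY layout.
NEARBY_KEYS = {
    "s": ["a", "w", "d", "z"],
    "a": ["q", "s", "z"],
    "d": ["s", "e", "f", "c"],
    "e": ["w", "r", "d"],
}

def add_keyboard_typos(text, mistake_prob, rng):
    """Swap characters for a neighbouring key with probability mistake_prob."""
    out = []
    for ch in text:
        neighbours = NEARBY_KEYS.get(ch)
        if neighbours and rng.random() < mistake_prob:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)

rng = random.Random(0)  # seeded so the augmentation is reproducible
noisy = add_keyboard_typos("please send updates", 0.2, rng)
print(noisy)
```

The noisy copies are added to the training data only; the test set is left untouched.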
For the featurizer, we decided to use Semantic Hashing. To do this, we came up with two ideas:
1. Using a hash function, make a vocabulary and keep a count associated with it. The advantage is that it is easy to train, takes less time, and is good for smaller datasets. One big disadvantage is that the method fails when the collision rate is very high.
2. Training a language model using hash vectors in an unsupervised fashion and then using it for our use case.
In this post, we will only talk about the first idea; the second will be discussed in detail in Part II of this blog.
Given an input text T,
“I have a flying disk”
tokenize it to create a list of tokens t_i. The output of the tokenization should look like:
[“I”, “have”, “a”, “flying”, “disk”]
Pass each token into a pre-hashing function H(t_i) to generate sub-tokens t^j_i, where j is the index of the sub-tokens. E.g.:
H(have) = [#ha, hav, ave, ve#]
H(t_i) first appends a # at the beginning and at the end of a token, and then extracts trigrams from it. These trigrams are the sub-tokens t^j_i.
H(t_i) can then be applied to the entire corpus to generate sub-tokens. These sub-tokens are then used to create a Vector Space Model (VSM). This VSM should be used to extract features for a given input text. In other words, this acts as a hashing function for an input text sequence T.
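The pre-hashing step and the resulting vector space model can be sketched as follows. The function names are ours, tokenization is a plain whitespace split, and the use of scikit-learn's `CountVectorizer` with a custom analyzer is one reasonable way to build the count-based VSM; treat it as an illustration rather than the exact notebook code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def prehash(token):
    """H(t_i): pad the token with '#' on both sides and extract trigrams."""
    padded = f"#{token}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def semhash_analyzer(text):
    """Turn a text into the flat list of sub-tokens of all its tokens."""
    return [sub for token in text.split() for sub in prehash(token)]

print(prehash("have"))

# Toy corpus (hypothetical); every distinct trigram becomes one feature.
corpus = ["I have a flying disk", "I have a disk"]
vectorizer = CountVectorizer(analyzer=semhash_analyzer)
X = vectorizer.fit_transform(corpus)
print(X.shape)
```

Because the features are character trigrams, a misspelled word still shares most of its sub-tokens with the correct spelling, which is what makes this featurizer tolerant of typos and out-of-vocabulary words.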
Some other work has been done in the past to overcome the shortcomings of embedding-based approaches by training hash embeddings. The following diagram shows a couple of ways to train hash embeddings.
We benchmarked the results across multiple classifiers to evaluate the results of using semantic hashing as featurizer. For that, we used all the classifiers provided by Scikit-learn.
For most classifiers, we used the default parameters but we did make some changes in the parameters of some classifiers like defining a prior value based on the class distribution for Naive Bayes, a grid search for the number of hidden layers in MLP and so on. All the information is available in the Notebook.
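As an example of setting a prior from the class distribution, here is a minimal sketch with scikit-learn's `MultinomialNB`. The toy count features and labels are invented for illustration, and the exact Naive Bayes variant and prior used in the notebook may differ.

```python
from collections import Counter

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features and imbalanced labels (hypothetical data).
X_train = np.array([[2, 0], [3, 1], [2, 1], [4, 0],
                    [3, 0], [2, 0], [0, 3], [1, 4]])
y_train = ["a"] * 6 + ["b"] * 2

# Priors must follow the sorted class order used by scikit-learn.
classes = sorted(set(y_train))
counts = Counter(y_train)
priors = [counts[c] / len(y_train) for c in classes]

clf = MultinomialNB(class_prior=priors).fit(X_train, y_train)
print(priors, clf.predict([[0, 5]]))
```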
We used the same training and test datasets for all the classifiers and plotted the results of the same.
We used a grid search on the KNN, MLP and SGD classifiers to find the best hyper-parameters.
While training various classifiers, we plotted the accuracies of the classifiers along with their train-test time on a single graph to see how various classifiers perform with SemHash. The following are some of the plots obtained:
Results were evaluated based on the Precision, Recall, F1 score and the Accuracy. For benchmarking and comparison, we stuck with the F1 score as it was used in the previous benchmarks.
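A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`, shown here for KNN only. The synthetic data, the parameter grid, the 3-fold cross-validation and the `f1_micro` scoring are all assumptions for the example; the grids we actually searched are in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the SemHash feature matrix (hypothetical data).
X, y = make_classification(n_samples=60, n_features=10, random_state=0)

# Hypothetical grid; every combination is scored with cross-validation.
param_grid = {
    "n_neighbors": [1, 3, 5],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="f1_micro", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```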
Classifiers like Random Forest and SGD Classifier were consistent with good results across all the datasets while the F1 score went up and down a bit for the others.
NOTE: The results in the graph are taken from the Recast benchmarks. The average of Watson over the 3 corpora, (0.97+0.92+0.83)/3 ≈ 0.907, does not match the result of 92 given in the average section. A weighted average, (0.97*100+0.92*53+0.83*30)/183 ≈ 0.933, does not equal 92 either.
So we are uncertain about the averaging method used there. In our case, we simply took the average over the 3 datasets.
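The two averages in the note can be checked with a few lines of Python. The scores and weights are the ones quoted in the note; the variable names are ours.

```python
# Watson F1 scores on the three corpora and the weights quoted in the note.
scores = [0.97, 0.92, 0.83]
weights = [100, 53, 30]

simple_avg = sum(scores) / len(scores)
weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(round(simple_avg, 4), round(weighted_avg, 4))
```

Neither value rounds to 0.92, which is why we report the plain average over the 3 datasets.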
THE PRINCESS CAKE (Klassisk Prinsesstårta)
Time to take a bite.
The results were achieved during a HackNight at Lulea Technical University, in Prof. Marcus Liwicki's lab, along with Fotini, Pedro, Ayushman, Amit, Vinay and Gustav. A big thanks to all of them for their contributions, individually and as a team.