Benchmarking Named Entity Recognition: StanfordNLP, IBM, spaCy, Dialogflow, and TextSpace

In this post, we compare TextSpace’s Named Entity Recognition (NER) against state-of-the-art solutions from StanfordNLP, IBM, spaCy and Google’s Dialogflow.


NER is a sub-task of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

NER is one of the features of TextSpace. To see how TextSpace compares on another feature, intent classification, read this post. We have created our own small test data set with 11 examples taken from Google’s Taskmaster 2 data set, which was released in February 2020. We see this data set as the benchmark for future research and products in NER, so we are excited to explore how TextSpace already compares to the aforementioned products.

The phrases in this data set vary in length and in the information they contain. We picked the ones with a lot of entities in them, so that we can separate the wheat from the chaff among state-of-the-art NER solutions. The phrases in our test set come from diverse domains, because we wanted to see how flexible the compared solutions are. Note that Dialogflow’s pre-defined agents were used on these examples, so we expect Dialogflow to perform nearly perfectly on them. We first go through two examples in detail and close with a table that summarises all of our findings.

Flight ticket reservations

Let’s start with an example phrase from a user who wants to book a flight: “So, I would like to fly out sometime tonight and fly back in the evening in 4 days. From I’m looking to go to Denver. I’m flying out of San Francisco.”

IBM’s MAX Named Entity Tagger does not perform well and merely returns which words belong to the categories time and location. Unfortunately, it even misclassifies the word “I” at the beginning of the second sentence as an event. Compared to this, the StanfordNLP NER model performs much better. All possible entities are found and even the correct date is returned. However, it makes one slight mistake: it does not recognise that “evening” and “in 4 days” belong together. It classifies “evening” as the evening of today and not the evening in four days.



The spaCy NER model performs on par with IBM’s and only returns which words belong to the categories time, date or location. It does not return the exact dates as StanfordNLP does.


Our solution TextSpace performs very well in this example and extracts all possible entities with the exact dates, similar to StanfordNLP. In addition, TextSpace returns an interval for time periods that are not very specific, such as “evening” or “tonight”. Here are the locations first:


And the time:


Unfortunately, TextSpace also fails to recognise that “evening” and “in 4 days” belong together and returns the evening of the day of writing, exactly as StanfordNLP does.
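To make the two readings of “in the evening in 4 days” concrete, here is a minimal sketch of how a vague day part could be resolved to an interval, with and without the day offset. It is not TextSpace’s or StanfordNLP’s implementation: the function `resolve_day_part`, the `DAY_PARTS` table and its clock boundaries are all hypothetical and chosen only for illustration.

```python
from datetime import date, datetime, time, timedelta

# Hypothetical boundaries for vague day parts (an assumption for this sketch,
# not the intervals any of the compared products actually use).
DAY_PARTS = {
    "morning": (time(6, 0), time(12, 0)),
    "afternoon": (time(12, 0), time(18, 0)),
    "evening": (time(18, 0), time(22, 0)),
}


def resolve_day_part(part, today, days_ahead=0):
    """Return (start, end) datetimes for a day part, shifted by days_ahead days."""
    start_time, end_time = DAY_PARTS[part]
    day = today + timedelta(days=days_ahead)
    return datetime.combine(day, start_time), datetime.combine(day, end_time)


day_of_writing = date(2020, 3, 1)  # arbitrary example date

# Reading the tools returned: "evening" alone, i.e. the evening of the day of writing.
evening_today = resolve_day_part("evening", day_of_writing)

# Intended reading: "evening" and "in 4 days" combined into one expression.
evening_in_4_days = resolve_day_part("evening", day_of_writing, days_ahead=4)

print(evening_today)
print(evening_in_4_days)
```

The only difference between the two calls is the `days_ahead` offset, which is exactly the piece of information that gets lost when “evening” and “in 4 days” are treated as separate entities.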



Using Dialogflow’s pre-built flight booking agent, we get the following mediocre results. It correctly identifies the date and time as well as the departure and arrival cities, but misses the “in 4 days” completely. This is quite disappointing given that we use a Google data set and Dialogflow’s domain-specific agent for booking flight tickets.


Hotel Booking

Next, we compare the five solutions on the following sentence:
“I would like to book a room in the Park Hyatt Aviara Resort and it costs $279 per night. It’s rated 4.8 stars. It offers an 18-hole golf course, an outdoor pool & tennis courts plus a spa & fine dining.”

We start again with IBM’s MAX Named Entity Tagger. The hotel we would like to stay at is recognised as a location and the “&” wrongly as an organisation. The price “$279”, for example, is completely missed. As with our previous flight booking example, the performance is quite poor. StanfordNLP does a better job and finds the “$279” as money, the time of the stay and the number of stars. Unfortunately, our hotel is again categorised as a location rather than an organisation. This may be good enough for some applications, but it may not work when the user should be automatically forwarded to the hotel’s website. It also does not recognise that “per night” belongs to the price of $279, and it returns “night” as the date.





We use Dialogflow’s pre-trained hotel booking agent, so we would naturally expect its results to be at least as good as the ones found by StanfordNLP. Unfortunately, this is not the case, and not even the price of “$279” is found. It correctly classifies “Hyatt” as the venue chain, but wrongly classifies the business name as “Resort and it costs $279”. The date is also missed completely.


spaCy does as poor a job as in the previous example. Our hotel “Park Hyatt Aviara Resort” is classified as a facility (FAC), which is not quite right, but “279” is correctly categorised as a monetary value. Unfortunately, “spa & fine” is wrongly seen as an organisation.


TextSpace does the best job of all five solutions here. It finds “Hyatt” as an organisation and “$279” as a monetary value, skips “night” because it knows that $279 is the price per night, and finds all numbers.

Correct classification of “Hyatt” and “$279” as organisation and amount-of-money, respectively.
Both numbers are found.

Overall results

As promised, we now look at how all five solutions compare across all 11 examples in our test set. The methodology is straightforward: we know which categories each solution is pre-trained on, and we measure what percentage of all possible entities in those categories is found in the example phrases.

For example, take the sentence “So, I would like to fly out sometime tonight and fly back in the evening in 4 days. From I’m looking to go to Denver. I’m flying out of San Francisco.”, which we used previously for the flight ticket booking. A solution would achieve 100% precision and recall if it classified “tonight” and “evening in 4 days” as the correct dates with time frames, and “Denver” and “San Francisco” as locations or cities. In other words, all potential entities were found and all were correctly classified. As soon as one of them is missed, let’s say “evening in 4 days”, only three out of four are found, which gives a recall of 75%. If all four are found but an extra word is misclassified, let’s say “I” as an organisation, the precision drops to 4 out of 5, or 80%. If a date was found but only the label “time” or “date” was returned without the precise date or time, we subtracted 0.5.

TextSpace’s NER performs best of the five solutions, closely followed by StanfordNLP. IBM’s MAX Named Entity Tagger disappoints overall.
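The scoring rules described above can be sketched in a few lines of Python. This is one illustrative reading of them, not the code used for the benchmark: the `Prediction` class, the `score` function, the category names and the exact-text matching are assumptions made for this sketch. In particular, the 0.5 deduction is modelled as half credit for an entity that is found with only a bare label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    text: str              # the span the tool marked as an entity
    category: str          # the category the tool assigned
    resolved: bool = True  # False if only a bare label such as "date" came back


def score(gold, predictions):
    """Return (precision, recall) for one phrase.

    gold maps each expected entity text to its expected category.
    A correct but unresolved prediction earns 0.5 instead of 1.
    """
    credit = 0.0
    matched = set()  # avoid counting the same gold entity twice
    for p in predictions:
        if p.text not in matched and gold.get(p.text) == p.category:
            matched.add(p.text)
            credit += 1.0 if p.resolved else 0.5
    precision = credit / len(predictions) if predictions else 0.0
    recall = credit / len(gold) if gold else 0.0
    return precision, recall


# The four entities of the flight booking example.
gold = {
    "tonight": "datetime",
    "evening in 4 days": "datetime",
    "Denver": "location",
    "San Francisco": "location",
}

all_four = [Prediction(text, category) for text, category in gold.items()]
missed_one = [p for p in all_four if p.text != "evening in 4 days"]
extra_wrong = all_four + [Prediction("I", "organisation")]

print(score(gold, all_four))     # (1.0, 1.0)
print(score(gold, missed_one))   # (1.0, 0.75)
print(score(gold, extra_wrong))  # (0.8, 1.0)
```

The three calls reproduce the cases from the text: everything correct, one entity missed (recall 75%), and one spurious entity on top of the four correct ones (precision 80%).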

And here are the numbers behind these bar charts:



Reproducing our results

If you want to reproduce these results, here are the 11 phrases we used:
- Flight booking 1: “So, I would like to fly out sometime tonight and fly back in the evening in 4 days. From I’m looking to go to Denver. I’m flying out of San Francisco.”
- Flight booking 2: “Okay, you got it so it looks like United Airlines leaves at 9:20 p.m. that is nonstop the flight duration is 2 hours and 28 minutes and is priced at $337.”
- Flight booking 3: “I found a flight that leaves Seattle coming Monday at 7:35 a.m and arrives in Tampa at 4:10 p.m.”
- Hotel booking 1: “Park Hyatt Aviara resort Golf Club and Spa, it’s $279 per night. It’s rated 4.8 stars. Resort offering an 18-hole golf course, an outdoor pool & tennis courts plus a spa & fine dining.”
- Hotel booking 2: “Staybridge Suites Carlsbad, it’s $145 per night. It’s rated 4.5 stars. Warm suites with kitchens in a relaxed property featuring an outdoor pool, a gym & a BBQ area.”
- Cinema 1: “The Mummy is playing at 4:30 pm this afternoon at Regal Davis Stadium 5.”
- Cinema 2: “I have Chips playing at 9:50 PM. I have Get Out playing at 10:15 PM. and Snaps playing at 10:25 PM.”
- Music 1: “Here’s a song diamonds originally by Rihanna Cover by One Voice Children’s Choir.”
- Restaurant 1: “Hi, currently I’m in IKEA, California. I’m looking for restaurant to eat dinner.”
- Restaurant 2: “I found a highly rated restaurant called Second Floor in Kitchen, it is 4.3 stars out of 5 and it’s described is New American tasting menus in a sophisticated setting. How does it sound?”
- Sports 1: “On last Saturday, September 9th, LA Dodgers played against New York Red Bulls and it was a draw with the score 1–1.”

See our GitHub repository for more details on the implementation.

