People, Places, Things! Evaluating Named Entity Recognition
Oct 7, 2020

An evaluation of Named Entity Recognition models for commercial NLP offerings.

Photo by Max Chen on Unsplash


As part of our series on AI/ML model evaluations, the Anno.Ai data science team delved into the world of Natural Language Processing (NLP). Many of our customers have NLP needs, so we decided to explore a variety of online and offline NLP libraries and services. For this task we looked into Named Entity Recognition (NER), and this article focuses specifically on online vendors.

What is NER?

Given a collection of documents or other unstructured text, it is useful to be able to identify and extract information that falls into some predefined category, be it people, places, or things. NER provides us the ability to extract useful concepts from large bodies of text without a human ever having to read the document!

Photo by Patrick Tomasso on Unsplash

Commercial Models

There is a wide variety of models available for such a task. For this evaluation, we chose models based on how well the general model extended to customer-specific use cases, as well as ease of installation and the ability to acquire trial licenses.

For commercial vendors, we evaluated the following companies on their performance on NER tasks:

  • AWS Comprehend: The Comprehend API provides endpoints for analyzing text documents, including extracting entities, key phrases, language, and sentiment. It can also be used for document clustering and topic modeling.
  • Microsoft Azure Cognitive Services: Azure’s Text Analytics API provides endpoints for sentiment analysis, language identification, key phrase extraction, and NER.
  • Finch Text Analytics: The Finch for Text API focuses on entity extraction from text, and also includes capabilities for entity disambiguation and additional enrichments. It also supports sentiment analysis.
  • Rosette Text Analytics: The Rosette Text Analytics API supports NER but also provides access to a range of lower level NLP features such as tokenization, language identification, and sentence tagging. It supports topic modeling, transliteration, name deduplication, and sentiment analysis as well.

Evaluation Data Set

For practical purposes, we recommend using an established tagged dataset in evaluations as opposed to making one from scratch. The world of NER becomes fuzzy very quickly, and tagging methods and results are open to interpretation. For this evaluation, we chose the OntoNotes dataset, a widely used dataset that has been collectively tagged and reviewed by professionals. OntoNotes provides free text with the tagged entities embedded in the text itself. The dataset is available for download from the Linguistic Data Consortium, though you will need to register for an account and be approved for download. For this evaluation, we focused exclusively on the English text.

Photo by Waldemar Brandt on Unsplash

Evaluation Methodology

To evaluate and score the different services, we separated the entities from the original text. This produced two things per document: an input document, which was simply the original text without any entities specified, and a ‘truth’ set, which was a tokenized version of the input document mapping each word to its associated entity. In total, the evaluation used approximately 3,600 documents containing 200,000+ relevant tags.
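
The separation step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the evaluation. It assumes the entities are embedded as inline `<ENAMEX TYPE="...">` tags and uses plain whitespace tokenization; the function name `split_tagged` and the `"O"` label for untagged words are choices made for this example.

```python
import re

# Assumed inline markup, e.g. <ENAMEX TYPE="PERSON">Joe Sherman</ENAMEX>
TAG_RE = re.compile(r'<ENAMEX TYPE="([^"]+)"[^>]*>(.*?)</ENAMEX>', re.DOTALL)

def split_tagged(tagged):
    """Return (plain_text, truth), where truth is a list of (token, label) pairs."""
    plain_parts = []
    truth = []
    pos = 0
    for match in TAG_RE.finditer(tagged):
        # Text between the previous entity and this one carries no entity label.
        outside = tagged[pos:match.start()]
        plain_parts.append(outside)
        truth.extend((tok, "O") for tok in outside.split())
        # Every whitespace-separated token inside the tag gets the entity type.
        label, span = match.group(1), match.group(2)
        plain_parts.append(span)
        truth.extend((tok, label) for tok in span.split())
        pos = match.end()
    # Remaining text after the last entity.
    tail = tagged[pos:]
    plain_parts.append(tail)
    truth.extend((tok, "O") for tok in tail.split())
    return "".join(plain_parts), truth

sample = '<ENAMEX TYPE="PERSON">Joe Sherman</ENAMEX> studied at <ENAMEX TYPE="ORG">Virginia Tech</ENAMEX> .'
plain_text, truth = split_tagged(sample)
```

The plain text is what would be sent to each service, and the `(token, label)` list is what its output would be scored against.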

To measure performance quantitatively, the tokenized NER output of each document for each service was compared to its tokenized truth counterpart. Each output word was compared to the truth file and assigned a value of True Positive (TP), False Positive (FP), True Negative (TN), or False Negative (FN). For each service, the TP, FP, TN, and FN counts were summed across all documents, and precision, recall, and F1 scores were calculated per entity (we looked at ‘person’, ‘locations’, ‘organizations’, and ‘dates’ as the core entity set for the evaluation). Furthermore, we stepped through a range of confidence thresholds for each detection to get a feel for overall performance. The thresholds ran from 0% to 90% in 10% increments, plus 95% and 99% (which is why, in some cases, we see sharp drop-offs in models that do not provide confidences of 100%).
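
As a rough sketch of the scoring described above (again, not the evaluation's actual code): it assumes each service's output has already been aligned token by token with the truth labels, that every prediction carries a confidence between 0 and 1, and that predictions below the threshold are treated as untagged. The names `count_tokens` and `precision_recall_f1` are hypothetical.

```python
from collections import Counter

# Thresholds from the text: 0% to 90% in 10% steps, plus 95% and 99%.
THRESHOLDS = [i / 10 for i in range(10)] + [0.95, 0.99]

def count_tokens(truth, predicted, entity, threshold):
    """Count TP/FP/FN/TN for one entity type in one document.

    truth: list of gold labels, one per token.
    predicted: list of (label, confidence) pairs aligned with truth (assumed).
    """
    counts = Counter()
    for gold, (label, confidence) in zip(truth, predicted):
        # A detection below the threshold is treated as "no entity".
        pred = label if confidence >= threshold else "O"
        if pred == entity and gold == entity:
            counts["TP"] += 1
        elif pred == entity:
            counts["FP"] += 1
        elif gold == entity:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

def precision_recall_f1(counts):
    """Compute precision, recall, and F1 from summed counts."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy document: four tokens with gold labels and one service's predictions.
truth = ["PERSON", "PERSON", "O", "ORG"]
predicted = [("PERSON", 0.9), ("O", 1.0), ("PERSON", 0.4), ("ORG", 0.8)]

for threshold in THRESHOLDS:
    totals = Counter()
    # In a full run, this update would be repeated for every document.
    totals.update(count_tokens(truth, predicted, "PERSON", threshold))
    print(threshold, precision_recall_f1(totals))
```

Summing the counts before computing the metrics gives one score per entity per service at each threshold, rather than an average of per-document scores.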

Final Metrics — Precision and Recall at Incremented Confidences for Online NER Services, Image by Anno.Ai


Overall Accuracy

Overall, Finch provided very consistent performance across all tags of interest. It was not only more reliable in performance, but also less sensitive to input variance.

Consistency Across Longer Documents

Many online text processing and NER service providers limit the length of text that can be processed in a single call. To get around these limitations, documents can be split into more manageable pieces, often by paragraph. Some of the other providers showed variance in the tags returned before and after splitting documents (meaning the service would tag entities differently depending on how the document was split). Finch was the only provider to prove resilient to this variance.
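
One simple way to do the paragraph splitting described above is sketched here. It assumes paragraphs are separated by blank lines and that the service's limit is expressed in characters; the function name `chunk_paragraphs` and the `max_chars` parameter are hypothetical, and real limits vary by provider.

```python
from typing import List

def chunk_paragraphs(text: str, max_chars: int) -> List[str]:
    """Group whole paragraphs into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = current + "\n\n" + paragraph if current else paragraph
        if len(candidate) <= max_chars:
            # The paragraph still fits in the chunk being built.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Note: a single paragraph longer than max_chars is kept whole
            # here; a real pipeline would need to split it further.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

document = "aaa\n\nbbb\n\nccccccccc"
pieces = chunk_paragraphs(document, max_chars=8)
```

Each chunk can then be sent in its own call and the returned entities merged. Comparing the merged tags with those from the unsplit document is one way to check for the variance noted above.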

Additional Enrichments

While the other services performed well overall, Finch really stood out! Rather than just providing annotations and NER tags, it also provided a detailed level of enrichment that surpassed all other vendors by a long shot. Examples of enrichment included gender, city populations, and geographic coordinates of places, as well as entity disambiguation. A sample output from Finch is shown below.

Sample output from Finch — text extracted from the Veterans History Project, Image by Anno.Ai

What’s Next

We’re planning to post more model evaluations soon, so please keep an eye out for our upcoming posts on Automatic License Plate Recognition (ALPR) and machine translation models!

About the Author

Joe Sherman is a Principal Data Scientist at Anno.Ai, where he leads several efforts focused on operational applications of advanced machine learning. Joe has a passion for developing usable and actionable machine learning models and techniques and bringing the cutting edge to operational users. Prior to his work at Anno, Joe led the research and development of a number of machine learning applications for the Department of Defense, focusing on deploying analytics at scale. Joe holds a BS in Chemistry from Virginia Tech.

Be sure to follow us on Twitter and LinkedIn!


