[Part 1] Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?

A benchmarking comparison of models from Google, Azure, and AWS, as well as open source models (Tesseract, SimpleHTR, Kraken, OrigamiNet, tf2-crnn, and CTC Word Beam Search)

Sep 1, 2020


The Anno.Ai data science team regularly produces evaluations of AI/ML model capabilities across both commercial vendors and open source libraries. We do this to help scope the best approach(es) for specific use cases and deployment environments, identify options for ensemble model development, and support integration of best-of-breed models into our core software application.

One use case we have worked on recently is the analysis of handwritten documents. Many of our customers have large bodies of unstructured data, often of undetermined size, that include scanned or photographed handwritten documents: for example, old land deeds and signature blocks that need to be triaged and analyzed as part of the genealogical research process.

Joseph Hawley correspondence and documents

To identify the best approaches for handwriting recognition, we evaluated APIs from three of the major cloud providers (Google, AWS, and Microsoft) and six open source models for performance in processing handwritten text. In the first of this two-part series, we’ll give an overview of the models we tested and the evaluation framework we used.

Overview of Handwritten Text Recognition

Optical character recognition (OCR) involves the conversion of typed or printed text (for example, from a document, a photo of a document, or a natural scene photo) into machine-encoded text. While OCR is generally considered a well-solved problem, Handwritten Text Recognition (HTR), a subset of OCR specifically focused on recognizing handwriting, is more challenging due to the variation in handwriting styles between different people.

Handwriting recognition… from paper to computer featuring the cats of Anno.ai

In this analysis, we focused on offline HTR techniques, which process handwritten text after it has been written, as opposed to online techniques, which process text while someone is writing, for example on a tablet or digital whiteboard.

Commercial Cloud APIs

We evaluated four APIs from Google, AWS, and Microsoft Azure:

  • Google Cloud Vision: The GCP Vision API supports handwriting detection with OCR in a wide variety of languages and can detect multiple languages within a single image. 60 languages are supported, 36 additional languages are under active development, and 133 are mapped to another language code or character set.
  • AWS Textract: The Textract API focuses on text extraction from documents, tables, and forms. It supports Latin-script characters and ASCII symbols.
  • AWS Rekognition: The Rekognition API focuses on image and video analysis, but can also extract text from images and videos.
  • Microsoft Azure Computer Vision Read: The Read API extracts handwritten text in English, and supports detecting both printed and handwritten text in the same image or document.

Open Source Models

We also researched and reviewed open source HTR models and related repositories. We initially installed six open source models and down-selected them based on the following criteria:

  • Ease of installation
  • Support across different operating systems
  • Wide adoption and ongoing support/development
  • Licensing
  • Compute requirements (for both inference and training)
  • Retraining options and process

Based on this assessment, we decided to include Tesseract and SimpleHTR in the quantitative evaluation along with the cloud provider APIs.

NIST Handwritten Forms and Characters Database

Evaluation Dataset

We used the NIST Handwritten Forms and Characters Database (NIST Special Database 19) as our evaluation dataset. The database includes handwriting sample forms from 3,600 writers, including number and letter samples. For a subset of the forms, contributors also filled out a text box with the preamble to the U.S. Constitution.

For our evaluation, we used only the cropped section of the form containing the preamble to the U.S. Constitution (2,100 samples), and we compared every sample against the same ground truth transcription of the preamble. It is worth noting that the samples vary slightly in how people wrote out the paragraph, including missing words and punctuation, corrections, and mixed use of “insure” vs. “ensure”. While we normalized both the model output and ground truth texts during the evaluation (including removing punctuation), the missing words may have had a very slight impact on our average evaluation numbers.
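
As a rough illustration of scoring every sample against one shared ground truth and then averaging, here is a minimal sketch. The function name `average_score`, the sample IDs, and the toy `exact_match` metric are hypothetical and are not part of our evaluation code; any metric that takes a hypothesis and a reference could be plugged in instead.

```python
from statistics import mean
from typing import Callable, Dict, Tuple


def average_score(
    outputs: Dict[str, str],
    reference: str,
    metric: Callable[[str, str], float],
) -> Tuple[float, Dict[str, float]]:
    """Score each sample against the same reference; return (mean, per-sample scores)."""
    per_sample = {
        sample_id: metric(text, reference) for sample_id, text in outputs.items()
    }
    return mean(list(per_sample.values())), per_sample


def exact_match(hypothesis: str, reference: str) -> float:
    """Toy metric used only for this illustration: 1.0 on an exact match, else 0.0."""
    return 1.0 if hypothesis == reference else 0.0


# Hypothetical sample IDs and recognizer outputs.
outputs = {"f0001": "we the people", "f0002": "we the peple"}
avg, scores = average_score(outputs, "we the people", exact_match)
print(avg, scores)
```

Because the reference is fixed, a writer who skipped a word is penalized in the same way as a recognizer that missed one, which is why the variations noted above can nudge the averages.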

Evaluation Metrics

We used the following metrics for evaluation. Before calculating the metrics, we normalized both the model output and ground truth texts by removing punctuation, numbers, and leading/trailing white space.
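
The normalization step could look roughly like the sketch below. This is not our exact helper: collapsing internal runs of white space and the optional `lowercase` flag are assumptions added for illustration, and only ASCII punctuation and digits are handled.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def normalize(text: str, lowercase: bool = False) -> str:
    """Remove punctuation and digits, then trim and collapse white space.

    Collapsing internal white space and the `lowercase` option are
    assumptions for this sketch, not steps confirmed by the article.
    """
    cleaned = text.translate(_DROP)
    if lowercase:
        cleaned = cleaned.lower()
    # split() drops leading/trailing white space and collapses internal runs.
    return " ".join(cleaned.split())


print(normalize("  We the People, in 1787!  "))
```

Collapsing white space matters here because deleting a number or a punctuation mark can leave two adjacent spaces, which would otherwise count as a character error.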

Character Error Rate (CER): CER is a commonly used metric for OCR evaluation, and reflects how well the OCR/HTR output matches the ground truth text on a per-character basis (a lower CER score indicates a better match). We used the Python asrtoolkit library to calculate CER.

import asrtoolkit

# Calculate Character Error Rate (CER)
def cer(hypothesis, reference):
    # normalize() is the text normalization helper described above (not shown here)
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return asrtoolkit.cer(reference, hypothesis)
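
To make the metric concrete: CER is conventionally defined as (S + D + I) / N, where S, D, and I are the character substitutions, deletions, and insertions needed to turn the output into the ground truth, and N is the number of ground truth characters. The sketch below computes it from scratch with a Levenshtein distance. It is an illustration only, not the asrtoolkit implementation; the names `edit_distance` and `char_error_rate` are hypothetical, and libraries differ on whether they report a fraction or a percentage.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER as a fraction of the reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of six reference characters.
print(char_error_rate("insure", "ensure"))
```

Note that CER can exceed 1.0 when the output contains many more characters than the ground truth.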

Match Score: Match score reflects the overall similarity of the OCR output to the ground truth text based on edit distance (a higher match score indicates a better match). We used the Python fuzzywuzzy library to calculate the match score. In our evaluation, we found that this metric provides an intuitive estimate for the overall readability of the output text.

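from fuzzywuzzy import fuzz

# Calculate fuzzy string match score using edit distance
def fuzzy_score(hypothesis, reference):
    # normalize() is the text normalization helper described above (not shown here)
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return fuzz.ratio(hypothesis, reference)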
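
For readers without fuzzywuzzy installed, a similar 0 to 100 score can be approximated with the standard library, as sketched below. `fuzz.ratio` returns an integer from 0 to 100; the `match_score` function here is a hypothetical stand-in built on `difflib.SequenceMatcher`, and its values may differ slightly from fuzzywuzzy's depending on which matching backend that library is using.

```python
from difflib import SequenceMatcher


def match_score(hypothesis: str, reference: str) -> int:
    """Similarity on a 0-100 scale, rounded to the nearest integer."""
    ratio = SequenceMatcher(None, hypothesis, reference).ratio()
    return int(round(100 * ratio))


print(match_score("insure", "ensure"))  # 83: five of six characters line up
print(match_score("we the people", "we the people"))  # 100: identical strings
```

Unlike CER, this score is bounded and symmetric in its two arguments, which is part of why it reads as an intuitive overall similarity.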

What’s Next

In the second part of this two-part series, we’ll look at the results from our evaluation and discuss the best HTR models to use for different use cases.

About the Authors

Joe Sherman is a Principal Data Scientist at Anno.ai, where he leads several efforts focused on operational applications of advanced machine learning. Joe has a passion for developing usable and actionable machine learning models and techniques and bringing the cutting edge to operational users. Prior to his work at Anno, Joe led the research and development of a number of machine learning applications for the Department of Defense, focusing on deploying analytics at scale. Joe holds a BS in Chemistry from Virginia Tech.

Ashley Antonides is Anno.Ai’s Chief Artificial Intelligence Officer. Ashley has over 20 years of experience leading the research and development of machine learning and computer vision applications in the national security, public health, and commercial sectors. Ashley holds a B.S. in Symbolic Systems from Stanford University and a Ph.D. in Environmental Science, Policy, and Management from the University of California, Berkeley.



