[Part 1] Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?

A benchmark comparison of models from Google, Azure, and AWS, as well as open source models (Tesseract, SimpleHTR, Kraken, OrigamiNet, tf2-crnn, and CTC Word Beam Search)


A use case we have worked on recently is the analysis of handwritten documents. Many of our customers have large bodies of unstructured data, of undetermined size, that include scanned or photographed handwritten documents: for example, old land deeds and signature blocks that need to be triaged and analyzed as part of the genealogical research process.

Joseph Hawley correspondence and documents

To identify the best approaches for handwriting recognition, we evaluated APIs from three of the major cloud providers (Google, AWS, and Microsoft) and six open source models for performance in processing handwritten text. In the first of this two-part series, we’ll give an overview of the models we tested and the evaluation framework we used.

Overview of Handwritten Text Recognition

Handwriting recognition… from paper to computer featuring the cats of Anno.ai

In this analysis, we focused on offline handwritten text recognition (HTR) techniques, which process handwritten text after it has been written, as opposed to online techniques, which process text while someone is writing (for example, on a tablet or digital whiteboard).

Commercial Cloud APIs

  • Google Cloud Vision: The GCP Vision API supports handwriting detection with OCR in a wide variety of languages and can detect multiple languages within a single image. 60 languages are supported, 36 additional languages are under active development, and 133 are mapped to another language code or character set.
  • AWS Textract: The Textract API focuses on text extraction from documents, tables, and forms. It supports Latin-script characters and ASCII symbols.
  • AWS Rekognition: The Rekognition API focuses on image and video analysis, but can also extract text from images and videos.
  • Microsoft Azure Computer Vision Read: The Read API extracts handwritten text in English, and supports detecting both printed and handwritten text in the same image or document.
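
As a rough illustration of what calling one of these APIs looks like, here is a minimal sketch for AWS Textract using boto3. It is not the code we used in the evaluation: the function names are hypothetical, and it assumes AWS credentials and a default region are already configured. It sends the image bytes to the DetectDocumentText operation and joins the detected LINE blocks into one string.

```python
def lines_from_textract(response):
    """Collect the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def transcribe_with_textract(image_path):
    """Send one image to Textract and return the detected text as one string.

    Hypothetical helper: assumes boto3 is installed and AWS credentials
    and a region are configured in the environment.
    """
    import boto3  # imported here so the parsing helper works without boto3

    client = boto3.client("textract")
    with open(image_path, "rb") as image_file:
        response = client.detect_document_text(
            Document={"Bytes": image_file.read()}
        )
    return " ".join(lines_from_textract(response))
```

Keeping the response parsing separate from the network call makes it possible to check the parsing logic against a saved response without calling the service.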

Open Source Models

We assessed the open source models against the following criteria:

  • Ease of installation
  • Support across different operating systems
  • Wide adoption and ongoing support/development
  • Licensing
  • Compute requirements (for both inference and training)
  • Retraining options and process
Created by Anno.Ai

Based on this assessment, we decided to include Tesseract and SimpleHTR in the quantitative evaluation along with the cloud provider APIs.

NIST Handwritten Forms and Characters Database

Evaluation Dataset

NIST Handwritten Forms and Characters Database

For our evaluation, we used only the cropped section of the form containing the preamble to the U.S. Constitution (2,100 samples), and we compared every sample against the same ground truth transcription of the preamble. It is worth noting that the samples vary slightly in how people wrote out the paragraph, including missing words and punctuation, corrections, and mixed use of “insure” vs. “ensure”. While we normalized both the input and ground truth texts during the evaluation (including removing punctuation), the missing words may have had a slight impact on our average evaluation numbers.
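
The exact normalization steps are not spelled out here, so the following is only a sketch of what such a step might look like. It assumes lowercasing, removal of ASCII punctuation, and whitespace collapsing; only the punctuation removal is confirmed by the description above. The function name is illustrative.

```python
import re
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace.

    Illustrative only: lowercasing and whitespace collapsing are assumptions,
    and non-ASCII punctuation such as curly quotes is left untouched.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


sample = "We, the People of the United States,\n  in Order to form a more perfect Union"
print(normalize(sample))
```

Applying the same function to both the model output and the ground truth keeps differences in case, punctuation, and line breaks from being counted as recognition errors.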

NIST Handwritten Forms and Characters Database

Evaluation Metrics

Character Error Rate (CER): CER is a commonly used metric for OCR evaluation. It reflects how well the OCR/HTR output matches the ground truth text on a per-character basis (a lower CER indicates a better match). We used the Python asrtoolkit library to calculate CER.

# Calculate Character Error Rate (CER)
import asrtoolkit

def cer(hypothesis, reference):
    # normalize() is our text normalization helper (not shown here)
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return asrtoolkit.cer(reference, hypothesis)
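
To make the metric concrete, here is a standard-library sketch of how a character error rate can be computed: the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length. This is not asrtoolkit's implementation, the function names are illustrative, and a library may scale the result differently (for example, as a percentage rather than a fraction).

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by reference length, returned as a fraction."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(char_error_rate("we the people", "we the peple"))
```

Because the distance is divided by the reference length, a hypothesis with many spurious insertions can score above 1.0.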

Match Score: Match score reflects the overall similarity of the OCR output to the ground truth text based on edit distance (a higher match score indicates a better match). We used the Python fuzzywuzzy library to calculate the match score. In our evaluation, we found that this metric provides an intuitive estimate of the overall readability of the output text.

# Calculate fuzzy string match score using edit distance
from fuzzywuzzy import fuzz

def fuzzy_score(hypothesis, reference):
    # normalize() is our text normalization helper (not shown here)
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return fuzz.ratio(hypothesis, reference)
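
For intuition about the 0 to 100 scale, a similar score can be sketched with Python's standard difflib module. This is a stand-in, not fuzzywuzzy's code, and it is not guaranteed to return the same numbers as fuzz.ratio. The function name is illustrative.

```python
from difflib import SequenceMatcher


def match_score(hypothesis, reference):
    """Similarity on a 0-100 scale: 2 * matched characters / total characters.

    Illustrative stand-in for a fuzzy match score. autojunk is disabled so
    that long texts are compared character by character, without difflib's
    heuristic for skipping very frequent characters.
    """
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    return round(100 * matcher.ratio())


print(match_score("we the peple", "we the people"))
```

A single dropped letter in a short phrase still scores in the high 90s, which is part of why a score like this tracks how readable the output is.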

What’s Next






