[Part 1] Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?
The Anno.Ai data science team regularly produces evaluations of AI/ML model capabilities across both commercial vendors and open source libraries. We do this to help scope the best approach(es) for specific use cases and deployment environments, identify options for ensemble model development, and support integration of best-of-breed models into our core software application.
A use case that we have worked on recently is the analysis of handwritten documents. Many of our customers have large bodies of unstructured data that include scanned or photographed handwritten documents — for example, old land deeds and signature blocks that need to be triaged and analyzed as part of genealogical research.
To identify the best approaches for handwriting recognition, we evaluated APIs from three of the major cloud providers (Google, AWS, and Microsoft) and six open source models for performance in processing handwritten text. In the first of this two-part series, we’ll give an overview of the models we tested and the evaluation framework we used.
Overview of Handwritten Text Recognition
Optical character recognition (OCR) involves the conversion of typed or printed text (for example, from a document, a photo of a document, or a natural scene photo) into machine-encoded text. While OCR is generally considered a well-solved problem, Handwritten Text Recognition (HTR), a subset of OCR specifically focused on recognizing handwriting, is more challenging due to the variation in handwriting styles between different people.
In this analysis, we focused on offline HTR techniques for processing handwritten texts after they have been written, vs. online techniques that actively process text while someone is writing, for example on a tablet or digital whiteboard.
Commercial Cloud APIs
We evaluated four APIs from Google, AWS, and Microsoft Azure:
- Google Cloud Vision: The GCP Vision API supports handwriting detection with OCR in a wide variety of languages and can detect multiple languages within a single image. 60 languages are supported, 36 additional languages are under active development, and 133 are mapped to another language code or character set.
- AWS Textract: The Textract API focuses on text extraction from documents, tables, and forms. It supports Latin-script characters and ASCII symbols.
- AWS Rekognition: The Rekognition API focuses on image and video analysis, but can also extract text from images and videos.
- Microsoft Azure Computer Vision Read: The Read API extracts handwritten text in English, and supports detecting both printed and handwritten text in the same image or document.
Open Source Models
We also researched and reviewed open source HTR models and related repositories. We initially installed six open source models, then down-selected from that set based on the following criteria:
- Ease of installation
- Support across different operating systems
- Wide adoption and ongoing support/development
- Compute requirements (for both inference and training)
- Retraining options and process
Based on this assessment, we decided to include Tesseract and SimpleHTR in the quantitative evaluation along with the cloud provider APIs.
Evaluation Dataset
We used the NIST Handwritten Forms and Characters Database (NIST Special Database 19) as our evaluation dataset. The database includes handwriting sample forms from 3,600 writers, including number and letter samples. For a subset of the forms, contributors also filled out a text box with the preamble to the U.S. Constitution.
For our evaluation, we used only the cropped section of each form containing the preamble to the U.S. Constitution (2,100 samples), and we compared all the samples against the same ground truth transcription of the preamble. It is worth noting that these samples vary slightly in how people wrote out the paragraph, including missing words and punctuation, corrections, and mixed use of “insure” vs. “ensure”. While we normalized both the input and ground truth texts during the evaluation (including removing punctuation), the missing words may have had a very slight impact on our average evaluation numbers.
Evaluation Metrics
We used the following metrics for evaluation. We normalized both the input and ground truth texts before calculating the metrics, including removing punctuation, numbers, and leading/trailing white space.
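As a concrete illustration, a normalization step like the one described could be sketched as follows (the helper name `normalize` and the exact cleanup rules here are assumptions for illustration, not our production code):

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    # Remove ASCII punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Collapse runs of whitespace and trim leading/trailing space
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize("We the People, in 1787!")` yields `"we the people in"`.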
Character Error Rate (CER): CER is a commonly used metric for OCR evaluation, and reflects how well the OCR/HTR output matches the ground truth text on a per-character basis (a lower CER score indicates a better match). We used the Python asrtoolkit library to calculate CER.
```python
import asrtoolkit

# Calculate Character Error Rate (CER)
def cer(hypothesis, reference):
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return asrtoolkit.cer(reference, hypothesis)
```
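For intuition, CER is the character-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch, expressed as a percentage (this is an illustration, not the asrtoolkit implementation):

```python
def char_error_rate(hypothesis, reference):
    """Character-level edit distance divided by reference length, as a percentage."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between the first i-1 reference chars
    # and the first j hypothesis chars
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 100 * prev[n] / max(m, 1)
```

For example, `char_error_rate("sitten", "kitten")` is one substitution over six reference characters, i.e. about 16.7%.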
Match Score: Match score reflects the overall similarity of the OCR output to the ground truth text based on edit distance (a higher match score indicates a better match). We used the Python fuzzywuzzy library to calculate the match score. In our evaluation, we found that this metric provides an intuitive estimate for the overall readability of the output text.
```python
from fuzzywuzzy import fuzz

# Calculate fuzzy string match score using edit distance
def fuzzy_score(hypothesis, reference):
    hypothesis = normalize(hypothesis)
    reference = normalize(reference)
    return fuzz.ratio(hypothesis, reference)
```
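For reference, `fuzz.ratio` is essentially the standard library's `difflib.SequenceMatcher` similarity scaled to 0–100 (fuzzywuzzy uses python-Levenshtein when available, so results can differ slightly). A dependency-free approximation:

```python
from difflib import SequenceMatcher


def match_score(hypothesis, reference):
    # Similarity ratio in [0, 1], scaled to a 0-100 match score
    return round(100 * SequenceMatcher(None, hypothesis, reference).ratio())
```

Identical strings score 100, and completely disjoint strings score 0.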
In the second part of this two-part series, we’ll look at the results from our evaluation and discuss the best HTR models to use for different use cases.