[Part 2] Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?
As we discussed in our previous post, Handwritten Text Recognition (HTR) is the conversion of handwritten text into machine-encoded text, a task made challenging by the variation in handwriting styles across individuals. After an initial model review and down-selection, we evaluated APIs from three major cloud providers (Google, AWS, and Microsoft) and two open source models (Tesseract and SimpleHTR) on handwritten text. In this post, we share the results of our evaluation and discuss which HTR models are best suited to different use cases.
HTR Evaluation Results
The Google Cloud Vision API was the clear top performer, with an average character error rate (CER) of 9.0% and an average match rate of 90.44%. We’ve also included examples below to visualize the segmentation and detection outputs of the different models when run on the same image (f0035_19_crop.png).
Google Cloud Vision API example
AWS Textract API example
AWS Rekognition API example
Microsoft Azure Read API example
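For reference, CER is typically computed as the Levenshtein (edit) distance between a model’s transcription and the ground-truth text, normalized by the reference length. Here is a minimal sketch in Python; the function names are our own illustrative choices, not part of any of the evaluated APIs:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage of the reference length."""
    return 100.0 * levenshtein(reference, hypothesis) / max(len(reference), 1)
```

A lower CER is better; a one-character slip in a five-character word, for example, yields a CER of 20%.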
We found that the Google Cloud Vision API offers the best performance and flexibility for recognizing handwritten text. Thanks to its broad language support, the Google API also handles non-Latin scripts, languages other than English, and mixed-language text most gracefully. The AWS Textract and Rekognition APIs also performed relatively well on the NIST dataset, although we found that performance dropped off in separate testing on datasets with non-Latin-script characters.
For open source models and projects requiring on-premise deployment, Tesseract works well, although it requires some image preprocessing for optimal performance. Tesseract can detect the script a text is written in, but unlike the Google API it requires either an additional language identification (LID) integration or that the user specify which language model(s) to use. Both Tesseract and SimpleHTR can be retrained on additional handwriting data (for Tesseract, see the tesstrain repo), which is useful for custom datasets on which the out-of-the-box models may underperform. To develop robust, generalizable models, however, both require a large variety of handwritten samples along with ‘ground truth’ transcriptions.
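As a sketch of the kind of image preprocessing that helps Tesseract, the snippet below grayscales, auto-contrasts, and binarizes an image with Pillow before OCR. The function name and threshold value are our own illustrative choices, and the pytesseract call in the comment assumes the Tesseract binary is installed locally:

```python
from PIL import Image, ImageOps

def preprocess_for_tesseract(img: Image.Image, threshold: int = 128) -> Image.Image:
    """Grayscale, auto-contrast, and binarize an image ahead of OCR.

    This uses simple global thresholding; adaptive thresholding (e.g. via
    OpenCV) often works better on pages with uneven lighting.
    """
    gray = ImageOps.autocontrast(img.convert("L"))
    # Binarize: pixels above the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)

# Usage with pytesseract (assumes the Tesseract binary is installed):
# import pytesseract
# text = pytesseract.image_to_string(
#     preprocess_for_tesseract(Image.open("f0035_19_crop.png")),
#     lang="eng",  # or e.g. "eng+fra" to combine language models
# )
```

Specifying `lang` explicitly sidesteps the language-identification gap noted above when you already know which scripts appear in your documents.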
We’re planning to post more model evaluations soon, so please keep an eye out for our next posts on Automatic License Plate Recognition (ALPR) and Named Entity Recognition (NER) services and models!
About the Authors
Joe Sherman is a Principal Data Scientist at Anno.Ai, where he leads several efforts focused on operational applications of advanced machine learning. Joe has a passion for developing usable and actionable machine learning models and techniques and bringing the cutting edge to operational users. Prior to his work at Anno, Joe led the research and development of a number of machine learning applications for the Department of Defense, focusing on deploying analytics at scale. Joe holds a BS in Chemistry from Virginia Tech.
Ashley Antonides is Anno.Ai’s Chief Artificial Intelligence Officer. Ashley has over 20 years of experience leading the research and development of machine learning and computer vision applications in the national security, public health, and commercial sectors. Ashley holds a B.S. in Symbolic Systems from Stanford University and a Ph.D. in Environmental Science, Policy, and Management from the University of California, Berkeley.