The Way of the Software Engineer

Optical Character Recognition is one of those technologies that has been around for a long time and has never quite met user expectations. It is a common AI application, but rather than write my own from scratch I thought I’d see what’s publicly available. The primary options I found were PHPOCR, GOCR, and Tesseract. PHPOCR is a system written by a developer in Ukraine as a platform for further OCR research. The examples were very easy to get working, but it’s not the quickest solution for my project. GOCR is generally considered the front-runner. It installs easily via MacPorts on OS X and runs quickly on the command line. I also downloaded Tesseract, but given GOCR’s ease of use and prevalence in the open source community, I thought I’d try GOCR first.

A friend of mine made a really cool bookmarklet that lets the user select a bounding box on an image, then sends an AJAX call to his server, which downloads the image, crops it to the bounding box, and pipes it through GOCR (GNU Optical Character Recognition). The result is then dropped into a div positioned exactly over the originally selected bounding box, approximating the text size and color. The goal is to make it possible to copy and paste text out of an image. It works quite well, and its clean interface makes it a beautiful thing to watch.
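The crop-and-recognize step on the server can be approximated in a few lines of shell. This is a rough sketch rather than his actual code: the function names and argument order are made up for illustration, and it assumes ImageMagick’s convert and gocr are both on the PATH.

```shell
#!/bin/bash
# Hypothetical helper: build an ImageMagick crop geometry (WxH+X+Y)
# from a bounding box given as width, height, x offset, y offset.
crop_geometry() {
    printf '%dx%d+%d+%d\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: crop an image to the bounding box and pipe it to GOCR.
# Usage: ocr_region image width height x y
ocr_region() {
    local image=$1
    local geometry
    geometry=$(crop_geometry "$2" "$3" "$4" "$5")
    # +repage discards the canvas offset left over from the crop;
    # pnm:- writes the cropped image to stdout as PNM, which GOCR reads
    # from stdin when given "-" as the file name.
    convert "$image" -crop "$geometry" +repage pnm:- | gocr -
}
```

Calling ocr_region comic.gif 200 50 10 20 would hand GOCR only the 200 by 50 pixel region whose top-left corner sits at (10, 20), which is the part the user selected.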

Given the complete failure by the web development community to accurately populate image alt attributes, I thought it would be slick to grab all the images on a page, extract any embedded text, and automatically populate the alt text in much the same way as my friend’s bookmarklet draws a div. I ran a few tests by piping web comics through GOCR, with horrible results. Accuracy couldn’t have been above 10%. I told my friend about my results, hoping he could give me some insight into why GOCR was failing me so badly. As I probably should have expected, the simple bounding box he uses is really important. GOCR doesn’t have any layout detection, so it does poorly when handed a whole image instead of just the region that contains text.

Google sponsors a big open source OCR package called Tesseract, which was originally developed at HP between 1985 and 1994 and then shelved until HP released it as open source in 2005. Google is (I believe) using it to scan books and make them available on the net, one of its many efforts to make all the world’s information available. I was hoping its touted increased accuracy would help overcome the lack of a bounding box in my application. It compiles and installs without errors and runs just fine on the test images provided, but producing images it can read is a challenge. I tried saving GIF files as TIFF from Pixelmator, and Tesseract gave me errors. I tried ImageMagick and a little bash script I found, with the same results. Tesseract complains about minute differences in TIFF header information (datetime format, bpp info, etc.), so some care is needed. I’m not sure whether this has something to do with the version of libtiff (v3.6.1) that Tesseract is using, or whether there’s some parsing code in Tesseract that isn’t happy.

I finally did manage to get things to work by creating a very basic bash script and using the simplest settings for ImageMagick. The goal was to have the bash script work the same way GOCR’s command line utility works.

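#!/bin/bash
# Usage: pass the image file as the first argument; recognized text goes to stdout.

tmpid=$$
# Convert the input to an uncompressed TIFF that Tesseract will accept.
convert -compress none "$1" "/tmp/img.${tmpid}.tif"

# Tesseract appends .txt to the output base name.
tesseract "/tmp/img.${tmpid}.tif" "/tmp/tout.${tmpid}" 2> /dev/null
# Strip stray pipe characters and ^~R sequences from the recognized text.
perl -pe 's/\|//g; s/\^~R//g' "/tmp/tout.${tmpid}.txt"
rm /tmp/tout.${tmpid}.*
rm "/tmp/img.${tmpid}.tif"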

Now that I have two tools whose interfaces are the same, I can write a wrapper around them to use in PHP and compare their performance. I’ll handle that in another post.
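In the meantime, a quick shell harness gives a rough idea of what such a comparison might look like. This is only a sketch under a couple of assumptions: compare_ocr is a name I made up, every tool is assumed to take an image path as its only argument and print text to stdout, and date +%s only resolves whole seconds, so the timing is coarse.

```shell
#!/bin/bash
# Hypothetical harness: run each OCR command on the same image and
# report how long it took, followed by the text it produced.
# Usage: compare_ocr image command [command ...]
compare_ocr() {
    local image=$1
    local tool start end output
    shift
    for tool in "$@"; do
        start=$(date +%s)
        # Discard diagnostics so only the recognized text is captured.
        output=$("$tool" "$image" 2> /dev/null)
        end=$(date +%s)
        printf '== %s (%ds) ==\n%s\n' "$tool" "$((end - start))" "$output"
    done
}
```

With the Tesseract script above saved under a hypothetical name such as tess.sh, something like compare_ocr page.pnm gocr ./tess.sh would print each tool’s output under its own header.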
