OCR – Tesseract

Written by James McDonald

August 25, 2008

Just made a spectacularly unsuccessful attempt to use Tesseract OCR. Here is a sample:

"|I11;XII;"1n -¤:2;;2:¤ LIEEEEEEE.-::2;;;: nz ’‘·* *--2..
::2;;: "=I;;;. :1:;*-- ::2;£2|a:‘¤I;;;.XXi.X ·¤:;1;;¤· X;|:;t1X
'!EEEXI|· XXXIIEX--EXIIXX-.:21:;. -·¤a.;¤¤a..a·· u:11XIIXX·¤:ii1:¤. uszizez;
1||:;‘tz|XX u:;;;se 11|:t’·· .::22z¤nX;1|:;‘tt|;X .I:11XEIXX
;|g;g;;;;* {ii, *·¤a..¤·* u;;;;:: ;n·":nX "XEE" ¤l§X,X..s**'
XXXII .,,, ¤|ZXXXZ|l .. X"‘i..i"*i..?"`XXII;"tnX.:z;;2||.i‘|i;1. -¤:;;;!: ::222u.X

That was before I Googled and found a helpful HowtoForge page.

Once I followed that, Tesseract spat out surprisingly accurate text _and_ punctuation. I didn’t use the suggested ImageMagick convert tool, though, because contrary to the howto, GIMP v2.4.6 spat out a usable TIFF just fine. One thing I did notice was that I had to use Image > Flatten Image to get rid of the alpha channel before the save as TIFF option would work.
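For anyone who would rather script it than click through GIMP, here is a rough Python sketch of what flattening amounts to: blending each RGBA pixel onto an opaque background so no alpha channel is left. The white background, the function names and the file names are my own assumptions for illustration, not anything from the howto. The only Tesseract-specific part is the basic `tesseract <image> <outputbase>` call, which writes its text to `<outputbase>.txt`.

```python
import subprocess


def flatten_pixel(rgba, background=(255, 255, 255)):
    """Blend one 8-bit RGBA pixel onto an opaque background, dropping alpha."""
    r, g, b, a = rgba
    # Standard "over" blend in integer arithmetic, rounded to nearest.
    return tuple(
        (c * a + bg * (255 - a) + 127) // 255
        for c, bg in zip((r, g, b), background)
    )


def flatten_image(pixels, background=(255, 255, 255)):
    """Flatten a list of rows of RGBA tuples into rows of RGB tuples."""
    return [[flatten_pixel(p, background) for p in row] for row in pixels]


def tesseract_command(image_path, output_base):
    """Build the basic command line; Tesseract writes <output_base>.txt."""
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run Tesseract on an already flattened image and return the text.

    Assumes the tesseract binary is installed and on the PATH.
    """
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Something like `run_ocr("scan.tif", "scan")` would then leave the recognised text in `scan.txt`, assuming `scan.tif` is a flattened TIFF that your Tesseract version can read.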

Sometimes it’s not only the tools but also the technique.


