OCR – Tesseract

by James McDonald | Aug 25, 2008 | IT Tips, Linux Tools, Open Source Apps

Just made a spectacularly unsuccessful attempt to use Tesseract OCR. Here is a sample:

"|I11;XII;"1n -¤:2;;2:¤ LIEEEEEEE.-::2;;;: nz ’‘·* *--2.. ::2;;: "=I;;;. :1:;*-- ::2;£2|a:‘¤I;;;.XXi.X ·¤:;1;;¤· X;|:;t1X '!EEEXI|· XXXIIEX--EXIIXX-.:21:;. -·¤a.;¤¤a..a·· u:11XIIXX·¤:ii1:¤. uszizez; 1||:;‘tz|XX u:;;;se 11|:t’·· .::22z¤nX;1|:;‘tt|;X .I:11XEIXX ;|g;g;;;;* {ii, *·¤a..¤·* u;;;;:: ;n·":nX "XEE" ¤l§X,X..s**' XXXII .,,, ¤|ZXXXZ|l .. X"‘i..i"*i..?"`XXII;"tnX.:z;;2||.i‘|i;1. -¤:;;;!: ::222u.X

That was before I Googled and found a link to a helpful HowtoForge page.

Once I followed that, Tesseract spat out surprisingly accurate text _and_ punctuation. I didn't use the suggested ImageMagick convert tool, though, because contrary to the howto, GIMP v2.4.6 produced a usable TIFF just fine. One thing I did notice: I had to use Image ==> Flatten Image to get rid of the alpha channel before the save-as-TIFF option would work.
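For anyone who would rather script this than click through GIMP, here is a rough Python sketch of the same two steps: flatten the image to a TIFF without an alpha channel, then hand it to Tesseract. This is not what I actually ran. The function names `build_commands` and `ocr` are made up for illustration, the ImageMagick flags are my assumption of a reasonable way to drop the alpha channel (the howto's exact command may differ), and it assumes both `convert` and `tesseract` are on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(image, lang="eng"):
    """Return the (convert, tesseract) command lists for one scanned image.

    Hypothetical helper: the flag choices below are assumptions, not the
    commands from the howto.
    """
    src = Path(image)
    tiff = src.with_suffix(".tif")
    out_base = src.with_suffix("")  # tesseract appends ".txt" itself
    convert_cmd = [
        "convert", str(src),
        "-background", "white", "-flatten",  # composite layers onto white
        "-alpha", "off",                      # make sure no alpha channel is left
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return convert_cmd, tesseract_cmd


def ocr(image, lang="eng"):
    """Run both steps and return the recognised text."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    convert_cmd, tesseract_cmd = build_commands(image, lang)
    subprocess.run(convert_cmd, check=True)    # writes e.g. scan.tif
    subprocess.run(tesseract_cmd, check=True)  # writes e.g. scan.txt
    return Path(image).with_suffix(".txt").read_text(encoding="utf-8")
```

Calling `ocr("scan.png")` would leave `scan.tif` and `scan.txt` next to the original. Newer Tesseract releases may well read other formats directly, in which case the convert step only matters for getting rid of the alpha channel.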

Sometimes it's not only the tools but also the technique.
