Linux OCR

by James McDonald | Jul 17, 2014 | IT Tips

These are very brief notes to capture what I've been trying to achieve this evening with Linux OCR.

WatchOCR

I've downloaded and installed the WatchOCR deb package from http://sourceforge.net/projects/watchocr/

To get it running on Fedora 20 I had to use alien to convert it to an RPM and then install as root with rpm -i --force watchocr-0.7.2-2.noarch.rpm

Once I created an output directory (mkdir /home/jm/scan-out) and ran watchocr -i /home/jm/scans -o /home/jm/scan-out, I found that it worked, but:

didn't always accurately lay out the output text on the searchable PDF (I was searching for "Invoice" in the following image)

cuneiform (a program called by watchocr) fell over with errors, and the watchocr script removed some PDFs from the input directory without producing a corresponding searchable PDF.

Also, when my scanner was dropping a PDF in the "in" folder via SMB, watchocr would pick it up before the PDF had been fully written to the folder.

Scans with text rotated 90 degrees from what was expected didn't get a searchable text layer, so no searching...

The WatchOCR deb package appears to be from 2011
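
One possible workaround for the partially written PDF problem noted above is to wait until a file's size stops changing before handing it to the OCR step. The following is only a minimal sketch and is not part of watchocr: the function name wait_until_stable, the 2 second interval and the 30 check limit are all assumptions made for illustration.

```shell
# Hypothetical helper (not part of watchocr): succeed once FILE is
# non-empty and its size is unchanged between two consecutive checks.
# Usage: wait_until_stable FILE [INTERVAL_SECONDS] [MAX_CHECKS]
wait_until_stable() {
    local file="$1"
    local interval="${2:-2}"
    local max_checks="${3:-30}"
    local prev=-1 size i=0
    while [ "$i" -lt "$max_checks" ]; do
        [ -f "$file" ] || return 1          # file vanished or never existed
        size=$(wc -c < "$file" | tr -d '[:space:]')
        if [ "$size" -gt 0 ] && [ "$size" -eq "$prev" ]; then
            return 0                        # same size twice in a row
        fi
        prev=$size
        sleep "$interval"
        i=$((i + 1))
    done
    return 1                                # gave up: still changing or empty
}
```

It could be used as wait_until_stable /home/jm/scans/scan.pdf && echo "ready for OCR". A size check is only a heuristic: a scanner that pauses mid-upload for longer than the interval would still be picked up too early.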

Another Option

I am also looking at http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ which seems to have an intelligent and comprehensive commentary plus a sample script to work from.

I ran some of the commands from the scan-archive.sh script and found that I was missing the tesseract-osd package.
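
Since a missing piece like this only shows up when a command fails, a script that chains several tools might check its dependencies up front. This is a rough sketch under a few assumptions: the function names missing_cmds and has_tesseract_osd are made up here, and the OSD check relies on tesseract --list-langs reporting an "osd" entry when the orientation data is installed.

```shell
# Print the name of every command from the argument list that is not on PATH.
missing_cmds() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Succeed if tesseract reports "osd" among its installed language data.
# Some tesseract versions print this list on stderr, hence the 2>&1.
has_tesseract_osd() {
    tesseract --list-langs 2>&1 | grep -qx 'osd'
}
```

For example, missing_cmds tesseract convert hocr2pdf prints one line per tool that still needs installing; the command names are examples, not the full list used by scan-archive.sh.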

The text-to-scanned-image layout seemed to be better, but I need more time to figure out the best combination of command-line settings to make the result acceptable.

Common Problems

I am finding that utilities such as exactimage used with the above aren't readily available in the standard Fedora 20 repositories, so I had to search for packages to install (and then settled for old package versions). They seem to be more readily installable under Debian/Ubuntu.

The Way Forward Maybe

The entire folder-watch, OCR and searchable-PDF output toolchain needs work to allow a clean in-out. But my fear is that the complexity of getting it all working would make the time penalty prohibitive. So if anyone reads this, what options are available for automated scan to searchable PDF on Linux?

PDF to PNG Conversion

If I run convert with the defaults, I get very bad image quality:

convert scan.pdf scan.png

The fix is to set the density before the input file name, which renders the PDF at 600 pixels per inch (the default is 72):

convert -units PixelsPerInch -density 600 scan.pdf scan.png
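
To apply the same setting to a whole folder of scans, a small loop along these lines might do. It is a sketch only: the function name pdfs_to_png and its default of 600 are assumptions for illustration.

```shell
# Hypothetical helper: render every PDF in DIR to PNG at DENSITY pixels per inch.
# Usage: pdfs_to_png DIR [DENSITY]
pdfs_to_png() {
    local dir="$1"
    local density="${2:-600}"
    local pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # glob matched nothing
        # Settings must precede the input file they apply to.
        convert -units PixelsPerInch -density "$density" "$pdf" "${pdf%.pdf}.png"
    done
}
```

Calling pdfs_to_png /home/jm/scans writes scan.png next to each scan.pdf. A multi-page PDF produces numbered files such as scan-0.png and scan-1.png instead.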

Use convert -list to see your setting options

# to see the options for
# the -units setting
convert -list units
PixelsPerInch
PixelsPerCentimeter

# to see all list types
convert -list list
Align
Alpha
Boolean
Cache
Channel
Class
ClipPath
Coder
Color
Colorspace
Command
Compose
Compress
Configure
DataType
Debug
Decoration
Delegate
Direction
Dispose
Distort
Dither
Endian
Evaluate
FillRule
Filter
Font
Format
Function
Gravity
Intensity
Intent
Interlace
Interpolate
Kernel
Layers
LineCap
LineJoin
List
Locale
LogEvent
Log
Magic
Method
Metric
Mime
Mode
Morphology
Module
Noise
Orientation
PixelIntensity
Policy
PolicyDomain
PolicyRights
Preview
Primitive
QuantumFormat
Resource
SparseColor
Statistic
Storage
Stretch
Style
Threshold
Type
Units
Validate
VirtualPixel

If I'm going to script bulk OCR, then the inotifywait program from the inotify-tools package may be a good start:

http://stackoverflow.com/questions/18692134/continuously-monitor-a-directory-in-linux-and-notify-when-a-new-file-is-availabl

# Print one line for each file created in the watched directory
inotifywait -m -e create ~/somedir/ | while IFS= read -r line
do
    echo "$line"
done
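
The create event fires as soon as a file appears, which may be too early when a scanner is still uploading it. A variation that waits for the writer to close the file could look like the sketch below. The names handle_pdf and watch_dir are placeholders invented here, and it assumes the scanner (or Samba on its behalf) writes each PDF in a single open and close cycle.

```shell
# Placeholder for the real OCR step; here it only reports the file.
handle_pdf() {
    echo "would OCR: $1"
}

# Watch IN_DIR and call handle_pdf for each PDF once it has been closed
# after writing (close_write) or moved into the directory (moved_to).
watch_dir() {
    local in_dir="$1"
    local path
    inotifywait -m -q -e close_write -e moved_to --format '%w%f' "$in_dir" |
    while IFS= read -r path; do
        case "$path" in
            *.pdf|*.PDF) handle_pdf "$path" ;;
        esac
    done
}
```

Running watch_dir /home/jm/scans blocks and prints a line for each finished PDF; replacing the body of handle_pdf is where the OCR commands would go.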
