These are very brief notes to capture what I’ve been trying to achieve this evening with Linux OCR.
I’ve downloaded and installed the WatchOCR deb package from http://sourceforge.net/projects/watchocr/
To get it running on Fedora 20 I had to use alien to convert it to an RPM and then install it as root with rpm -i --force watchocr-0.7.2-2.noarch.rpm
Once I created an output directory (mkdir /home/jm/scan-out) and ran watchocr -i /home/jm/scans -o /home/jm/scan-out I found that it worked, but:
It didn’t always lay out the output text accurately on the searchable PDF (I was searching for “Invoice” in the following image).
cuneiform (a program called by watchocr) fell over with errors, and the watchocr script removed some PDFs from the input directory without a corresponding searchable output PDF being made.
Also, when my scanner was dropping a PDF in the “in” folder via SMB, watchocr would pick it up before the PDF had been fully written to the folder.
Scans that had text rotated 90 degrees from what was expected didn’t get a searchable PDF layer, so no searching…
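One possible workaround for the half-written PDF problem is to hold off until the file has stopped growing. The following is only a rough sketch: wait_until_stable is a hypothetical helper name of my own, and the interval and retry defaults are guesses that would need tuning for a real scanner and SMB share.

```shell
#!/bin/sh
# wait_until_stable FILE [INTERVAL] [TRIES]
# Hypothetical helper: returns 0 once FILE has the same size on two
# consecutive checks, 1 if FILE disappears or TRIES checks are used up.
wait_until_stable() {
    local file="$1"
    local interval="${2:-2}"   # seconds between size checks (assumed default)
    local tries="${3:-30}"     # number of checks before giving up (assumed default)
    local prev=-1
    local size

    while [ "$tries" -gt 0 ]; do
        [ -f "$file" ] || return 1
        # Arithmetic expansion strips any padding that wc adds to its output.
        size=$(( $(wc -c < "$file") ))
        if [ "$size" -eq "$prev" ]; then
            return 0
        fi
        prev=$size
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

Calling wait_until_stable /home/jm/scans/scan.pdf before handing the file to the OCR step would at least avoid the obvious race, although a scanner that pauses mid-upload for longer than the interval could still fool it.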
The WatchOCR deb package appears to be from 2011.
I am also looking at http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ which seems to have an intelligent and comprehensive commentary plus a sample script to work from.
I ran some of the commands from the scan-archive.sh script and found that I was missing the tesseract-osd package.
The alignment of the text with the scanned image seemed to be better, but I need more time to figure out the best combination of command line settings to make the result acceptable.
I am finding that utilities such as exactimage used with the above aren’t readily available in the standard Fedora 20 repositories, so I had to search to install them (and then chose old package versions). They seem to be more readily installable under Debian/Ubuntu.
The Way Forward Maybe
The entire folder watch, OCR and searchable PDF output toolchain needs to be worked on to allow a clean in-out. But my fear is that the complexity of getting it all working would make the time penalty prohibitive. So if anyone reads this, what options are available on Linux for automated scan to searchable PDF?
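For what it’s worth, a “clean in-out” step could be sketched along these lines. This is an assumption about how I might structure it, not something taken from watchocr: process_one is a hypothetical wrapper, and the OCR command is passed in as a placeholder that is assumed to accept an input path and an output path as its last two arguments. The point is that the input file is only moved away once a non-empty output exists, so a crash in the OCR program can’t lose a scan.

```shell
#!/bin/sh
# process_one INPUT OUTDIR DONEDIR CMD [ARGS...]
# Hypothetical wrapper: runs "CMD ARGS... INPUT OUTPUT.part" and only moves
# INPUT into DONEDIR when a non-empty output file was produced.
process_one() {
    local src="$1"
    local outdir="$2"
    local donedir="$3"
    shift 3

    local out
    out="$outdir/$(basename "$src")"
    local tmp="$out.part"   # write to a temporary name first

    if "$@" "$src" "$tmp" && [ -s "$tmp" ]; then
        # Success: publish the output, then retire the input.
        mv "$tmp" "$out" && mv "$src" "$donedir/"
    else
        # Failure: discard any partial output and leave the input untouched.
        rm -f "$tmp"
        echo "OCR failed, leaving input in place: $src" >&2
        return 1
    fi
}
```

Using cp as a stand-in for the OCR command, process_one /home/jm/scans/a.pdf /home/jm/scan-out /home/jm/scan-done cp would copy the file to the output directory and move the original to scan-done (a directory I have made up for the example).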
PDF to PNG Conversion
If I run convert with the defaults I get very bad image quality
convert scan.pdf scan.png
The fix is to run the following, which renders the PDF at 600 pixels per inch (the density value has to come directly after -density)
convert -density 600 -units PixelsPerInch scan.pdf scan.png
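To apply the same conversion to a whole folder of scans, a small loop might do. This is a sketch only: pdfs_to_png and the DRY_RUN variable are names I have invented, and 600 is simply the density used above. Setting DRY_RUN prints each convert command instead of running it, which is handy for checking the file names first.

```shell
#!/bin/sh
# pdfs_to_png DIR [DPI]
# Hypothetical helper: converts every *.pdf in DIR to PNG at DPI (default 600).
# If DRY_RUN is set to a non-empty value, the commands are printed, not run.
pdfs_to_png() {
    local dir="$1"
    local dpi="${2:-600}"
    local pdf

    for pdf in "$dir"/*.pdf; do
        # When nothing matches, the glob is left as a literal string; skip it.
        [ -e "$pdf" ] || continue
        # For multi-page PDFs, convert writes name-0.png, name-1.png, and so on.
        ${DRY_RUN:+echo} convert -density "$dpi" -units PixelsPerInch \
            "$pdf" "${pdf%.pdf}.png"
    done
}
```

For example, DRY_RUN=1 followed by pdfs_to_png /home/jm/scans 300 would list the commands for a 300 pixels per inch run without touching any files.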
Use convert -list to see the values a setting accepts
# to see the options for the -units setting
convert -list units
PixelsPerInch
PixelsPerCentimeter

# to see all list types
convert -list list
Align Alpha Boolean Cache Channel Class ClipPath Coder Color Colorspace Command Compose Compress Configure DataType Debug Decoration Delegate Direction Dispose Distort Dither Endian Evaluate FillRule Filter Font Format Function Gravity Intensity Intent Interlace Interpolate Kernel Layers LineCap LineJoin List Locale LogEvent Log Magic Method Metric Mime Mode Morphology Module Noise Orientation PixelIntensity Policy PolicyDomain PolicyRights Preview Primitive QuantumFormat Resource SparseColor Statistic Storage Stretch Style Threshold Type Units Validate VirtualPixel
If I’m going to create a scripted bulk OCR then using the inotify-tools package and the inotifywait program may be a good start….
inotifywait -m -e create ~/somedir/ | while read -r line
do
    echo "$line"
done
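One caveat with the create event is that it fires as soon the file appears, which would probably reproduce the half-written PDF problem I hit with watchocr. Watching for close_write (and moved_to, for files renamed into place) may be a better fit. The sketch below assumes the SMB server closes the file when the upload finishes, which I haven’t verified; watch_pdfs and is_pdf are hypothetical names, and the handler is whatever command should receive each finished PDF.

```shell
#!/bin/sh
# is_pdf PATH: succeeds when PATH ends in .pdf or .PDF.
is_pdf() {
    case "$1" in
        *.pdf|*.PDF) return 0 ;;
        *) return 1 ;;
    esac
}

# watch_pdfs DIR HANDLER
# Hypothetical helper: runs HANDLER with the full path of each PDF in DIR
# once it has been closed after writing, or moved into DIR.
watch_pdfs() {
    local dir="$1"
    local handler="$2"
    local path

    # -m keeps watching, -q suppresses the startup messages, and
    # --format '%w%f' prints the watched directory followed by the file name.
    inotifywait -m -q -e close_write -e moved_to --format '%w%f' "$dir" |
    while IFS= read -r path; do
        if is_pdf "$path"; then
            "$handler" "$path"
        fi
    done
}
```

Running watch_pdfs ~/somedir echo would just print each completed PDF path, which is a reasonable first test before wiring in a real OCR step.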