OCR with OCRopus and Tesseract

While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I’ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative.

I remembered seeing a few years ago that HP had open-sourced their OCR engine, Tesseract, development of which has now been taken over by Google. Tesseract is supposedly very good at what it does, namely, recognizing characters in images.

Tesseract does, however, lack many essential features found in modern OCR software, including document layout analysis and output formatting. That’s where OCRopus comes in. I think of it as a wrapper around Tesseract that handles the layout analysis and produces formatted output. In truth, it can do much more than that, and different OCR engines and other components can be plugged into OCRopus, but that simplification works for my purposes.

Usage

Use OCRopus with a simple call from the command line:

$ ocroscript recognize /path/to/file.png > /path/to/output.html
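
With 1,500 pages to get through, a loop around that call is the obvious next step. Here is a minimal sketch, assuming the page images are PNG files in a single directory. The function name ocr_batch and the one-.html-per-image naming are my own choices for illustration, not something OCRopus provides.

```shell
# Run "ocroscript recognize" on every PNG in a directory.
# ocr_batch is a hypothetical helper; only the ocroscript call comes from above.
ocr_batch() {
    in_dir=$1
    out_dir=$2
    failed=0
    mkdir -p "$out_dir" || return 1
    for img in "$in_dir"/*.png; do
        [ -e "$img" ] || continue      # no PNGs: the glob stays literal, so skip it
        out="$out_dir/$(basename "$img" .png).html"
        if ! ocroscript recognize "$img" > "$out"; then
            echo "failed: $img" >&2    # keep going; one bad page should not stop the batch
            failed=1
        fi
    done
    [ "$failed" -eq 0 ]                # non-zero exit status if any page failed
}
```

Calling ocr_batch /path/to/images /path/to/hocr writes one output file per image and lists failed pages on stderr instead of aborting the whole run.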

OCRopus will work its magic on file.png and give you an hOCR file. hOCR uses class and title attributes in an otherwise simple HTML file to embed layout information into the recognized text. I hope soon to create a script to transform the hOCR into a PDF; I’ll post more when it’s ready.
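
To make the hOCR idea concrete: as I understand the format, each recognized element carries its bounding box in the title attribute, in the form bbox x0 y0 x1 y1. The sketch below pulls those boxes, or just the plain text, out of a file with grep and sed. The function names are hypothetical, and it assumes every tag sits on a single line; a real HTML parser would be more robust.

```shell
# Print every bounding box found in an hOCR file, one "x0 y0 x1 y1" per line.
hocr_bboxes() {
    grep -o 'bbox [0-9][0-9]* [0-9][0-9]* [0-9][0-9]* [0-9][0-9]*' "$1" |
        sed 's/^bbox //'
}

# Strip the markup and blank lines, leaving only the recognized text.
# Character entities such as &amp; are left encoded.
hocr_text() {
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$1"
}
```

Running hocr_text /path/to/output.html is a quick way to eyeball the recognition quality without opening a browser.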

Installation

The trickiest part of using OCRopus is the installation. There are quite a few dependencies and some inaccurate documentation, so I made a few wrong turns along the way. Fortunately, I remembered to document what I was doing as I went. The instructions below are the steps needed to get a working installation of OCRopus on Linux Mint as of 2009-03-27. For the record, I’m starting in /var/tmp.
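
Before starting, it may save a wrong turn to confirm that the command-line tools used in the steps below are actually on the PATH. This is a small sketch using the shell's command -v builtin. missing_tools is a made-up helper name, and the tools to check are simply the ones the commands in this post invoke.

```shell
# Print the name of each given command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, missing_tools svn wget tar make patch prints nothing when all five are available, and otherwise prints one missing name per line.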

Install Tesseract

As mentioned above, Tesseract is the OCR engine that powers OCRopus.

$ svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
$ cd tesseract-ocr-read-only
$ ./configure
$ make
$ sudo make install
$ cd ..

Install iulib

iulib provides some basic image processing libraries used by OCRopus.

$ svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
$ cd iulib
$ sudo apt-get install scons
$ sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libavcodec-dev libavformat-dev libsdl-gfx1.2-dev libsdl-image1.2-dev
$ sudo apt-get install imagemagick
$ scons
$ sudo scons install
$ cd ..

Install Leptonica

Leptonica provides more image processing and layout analysis capabilities.

$ wget http://leptonica.googlecode.com/files/leptonlib-1.60.tar.gz
$ tar xvzf leptonlib-1.60.tar.gz
$ cd leptonlib-1.60
$ ./configure
$ make
$ sudo make install
$ cd ..

Install OpenFST

OpenFST provides language modeling code to OCRopus. Note that this takes a while (a couple of hours for me) to compile.

$ wget http://mohri-lt.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.1.tar.gz
$ tar xvzf openfst-1.1.tar.gz
$ cd openfst-1.1
$ ./configure
$ make
$ sudo make install
$ cd ..
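
If the compile time hurts, make can run several jobs at once with its -j option. I have not checked whether the OpenFST 1.1 build is safe to parallelize, so treat this as a sketch under that assumption. build_jobs is a hypothetical helper that asks getconf for the number of online processors and falls back to 1.

```shell
# Print the number of online processors, or 1 if it cannot be determined.
build_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;    # unexpected output: fall back to a serial build
    esac
    [ "$n" -ge 1 ] || n=1      # guard against a reported count of zero
    echo "$n"
}
```

The plain make step would then become make -j"$(build_jobs)". If the parallel build fails, going back to plain make is the safe choice.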

Install OCRopus

We now have all our dependencies installed, so it’s time to install OCRopus.

$ sudo apt-get install libeditline-dev
$ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
$ cd ocropus
$ ./configure
$ make
$ sudo make install

Update (2009-04-01): OCRopus is still young and has many bugs. One particularly annoying bug is also quite easy to fix: the DOCTYPE declaration in the hOCR output is missing some quotes, which makes the XHTML invalid. I’ve submitted a patch. In the meantime, here are slightly revised installation instructions, picking up in the ocropus directory:

$ wget http://xplus3.net/downloads/fix_ocropus_doctype.diff
$ patch -p0 -i fix_ocropus_doctype.diff
$ ./configure
$ make
$ sudo make install
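
Since the diff was made against one particular state of trunk, it may not apply cleanly to a later checkout. GNU patch has a --dry-run option that reports what would happen without touching any files, so a more cautious version of the patch step could look like the sketch below. apply_if_clean is a hypothetical helper, and it assumes GNU patch.

```shell
# Apply a -p0 patch only if a dry run succeeds, so a failed patch
# never leaves the source tree half-modified.
apply_if_clean() {
    diff_file=$1
    if patch -p0 --dry-run -i "$diff_file" >/dev/null; then
        patch -p0 -i "$diff_file"
    else
        echo "patch does not apply cleanly: $diff_file" >&2
        return 1
    fi
}
```

In the ocropus directory, apply_if_clean fix_ocropus_doctype.diff would stand in for the patch line above.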
