Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? hOCR is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed (as well as the path to that image on my machine). From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text (given as the left, top, right, and bottom pixel coordinates), as well as the content of that line.
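To make that concrete, here is a minimal sketch of pulling the page and line bounding boxes out of an hOCR document using only the Python standard library. It is not the code from the script described below; the names parse_bbox and read_hocr are made up for illustration, and the sample is a trimmed copy of the example above (without the DOCTYPE and head).

```python
import re
import xml.etree.ElementTree as ET

# hOCR output like the example above is XHTML, so every tag is namespaced.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(g) for g in match.groups()) if match else None


def read_hocr(hocr_string):
    """Return one dict per ocr_page with its bbox and its (bbox, text) lines."""
    root = ET.fromstring(hocr_string)
    pages = []
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") != "ocr_page":
            continue
        lines = [
            (parse_bbox(span.get("title")), "".join(span.itertext()).strip())
            for span in div.iter(XHTML_NS + "span")
            if span.get("class") == "ocr_line"
        ]
        pages.append({"bbox": parse_bbox(div.get("title")), "lines": lines})
    return pages


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
<span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
<span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
</div></body></html>"""

page = read_hocr(SAMPLE)[0]
print(page["bbox"])
for bbox, text in page["lines"]:
    print(bbox, text)
```

The page bbox gives the image size in pixels, and each line comes back as a bounding box paired with its recognized text.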

That information, along with a copy of the original image (or a sufficiently similar image) is enough to create a PDF of the image with selectable text.
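The one piece of arithmetic involved is a change of coordinate system: hOCR boxes are in pixels measured from the top-left corner of the image, while PDF coordinates are in points (1/72 inch) measured from the bottom-left corner of the page. The sketch below shows that conversion. The function name is hypothetical, and the 300 DPI default is an assumption (the example page of 2548 x 3300 pixels is roughly a letter-size sheet at that resolution); it is not necessarily how any particular converter does it.

```python
def bbox_to_pdf_points(bbox, page_height_px, dpi=300):
    """Convert an hOCR pixel bbox to (x, y, width, height) in PDF points.

    bbox is (x0, y0, x1, y1) with the origin at the top-left of the image.
    The returned x, y is the bottom-left corner of the box in PDF space.
    The dpi value is an assumption about the scanned image.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi                     # points per pixel
    x = x0 * scale
    y = (page_height_px - y1) * scale      # flip the vertical axis
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return x, y, width, height


# First line of the example document, on a page 3300 pixels tall.
x, y, width, height = bbox_to_pdf_points((659, 143, 863, 177), 3300)
print(x, y, width, height)
```

For the first line of the example, that places the text box about 158 points from the left edge and about 750 points up from the bottom of the page.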

Creating a PDF

A while back, Florian Hackenberger created a basic hOCR to PDF converter in Java. It does its job reasonably well, but it’s somewhat rough around the edges. Since I’m more familiar with Python than Java, it seemed like a good idea to rewrite the code in Python, so I could hack on it more comfortably.

And so I present to you, dear reader, the hOCR Converter Python script (note: this script depends on the ReportLab PDF Library, which, in turn, depends on the FreeType 2 Font Engine). Right now, it has two primary functions:

Given an hOCR file, create a text-only (i.e., no HTML) document
Given an hOCR file and an image, create a PDF

Usage is pretty simple:

from HocrConverter import HocrConverter
hocr = HocrConverter("myHocrFile.html")
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

You end up with a text file at output.txt containing the contents of the body of the hOCR document and a PDF file at output.pdf containing your image superimposed on top of the text.
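If you are curious what the text-only conversion amounts to, the idea can be sketched in a few lines with the standard library. This is an illustration under my own assumptions, not the script's actual implementation; hocr_body_text is a hypothetical name, and the sample is a trimmed copy of the earlier example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def hocr_body_text(hocr_string):
    """Return the text inside <body>, one non-empty chunk per line."""
    root = ET.fromstring(hocr_string)
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    # itertext() also yields the whitespace between tags, so normalize
    # each chunk and drop the ones that end up empty.
    chunks = (" ".join(chunk.split()) for chunk in body.itertext())
    return "\n".join(chunk for chunk in chunks if chunk)


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 2548 3300">
<span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
<span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
</div></body></html>"""

text = hocr_body_text(SAMPLE)
print(text)
```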

Note that the image you use to create the PDF need not be the same image used for the OCR. If, for example, you used a 300 or 400 DPI image for the OCR, but you want a smaller file for the PDF, you can create a 72 DPI version of the image and feed that through the script instead of the original.

With a simple script that iterates through a directory of images, calling OCRopus and this script for each image (maybe with some ImageMagick and pdftk thrown in for good measure), you can quickly OCR a large batch of images and convert them to searchable PDF documents.
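As a rough sketch of the batch step, the code below walks a directory, pairs each image with a same-named hOCR file, and hands each pair to a converter function. It assumes the OCR step has already been run and has written page1.html next to page1.png; that naming scheme, and the function names find_pairs and convert_all, are my own assumptions for illustration. The converter is passed in as a callable so the sketch runs on its own; the demo uses a stand-in that only records its arguments.

```python
import tempfile
from pathlib import Path


def find_pairs(directory, image_ext=".png", hocr_ext=".html"):
    """Return (image, hocr) path pairs for images with a same-named hOCR file."""
    pairs = []
    for image in sorted(Path(directory).glob("*" + image_ext)):
        hocr = image.with_suffix(hocr_ext)
        if hocr.exists():
            pairs.append((image, hocr))
    return pairs


def convert_all(directory, convert):
    """Call convert(hocr_path, image_path, pdf_path) for each pair found."""
    outputs = []
    for image, hocr in find_pairs(directory):
        pdf = image.with_suffix(".pdf")
        convert(str(hocr), str(image), str(pdf))
        outputs.append(pdf)
    return outputs


# With the script from this post, the converter could be written as:
#   def convert(hocr_path, image_path, pdf_path):
#       HocrConverter(hocr_path).to_pdf(image_path, pdf_path)

# Demo with a stand-in converter that only records what it was asked to do.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page1.png", "page1.html", "page2.png"):
        Path(tmp, name).touch()
    outputs = convert_all(
        tmp,
        lambda h, i, p: calls.append((Path(h).name, Path(i).name, Path(p).name)),
    )
print(calls)
```

Images without a matching hOCR file (page2.png in the demo) are skipped rather than treated as errors, which makes it easy to rerun the script after a partial OCR pass.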

Future Development

I’d like to add the capability to convert the output of various OCR programs into hOCR format. For example, OmniPage, which is much better than OCRopus when it comes to layout analysis for complex images, can output documents in its proprietary XML schema. I should be able to transform that into an hOCR document and then use this script to create a PDF from there.

I would welcome any other suggestions or contributions.

Download the HocrConverter script from GitHub
