Turning PostScript into Useful Text

Adam M. Costello
Dec 1996

PostScript, originally designed strictly as a language for instructing printers where to put ink on a page, has become one of the most common document interchange formats, mainly because it is one of the few formats that virtually everyone can view (thanks in large part to the authors of Ghostscript). Unfortunately, it was not designed to be an interchange format, and suffers from some severe limitations. The fact that it's a full-blown programming language makes it very efficient for driving printers, but also makes PostScript “documents” uneditable. You can't even re-wrap the columns to fit your screen. And most appallingly and frustratingly (in my opinion), you can't select and copy text from it!

Or can you?

What I Did

First, let me issue a disclaimer: I don't think we ought to be bending over backwards to use PostScript in situations where it is clearly not suited. On the other hand, I like a challenge, and I love a good hack. The code in a PostScript file is telling the graphics engine which glyphs to paint where, so in theory it must be possible to extract enough information to allow a user to select text from a displayed page. I made it my mission. And my CS 294-5 class project.

One (small) aspect of Professor Wilensky's Digital Library project is selecting text from bit images, which is similar to my problem. He has pointed out that a feasible and successful method for extracting useful textual data from a PostScript file is to render it and then run Optical Character Recognition (OCR) software on the image. The main drawback I see is the expense, in both computation time and money (I know of no free OCR programs, probably because OCR is very difficult). So I retained my motivation to convert directly from PostScript.

I correctly suspected that there was some prior art. The first thing I found was ps2ascii.ps, a tool bundled with Ghostscript for extracting ASCII text from a PostScript file. From this example I learned the basic mechanism: redefining the PostScript operators that draw glyphs. I also learned the main difficulty: PostScript files are free to use any mapping from integers to glyph names to actual glyphs, and sometimes the ones they choose are not very enlightening (especially in files produced by dvips, which also creates fonts without bounding boxes, containing glyphs that crash the PostScript interpreter). ps2ascii.ps “solves” the dvips problem with a hack as frightful as it is clever: by redefining a few of dvips's private internal procedures before dvips calls them.

ps2ascii.ps did not provide all the features I needed, so I set about modifying it (after a crash course in PostScript) to give me more suitable information. It had already had three authors, though, and the code was quite difficult to comprehend. It appears to have been originally aimed at extracting plain text, with options for providing more detailed geometric information, but the interactions between the two tasks were subtle. I managed to get it working well enough to feed its output into a Java applet (after a crash course in Java) that associated the geometric information with the bit image (generated by Ghostscript), enabling me to select text. (Yay!)
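Associating geometric information with a bit image comes down to a coordinate conversion followed by a hit test. The sketch below shows one way this could look in Python; it is not the applet's code. It assumes word boxes given in default PostScript units (points, origin at the bottom-left, y increasing upward), an image that covers the whole page with its origin at the top-left, and hypothetical names Word, to_pixels, and word_at.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    llx: float  # lower-left x, in points
    lly: float  # lower-left y, in points
    urx: float  # upper-right x, in points
    ury: float  # upper-right y, in points


def to_pixels(word, page_height_pt, dpi):
    """Convert a word box from page points to image pixels.

    Returns (left, top, right, bottom) with y increasing downward,
    assuming the image covers the full page with no cropping.
    """
    scale = dpi / 72.0  # 72 points per inch
    left = word.llx * scale
    right = word.urx * scale
    top = (page_height_pt - word.ury) * scale     # flip the y axis
    bottom = (page_height_pt - word.lly) * scale
    return left, top, right, bottom


def word_at(words, px, py, page_height_pt=792.0, dpi=72.0):
    """Return the first word whose pixel box contains (px, py), or None."""
    for word in words:
        left, top, right, bottom = to_pixels(word, page_height_pt, dpi)
        if left <= px <= right and top <= py <= bottom:
            return word
    return None
```

The default page height of 792 points corresponds to US Letter paper; a real tool would take the page size from the file being rendered.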

Just as I was thinking about how I would enhance this tool set, I stumbled upon some more prior art: The Virtual Paper project at DEC. Basically, they had already built what I was trying to build: a tool, Lectern, that associated geometric layout information with a bit image, allowing one to select and copy the text. It also comes with a companion tool, BuildLectern, for converting PostScript files. It was exactly what I wanted. Almost. It has a few bugs, and is lacking a few features, and is written in Modula-3! I couldn't quite bring myself to invest the effort to fix something that would not be very portable.

On the bright side, part of BuildLectern was ocr.ps, a tool very much like ps2ascii.ps, but much easier to read, smaller, and (I think) more robust in its methods. It was designed only to extract raw information from a PostScript file; it didn't even have a plain text mode. I decided to abandon my development of ps2ascii.ps and start building on ocr.ps (which is why I provide no links to my hacked ps2ascii.ps, which is not ready for release and never will be). I call the new tool pstext.ps.

The last prior art I have to mention is the one I've known about the longest, Multivalent Documents, from the Digital Library project. MVDs are already in the habit of selecting text from a bit image (besides doing many other nifty things with them), so it was sensible to try to convert PostScript to MVD.

I'm a hair's breadth from a truly working demo. I can convert at least one PostScript file to an MVD, except that the applet thinks that the words are all shifted down from where they really are, which makes selecting the text rather tricky. A few more hours...

Here are the files involved in the conversion:

pstext.ps version 0.1.0
This is unfinished, rough, flaky, not ready for prime time!
psttoxdoc.c
A filter that converts the output of pstext.ps into a minimalist XDOC file (XDOC is a Xerox format used for OCR output)
pstomvd
A Bourne shell script that invokes Ghostscript, pstext.ps, psttoxdoc, and some netpbm utilities in order to generate a Multivalent Document from a given PostScript file.
unix.ps.gz
This is the one PostScript file on which I have tested pstomvd. The PostScript file was generated by FrameMaker.

What I Learned

I learned PostScript and Java, which are very nice fringe benefits.

I learned that PostScript files contain even more information than I expected, namely, the order of the characters. The programs that generate the PostScript are under no obligation to draw the characters in any nice order, but they almost always do, which saves me a lot of two-dimensional sorting work.
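For a file that did not draw its characters in a nice order, the two-dimensional sorting mentioned above might look something like the following Python sketch. It is not part of pstext.ps. It assumes a single-column page, glyphs reported as (character, x, y) with y increasing upward, and a hypothetical function name reading_order with a made-up tolerance for deciding that two glyphs share a baseline.

```python
def reading_order(glyphs, line_tolerance=2.0):
    """Group glyphs into lines and return the lines top to bottom.

    glyphs: iterable of (char, x, y) tuples, y increasing upward.
    Glyphs whose y lies within `line_tolerance` of the first glyph of
    the current line are treated as belonging to that line.
    """
    lines = []  # each entry is [baseline_y, [(x, char), ...]]
    # Visit glyphs from the top of the page to the bottom.
    for char, x, y in sorted(glyphs, key=lambda g: -g[2]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    # Within each line, order the glyphs from left to right.
    return [
        "".join(char for _, char in sorted(items, key=lambda item: item[0]))
        for _, items in lines
    ]
```

Multi-column layouts, rotated text, and superscripts would all defeat a scheme this simple, which is part of why it is a relief that generators usually emit characters in reading order already.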

On the other hand, by looking at the XDOC format I learned that there is a lot more potentially useful information to be reconstructed than I had anticipated: for example, the outlines of text regions, whether line breaks are hard or soft, whether a line is a heading, etc. I think perhaps recognizing characters is not the most difficult task of OCR software, despite the name, and that if you're willing to spend the time and money, you'll get what you pay for.

What I Would Do Differently

I would not wait to do all the work in the last couple of weeks. :)

What is Left to be Done

Someday I'd like to finish pstext.ps. It already has features that neither ocr.ps nor ps2ascii.ps has, like reporting individual glyph positions and glyph names, and support for Unicode, not just Latin-1 or ASCII. I'd also like to provide bounding boxes for individual glyphs, which is tricky because some fonts (like those produced by dvips) use images instead of paths.

The next step will be to write a PostScript viewer that uses pstext.ps to allow the user to select and search for text.

Prepared by Adam M. Costello
Last modified: 1998-Sep-07-Mon 23:36:00 GMT