This is a guest post by Taylor Jardno, a second-year doctoral student in Yale University’s Department of History. Her current work focuses on youth militancy and comic book culture in Argentina. She can be contacted at taylor [dot] jardno [at] yale [dot] edu.

After the first research recon trip of my graduate school career, I found myself with over 3,000 digital images and the pressing problem of how to organize them. I decided the best way to attack the issue would be by creating multi-page, searchable PDF files from my pictures. There seems, however, to be no authoritative guide to digitizing documents, which is why I’ve documented my trials here.

Warnings and disclaimers: This is still a work in progress, and I plan to update it as I keep improving the process. The software here can be downloaded in trial versions, most of which are good for 30 days, though sometimes with limited capabilities. If you love the software, especially programs made by smaller companies, buy it.

Book Restorer

The first program I came upon in some DIY scanning forums is BookRestorer by is2 Digibook. This program has a very old (like pre-Windows 95) graphic interface and heavy processing requirements. What’s great about it is that it’s built specifically for scanned texts and can correct distortions introduced by the scanning process, such as the geometric curve of a not-completely-flat book. Apart from this feature, I don’t think BookRestorer is worth it for the average graduate student. Its other corrective capabilities are limited; for example, the black and white (B/W) filter is unpredictable and not customizable, and undoing one feature will undo everything. Lastly, and probably most importantly, I just stumbled across a librarian forum that made mention of a $5,000-$6,000 price range.

Pros: Designed to manipulate and correct scanned text; processes batch commands
Cons: Very limited trial version; interface not user-friendly; Windows only; costs more than your car

Adobe Lightroom

Unsatisfied with BookRestorer, I moved on to a trial of Adobe Lightroom, which came with great reviews from some of my colleagues. The program feels like a mix between Aperture, iPhoto and an idiot-proof Photoshop with easy options and good results. One of Lightroom’s brilliant features is that the edits you make are not saved to your original photo file, which remains intact. You have to export your edited images as new files. Though this can eat through your hard drive space, it might save you from making a really horrible, terrible mistake. This “ghost processing” also means that your images stay in their original locations and are not regrouped and hidden in a single, impenetrable place (like iPhoto’s library). The program is also a pretty good deal ($99 for students and educators).

Most outstanding about this program is the incredible ability to apply settings in a batch to all your files. This means if you tweak the contrast, brightness, clarity, boldness of blacks, etc. on one page/photo, you can select your other pages, hit the “Sync” button and the program will match these levels throughout the batch. Lightroom’s controls for cropping and distortion correction are also fairly easy to use. As you play with sliding scales for features like distortion (which balloons in or out) and vertical, horizontal and rotational transformations, Lightroom places an opaque grid over your image to help with alignment.

In the end, I can only mention three negatives: it’s really cumbersome on netbooks (though great on the Mac), there’s no dedicated panel or set of controls for images with text, and I’m still left with JPG files…no PDF.

Pros: User-friendly interface; batch processing; affordable educational license; original files preserved while editing
Cons: Runs slowly on netbooks; no text-specific adjustment settings; no PDF output

ABBYY FineReader Professional

Another friend/colleague turned me on to ABBYY FineReader, an application also popular in scanning/DIY forums. The makers bill FineReader as an OCR program (Optical Character Recognition – a process that turns your PDF/image’s text into selectable fields that you can copy-and-paste into other text editing programs), but I’ve found that the program is great for editing and manipulating images as well.

Upon opening the program, a menu gives you several choices of “tasks,” through which you can convert files into a searchable PDF. Before you click, however, you must select a language (actually, up to three at a time from a long list of commonly read and spoken languages) for text recognition. As it processes, the program cleans up the images and rotates them into correct position. On my netbook, this process took around an hour for 60-70 pages and on my MacBook Pro (running Windows 7 via VMware Fusion) around 35-40 minutes for 90 pages.
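To get a feel for what those timings mean for a whole research trip, here is a rough back-of-the-envelope sketch in Python. It assumes processing time scales linearly with page count and uses the midpoints of the ranges quoted above; the machine labels and function names are made up for illustration and have nothing to do with FineReader itself.

```python
# Rough throughput figures taken from the timings quoted above.
# Midpoints are used where the post gives a range; treat them as estimates.
TIMINGS = {
    "netbook": {"minutes": 60.0, "pages": 65.0},         # ~1 hour for 60-70 pages
    "macbook_pro_vm": {"minutes": 37.5, "pages": 90.0},  # 35-40 minutes for 90 pages
}


def minutes_per_page(machine: str) -> float:
    """Average processing time per page for the given machine."""
    timing = TIMINGS[machine]
    return timing["minutes"] / timing["pages"]


def estimate_hours(machine: str, total_pages: int) -> float:
    """Estimated hours for total_pages, assuming time grows linearly."""
    return minutes_per_page(machine) * total_pages / 60.0


if __name__ == "__main__":
    for machine in TIMINGS:
        hours = estimate_hours(machine, 3000)  # the ~3,000 images from the trip
        print(f"{machine}: about {hours:.1f} hours for 3,000 pages")
```

Under these assumptions, the full set of 3,000 images is a multi-day job on either machine, so it probably pays to process one archive box or folder at a time.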

After processing, you have a chance to edit the pages before exporting to PDF. Clicking the “edit image” button will allow you to manipulate each page individually or, like Lightroom, in a batch. My favorite feature is the “correct trapezium distortion” option, which allows you to drag a re-sizable box (like a crop box) around the portions of the page you’d like to keep. FineReader then analyzes the angle of the text and page and, assuming that you took the photo at a slight angle that exaggerated the bottom of the page and narrowed the top (like taking a picture of a tall building while standing at its base), magically corrects everything. While some of the text might occasionally be a bit “off” in appearance (though still readable), I think on average it outperforms the manual controls found in Lightroom. You can also opt to “remove noise” from a picture taken by a shaky hand and “straighten text lines,” my second favorite feature of the program.
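For the curious, the geometry behind this kind of correction can be sketched in a few lines of Python. This is not FineReader’s actual method, which isn’t documented here; it is a generic perspective (homography) mapping that sends the four corners of a trapezium-shaped page onto a rectangle. The corner coordinates and function names are hypothetical, chosen only for illustration.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 8 coefficients mapping four src corners onto four dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def warp_point(h, x, y):
    """Apply the perspective mapping h to a single point."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical page corners in a photo (top edge narrower than the bottom),
# listed as top-left, top-right, bottom-right, bottom-left, in pixels.
PAGE_CORNERS = [(200, 100), (800, 100), (950, 1300), (50, 1300)]
# The upright rectangle we want the page to fill.
FLAT_CORNERS = [(0, 0), (900, 0), (900, 1200), (0, 1200)]

if __name__ == "__main__":
    h = homography(PAGE_CORNERS, FLAT_CORNERS)
    for corner in PAGE_CORNERS:
        u, v = warp_point(h, *corner)
        print(corner, "->", (round(u, 3), round(v, 3)))
```

A real tool would apply the same mapping to every pixel and resample the image, which is presumably why some text ends up looking a bit “off” after correction.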

The downside to this program is that there is no way to adjust the brightness, contrast, and other general features of each page’s appearance. For example, I find that applying a B/W filter makes the pages more readable and will be better than color if I ever have to print the document. Another negative is its $199 price tag (no student discount), though the demo version is available for 15 days and allows 50 pages per batch process.

Pros: Converts files with text in images (JPGs, TIFFs, PDFs, etc.) to searchable PDFs seamlessly; specific controls for straightening, flattening, and correcting the angles of text; batch processing
Cons: Windows only (Mac version does not have image editing capabilities, just OCR conversion); no controls for adjusting color/contrast; pricey; needs to be used in tandem with a photo editor for more advanced adjustments

The Kludge

Here’s my tentative two-program, three-step process: FineReader, Lightroom, FineReader.

1.) I import my files to FineReader and occupy myself for 45 minutes while it processes. Then I edit each image individually for shape/position, then batch edit to straighten text lines and deskew.

2.) Next, I export the image files to a new folder and import this folder into Lightroom. Here I apply a customized B/W filter in a batch process. I check the images out, make sure everything is readable, then export back to the folder I previously created.

3.) Last, I reimport to FineReader and save as a PDF. I open the PDF in Preview to check everything out again and test the search capability.

If I’m satisfied with the results, I ditch the exported image files (I keep all my originals on an external drive) and I’m left with a PDF file about 50MB in size for 100 pages. If you’re a Zotero-ite but are worried about eating up your free space, the file sizes can be reduced with other programs, but that’s for another post.
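If storage is the worry, the figure of roughly 50MB per 100 pages makes for easy arithmetic. The Python sketch below is only an estimate built on that one number: the function names are illustrative, and the quota in the example is a hypothetical placeholder, not Zotero’s actual allowance.

```python
MB_PER_PDF = 50.0    # "about 50MB" per finished PDF, from the post
PAGES_PER_PDF = 100  # "for 100 pages", from the post


def mb_per_page() -> float:
    """Average size of one page in the finished PDF."""
    return MB_PER_PDF / PAGES_PER_PDF


def total_size_mb(pages: int) -> float:
    """Estimated total size in MB for the given number of pages."""
    return pages * mb_per_page()


def pages_that_fit(quota_mb: float) -> int:
    """How many pages fit into a storage quota of quota_mb megabytes."""
    return int(quota_mb // mb_per_page())


if __name__ == "__main__":
    print(f"3,000 pages: about {total_size_mb(3000):.0f} MB")
    # 300 MB is a made-up quota used purely as an example.
    print(f"A 300 MB quota holds about {pages_that_fit(300.0)} pages")
```

At half a megabyte per page, the 3,000 images from one trip come to roughly 1.5 GB of PDFs before any compression.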

Do you have a method for digitizing archival work? Describe your process in the comments!

[Image by Flickr user Chris Reichenecker and used under the Creative Commons license]



4 Responses to Review Guide: Software for Digital Image Archiving

  1. GradHacker says:

    This has been a very useful guide to think about how I will organize and store my summer research. The trick will be doing this as I read them rather than saving it until later. ¡Gracias!

  2. Kevin says:

    For batch editing document scans, you may want to try the open-source post-processor Scan Tailor.

    I’ve had very good luck with it. It’s free and there are Linux and Windows versions.

  3. Enrique says:

    There’s also the free option.

    Scan Tailor is really quite amazing. It’s made for photographs of documents, and it’s all very automated. But this is only useful if one wants really clean, B/W and straight pages.

    To keep the texture of the archival material, I would just import the images to gscan2pdf, run OCR (the only step that takes a while, and it usually works), then compile, name and export as PDF, and compress if necessary.

    A 50-page file comes to around 5-10 MB for Zotero-ites. 70 dpi is good enough for screen reading.

    OCR is useful for later Zotero searching, which trawls through the PDFs as well. That’s great if it’s a long list of names that at some point might be useful to track down.

  4. Mac Adobe says:

    Hello my friend! I want to say that this article is awesome, nice written and include approximately all significant infos. I’d like to see more posts like this .
