This is a guest post by Taylor Jardno, a second-year doctoral student in Yale University’s Department of History. Her current work focuses on youth militancy and comic book culture in Argentina. She can be contacted at taylor [dot] jardno [at] yale [dot] edu.
After the first research recon trip of my graduate school career, I found myself with over 3,000 digital images and the pressing problem of how to organize them. I decided the best way to attack the issue would be by creating multi-page, searchable PDF files from my pictures. There seems, however, to be no authoritative guide to digitizing documents, which is why I’ve documented my trials here.
Warnings and disclaimers: This is still a work in progress, and I plan to update it as I keep improving the process. The software here can be downloaded in trial versions, most of which are good for 30 days, though sometimes with limited capabilities. If you love the software, especially programs made by smaller companies, buy it.
The first program I came upon in some DIY scanning forums is BookRestorer by is2 Digibook. This program has a very old (like pre-Windows 95) graphical interface and heavy processing requirements. What’s great about it is that it’s designed specifically for scanned texts and can correct distortions introduced during scanning, such as the geometric curve of a not-completely-flat book. Apart from this feature, I don’t think BookRestorer is worth it for the average graduate student. Its other corrective capabilities are limited; for example, the black and white (B/W) filter is unpredictable and not customizable, and undoing one feature will undo everything. Lastly, and probably most importantly, I stumbled across a librarian forum that mentioned a price range of $5,000-$6,000.
Pros: Designed to manipulate and correct scanned text; processes batch commands
Cons: Very limited trial version; interface not user-friendly; Windows only; costs more than your car
Unsatisfied with BookRestorer, I moved on to a trial of Adobe Lightroom, which came with great reviews from some of my colleagues. The program feels like a mix between Aperture, iPhoto and an idiot-proof Photoshop with easy options and good results. One of Lightroom’s brilliant features is that the edits you make are not saved to your original photo file, which remains intact. You have to export your edited images as new files. Though this can eat through your hard drive space, it might save you from making a really horrible, terrible mistake. This “ghost processing” also means that your images stay in their original locations and are not regrouped and hidden in a single, impenetrable place (like iPhoto’s library). The program is also a pretty good deal ($99 for students and educators).
Most outstanding about this program is the incredible ability to apply settings in a batch to all your files. This means if you tweak the contrast, brightness, clarity, boldness of blacks, etc. etc. etc. on one page/photo, you can select your other pages, hit the “Sync” button and the program will match these levels throughout the batch. Additionally, Lightroom’s controls for cropping and distortion correction are also fairly easy to use. As you play with sliding scales for features like distortion (which balloons in or out) and vertical, horizontal and rotational transformations, Lightroom places an opaque grid over your image to help with alignment.
In the end, I can only mention three negatives: it’s really cumbersome on netbooks (though great on the Mac), there’s no dedicated panel or set of controls for images with text, and I’m still left with JPG files…no PDF.
Pros: User-friendly interface; batch processing; affordable educational license; original files preserved while editing
Cons: Runs slowly on netbooks; no text-specific adjustment settings; no PDF output
ABBYY FineReader Professional
Another friend/colleague turned me on to ABBYY FineReader, an application also popular in scanning/DIY forums. The makers bill FineReader as an OCR program (Optical Character Recognition: a process that turns the text in your PDF or image into selectable text that you can copy and paste into other text editing programs), but I’ve found that the program is great for editing and manipulating images as well.
Upon opening the program, a menu gives you several choices of “tasks,” through which you can convert files into a searchable PDF. Before you click, however, you must select a language (actually, up to three at a time from a long list of commonly read and spoken languages) for text recognition. As it processes, the program cleans up the images and rotates them into correct position. On my netbook, this process took around an hour for 60-70 pages and on my MacBook Pro (running Windows 7 via VMWare Fusion) around 35-40 minutes for 90 pages.
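For a rough sense of what those timings mean for a full research trip, here is a back-of-the-envelope sketch in Python. It assumes one image equals one page, takes the midpoints of the ranges quoted above (65 pages per hour on the netbook, 90 pages in 37.5 minutes on the MacBook Pro), and assumes processing time scales linearly with page count. None of that is guaranteed, so treat the result as an estimate only.

```python
def minutes_for(pages, sample_pages, sample_minutes):
    """Scale an observed timing linearly to a larger page count."""
    return pages * sample_minutes / sample_pages

TOTAL_PAGES = 3000  # assumption: one image per page

# Midpoints of the ranges reported in the post (assumed, not measured here).
netbook_minutes = minutes_for(TOTAL_PAGES, sample_pages=65, sample_minutes=60)
macbook_minutes = minutes_for(TOTAL_PAGES, sample_pages=90, sample_minutes=37.5)

print(f"Netbook: about {netbook_minutes / 60:.1f} hours")
print(f"MacBook Pro: about {macbook_minutes / 60:.1f} hours")
```

Either way, the OCR pass is something to leave running in the background rather than sit through.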
After processing, you have a chance to edit the pages before exporting to PDF. Clicking the “edit image” button will allow you to manipulate each page individually or, like Lightroom, in a batch. My favorite feature is the “correct trapezium distortion” option, which allows you to drag a re-sizable box (like a crop box) around the portions of the page you’d like to keep. FineReader then analyzes the angle of the text and page and, assuming that you took the photo at a slight angle that exaggerated the bottom of the page and narrowed the top (like taking a picture of a tall building while standing at its base), magically corrects everything. While some of the text might occasionally be a bit “off” in appearance (though still readable), I think on average it outperforms the manual controls found in Lightroom. You can also opt to “remove noise” from a picture taken by a shaky hand and “straighten text lines,” my second favorite feature of the program.
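For the curious, the general idea behind this kind of correction can be sketched in a few lines of Python. This is not FineReader’s actual algorithm, which isn’t documented here; it is the standard four-point perspective mapping, which finds the transformation that sends the four corners of the skewed page in the photo to the corners of an upright rectangle. The corner coordinates below are made-up example values, and the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def perspective_coeffs(src, dst):
    """Eight coefficients mapping each (x, y) in src to the matching (u, v) in dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)

def warp_point(coeffs, x, y):
    """Apply the perspective mapping to a single point."""
    a, b, c, d, e, f, g, h = coeffs
    w = g * x + h * y + 1
    return ((a * x + b * y + c) / w, (d * x + e * y + f) / w)

# Example values only: a page photographed at an angle, narrow at the top
# and wide at the bottom, listed clockwise from the top-left corner.
photo_corners = [(120, 80), (880, 80), (1000, 1400), (0, 1400)]
page_corners = [(0, 0), (1000, 0), (1000, 1400), (0, 1400)]

coeffs = perspective_coeffs(photo_corners, page_corners)
for corner in photo_corners:
    print(corner, "->", warp_point(coeffs, *corner))
```

This sketch only maps coordinates. A real image editor applies the same mapping to every pixel and interpolates between them, which is why straightened text can occasionally look a little soft.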
The downside to this program is that there is no way to adjust the brightness, contrast, and other general features of each page’s appearance. For example, I find that applying a B/W filter makes the pages more readable and will be better than color if I ever have to print the document. Another negative is its $199 price tag (no student discount), though the demo version is available for 15 days and allows 50 pages per batch process.
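If you are working within the demo’s 50-page limit, it helps to split a folder of images into groups ahead of time. The sketch below shows one way to do that in Python; the file names are hypothetical placeholders, and the batch size of 50 simply mirrors the demo limit mentioned above.

```python
def batches(items, size=50):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical file names standing in for one archive box of photos.
names = [f"IMG_{n:04d}.jpg" for n in range(1, 121)]

groups = batches(names, size=50)
for number, group in enumerate(groups, start=1):
    print(f"Batch {number}: {group[0]} to {group[-1]} ({len(group)} files)")
```

In practice you would build the list from a real folder (for example with `sorted(pathlib.Path(folder).glob("*.jpg"))`) so that the pages stay in shooting order.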
Pros: Converts files with text in images (JPGs, TIFFs, PDFs, etc.) to searchable PDFs seamlessly; specific controls for straightening, flattening, and correcting the angles of text; batch processing
Cons: Windows only (Mac version does not have image editing capabilities, just OCR conversion); no controls for adjusting color/contrast; pricey; needs to be used in tandem with a photo editor for more advanced adjustments
Here’s my tentative two-program, three-step process: FineReader, Lightroom, FineReader. 1.) I import my files to FineReader and occupy myself for 45 minutes while it processes. Then I edit each image individually for shape/position, then batch edit to straighten text lines and deskew. 2.) Next, I export the image files to a new folder and import this folder into Lightroom. Here I apply a customized B/W filter in a batch process. I check the images out, make sure everything is readable, then export back to the folder I previously created. 3.) Last, I reimport to FineReader and save as a PDF. I open the PDF in Preview to check everything out again and test the search capability. If I’m satisfied with the results, I ditch the exported image files (I keep all my originals on an external drive) and I’m left with a PDF file about 50MB in size for 100 pages. If you’re a Zotero-ite but are worried about eating up your free space, the file sizes can be reduced with other programs, but that’s for another post.
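To see how quickly those PDFs add up, here is a small Python estimate based on the figure above (about 50MB per 100 pages). It assumes every page compresses about the same and that one image equals one page, so the total is only a ballpark.

```python
def estimate_mb(pages, mb_per_100_pages=50):
    """Estimate total PDF size from the rough 50MB-per-100-pages figure."""
    return pages * mb_per_100_pages / 100

total_mb = estimate_mb(3000)  # assumption: 3,000 images, one page each
print(f"About {total_mb:.0f} MB, or {total_mb / 1000:.1f} GB, of PDFs")
```

Comparing that total against whatever storage allowance you have is a quick way to decide whether shrinking the files is worth the extra step.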
Do you have a method for digitizing archival work? Describe your process in the comments!
[Image by Flickr user Chris Reichenecker and used under the Creative Commons license]