# PuritanHD Question: Software to split pages and recognise words?



## Moireach (May 24, 2012)

I'm asking this question in relation to scanned books in the Puritan Hard Drive. Does anyone know of good software to split scanned book pages that show 2 pages in 1? And also OCR software to recognise text?

The latter is easy to find; really it's just a question of which is best value. The former is very difficult to find.

Thanks


----------



## VictorBravo (May 24, 2012)

I take it they are pdf scans. If so, most pdf editing tools can crop the pdf image and also perform OCR. The most venerable tool is Adobe Acrobat. Omnipage Pro can do it, and if you search for pdf tools that crop, you may be able to find something less expensive.

Omnipage has OCR, as do many other pdf editors.

But it would be tedious. First you crop one half of the page and save it, then you go back to your original, crop the second half, and save it as a different file, page by page. Then (if using Omnipage or Acrobat) you can combine the files into one pdf.
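For anyone scripting the crop rather than doing it by hand, the arithmetic behind the two halves is simple. The following is only a minimal sketch, assuming each scanned page is a spread with the gutter roughly down the middle; `Box`, `split_spread`, and the `gutter` parameter are hypothetical names for illustration, not part of Acrobat, Omnipage, or any other tool. It only computes the two crop rectangles; applying them to a real pdf would still need a PDF library or editor of your choice.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Rectangle in PDF points, origin at the bottom-left corner."""
    left: float
    bottom: float
    right: float
    top: float


def split_spread(box: Box, gutter: float = 0.0) -> Tuple[Box, Box]:
    """Return (left_page, right_page) crop boxes for a two-page spread.

    `gutter` is a hypothetical strip (in points) removed around the
    centre line, shared equally between the two halves.
    """
    if gutter < 0 or gutter >= box.right - box.left:
        raise ValueError("gutter must be non-negative and narrower than the page")
    middle = (box.left + box.right) / 2
    half_gutter = gutter / 2
    left_page = Box(box.left, box.bottom, middle - half_gutter, box.top)
    right_page = Box(middle + half_gutter, box.bottom, box.right, box.top)
    return left_page, right_page


# Example: a landscape scan measuring 792 x 612 points.
left, right = split_spread(Box(0, 0, 792, 612))
print(left)
print(right)
```

If the scans are crooked, the centre line will not match the real gutter on every page, so the results would still need a visual check.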

The other approach is to do an OCR of the entire book and copy/paste the results into a text editor, or, if you have some kind of pdf editor, save as a Word doc or rtf. Note well that OCR is far from perfect and you will get weird artifacts.

Finally, before doing any of this, check to make sure the copyright statement does not prohibit you from doing this on a large scale. That's a whole other issue, but fair use usually means you can do this for personal purposes, though you probably are not allowed to distribute the results. (Speaking here of American law. I have no idea what the rules are in Scotland.)

I'm quite surprised the Puritan Hard Drive is not at least OCRed. I had been led to believe that it was text-searchable, which requires OCRed pdfs.


----------



## Moireach (May 24, 2012)

VictorBravo said:


> I take it they are pdf scans. If so, most pdf editing tools can crop the pdf image and also perform OCR. The most venerable tool is Adobe Acrobat. Omnipage Pro can do it, and if you search for pdf tools that crop, you may be able to find something less expensive.
> 
> Omnipage has OCR, as do many other pdf editors.
> 
> ...




Very helpful, thanks very much!

You'd think it would be OCRed. But to be fair it would take a lot of staff and a lot of time. There are so many books. But the majority of books on it are just scanned and that's it. So many are squint too. 

The copy and paste idea is one I'd have never thought of! Hopefully the OCR is accurate enough for it to work reasonably well!

There's no way I'm going to go through that much effort to split the PDFs into separate pages; copying and pasting might bypass the whole conundrum!

Thanks again.


----------



## wraezor (May 24, 2012)

Two quick assumptions:
1) You can search the Puritan HD for keywords, though you still see the scanned image, much like Google Books. http://www.affiliates.puritan-hard-...fic Words and Phrases in Searchable Books.pdf
2) This is on a small-scale project (perhaps one or a couple books), since if it were easy and/or cheap to do large-scale, it would've been done by now.

Dealing with PDFs like the Puritan HD (and the CD sets that preceded it), you are dealing with what are actually full-page images more so than traditional PDFs (which are most often primarily text-based). As such, you would have much better success (and more features/lower cost) converting it and dealing with it as images (PDF to TIFF, etc.). That way you can massage the pages/images before printing, OCRing, etc. using tools that are designed for image handling. PDF-centric tools generally are not.

In the work I've done processing and OCRing documents like this, command line tools were my friend. I would convert the document from PDF to individual TIFF files (using ImageMagick). I could then resize, crop/split images/pages, and adjust contrast (to improve OCR) to my liking. If it's not possible to automate, I would use something like XnView to quickly look at and edit each page image. Once you're ready to OCR, the de facto standard in OCRing antiquated books is and has been ABBYY FineReader for 10+ years.
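As a rough illustration of that workflow, here is a minimal sketch that only builds the ImageMagick command lines and prints them; it does not run anything. The file names, the 300 dpi default, and the helper names `rasterize_cmd` and `split_cmd` are assumptions for the example, not the exact commands used on the SWRB material. It also assumes the ImageMagick 6 `convert` program is on the path (ImageMagick 7 names it `magick`) and that every page is a spread that splits cleanly in half.

```python
import shlex
from typing import List


def rasterize_cmd(pdf: str, out_pattern: str, dpi: int = 300) -> List[str]:
    """Command to render each PDF page to its own image file.

    `-density` sets the rendering resolution; a `%03d` in the output
    name makes ImageMagick write one numbered file per page.
    """
    return ["convert", "-density", str(dpi), pdf, out_pattern]


def split_cmd(image: str, out_stem: str) -> List[str]:
    """Command to cut one spread image into left and right halves.

    `-crop 50%x100%` tiles the image into two half-width pieces and
    `+repage` resets the page offsets left over from the crop.
    """
    return ["convert", image, "-crop", "50%x100%", "+repage", out_stem + "-%d.tif"]


def show(cmd: List[str]) -> str:
    """Render a command list as a shell-safe string for copy and paste."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical file names, for illustration only.
print(show(rasterize_cmd("book.pdf", "page-%03d.tif")))
print(show(split_cmd("page-000.tif", "page-000-half")))
```

Printing the commands first makes it easy to try them on one or two pages before looping over a whole book, for example by passing each list to `subprocess.run`.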

If you want to talk to the experts about this stuff, get on the Project Gutenberg / PGDP (Distributed Proofreaders) forums. They deal with these issues on a daily basis. If there was an easy way to do it, they would've found it. As it is, they have 4, 5, 6 sets of eyes on each page before it gets onto Project Gutenberg.

Unless extensive rigor is put into the quality of an OCR, I, for one, appreciate the retention of scan images that SWRB does. We don't need the same level of critical eye that we do to Biblical manuscripts, but at the same time, some people who have done reckless OCRing have done a great disservice to the original works. Having a scan of the original book lends much to its credibility.

Some tools I like:
http://www.pdfforge.org
Pdftk - The PDF Toolkit
http://www.imagemagick.org
XnView Software - Free graphic and photo viewer, converter, organizer
ABBYY FineReader - OCR software for text recognition (the only non-free one on the list)

(Full disclosure: Former SWRB employee, and separately former book publisher)


----------

