Converting double-page scans from a book into a useful PDF ebook

This is the situation:
  • you got your hands on some rare book and for your personal use you want to create a nice ebook from it
  • you scan the book into (multiple) PDF files with the two facing pages on one page in the PDF files
Here, we will rely solely on open source tools to get it done. No commercial closed-source tools.

I'm using Kubuntu 12.04 LTS, so the versions of the programs I used are

  • gimp 2.6.12-1ubuntu1
  • ghostscript 9.05~dfsg-0ubuntu4
  • pdftk 1.44-4build1

Obtain PDFs with only left or right pages

With the two facing pages together on one PDF page, we need a way to crop this PDF page so that the facing pages are separated again (a process we will refer to as cropping). We do this using a tool called "Ghostscript". First of all, note that in PostScript dimensions are given in points, with 1 inch = 72 points. To specify the so-called CropBox we give a vector of four values in the order [left bottom right top]. The origin of the PostScript coordinate system is the lower left corner of the page. To find the right values I opened the scanned PDF file in gimp (a well-established open source image processing program) and switched units to points (pt). Moving the cursor around, I wrote down the following values:
              left   bottom   right   top
left page       32      676     592     0
right page     544      676    1104     0
Note that there is some overlap where the binding of the scanned book used to be. So the left pages extend somewhat into the right page and vice-versa. This makes the cropping process more robust against small positioning errors of the book on the scanner.

In gimp the origin of the coordinate system is the upper left corner, so to be useful for PostScript the vertical values (bottom and top) have to be subtracted from the page height of 707 pt; the horizontal values (left and right) stay as they are. This results in the following table:
              left   bottom   right   top
left page       32       31     592   707
right page     544       31    1104   707
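
The subtraction can also be left to the shell. The following is only a small sketch: the function name gimp_to_cropbox is made up for this illustration (it is not part of gimp or Ghostscript), and it assumes that all values are whole numbers of points.

```shell
# Hypothetical helper: convert a rectangle read off in gimp (origin in the
# upper left corner) into a PostScript CropBox (origin in the lower left corner).
# Arguments: page_height left bottom right top, all in whole points.
gimp_to_cropbox() {
    page_height=$1; left=$2; bottom=$3; right=$4; top=$5
    # Only the vertical values are flipped; left and right are identical
    # in both coordinate systems.
    echo "$left $((page_height - bottom)) $right $((page_height - top))"
}

gimp_to_cropbox 707 32 676 592 0      # left page:  32 31 592 707
gimp_to_cropbox 707 544 676 1104 0    # right page: 544 31 1104 707
```

The printed four numbers are what goes between the inner square brackets of the CropBox.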
Now let's fire up Ghostscript and create two PDFs -- one for the left pages and one for the right pages

gs -o left.pdf -sDEVICE=pdfwrite -c "[/CropBox [32 31 592 707] /PAGES pdfmark" -f thescan.pdf
gs -o right.pdf -sDEVICE=pdfwrite -c "[/CropBox [544 31 1104 707] /PAGES pdfmark" -f thescan.pdf
The above always worked for me, but in some rare cases you might need other solutions. Check out this source for more information.

Assembly of the final PDF

The exact procedure depends on whether you have just one PDF with scanned pages or multiple PDFs with scans.

Only one scan PDF

For successful duplex-printing of your PDF the first page has to be on the right (in the finished "book"). So we have to remove the first page in the PDF containing only the left pages. The last page in the PDF containing the right pages has to be removed to avoid having blank left pages in the final book.

When you think about it for a minute or so you will realize that no pages are lost: Consider a typical book. On the very first page of your scanned PDF the left page is always blank, only the right one is useful. And the last filled page will always be a left page.
[Figure: "what we have" versus "what we want"]
Assuming that our PDF has 21 pages in total, the desired operation is carried out with the following two commands:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=left2.pdf -dFirstPage=2 left.pdf
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=right2.pdf -dLastPage=20 right.pdf
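
If you do not know the page count by heart, pdftk can report it: its dump_data operation prints a line starting with "NumberOfPages:". The sketch below extracts that number; the function name page_count is an assumption made for this example.

```shell
# Hypothetical helper: read the report of "pdftk FILE dump_data" on stdin
# and print the value of its NumberOfPages line.
page_count() {
    awk '/^NumberOfPages:/ { print $2; exit }'
}

# Usage sketch (needs pdftk and the right.pdf created earlier):
#   total=$(pdftk right.pdf dump_data | page_count)
#   gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=right2.pdf -dLastPage=$((total - 1)) right.pdf
```

With 21 pages in right.pdf, the expression $((total - 1)) yields the 20 used above.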

What still has to be done is to combine (interleave) these two files, with the first page coming from the right PDF. I did this as follows:

mkdir x
cd x
pdftk ../left2.pdf burst output %04d_B.pdf
pdftk ../right2.pdf burst output %04d_A.pdf
pdftk *.pdf cat output out.pdf
This requires some explanation. The first two lines create an empty scratch directory for the following commands. Any existing PDF files in the directory would interfere with what we are going to do! The first pdftk command writes all pages of left2.pdf into files named 0001_B.pdf, 0002_B.pdf, ..., 0020_B.pdf (here left2.pdf has 20 pages in total). The second pdftk command does the same for right2.pdf, using the suffix A instead of B. The last pdftk command then compiles the final PDF file, ready for printing. The trick is that the shell expands *.pdf in sorted order: 0001_A.pdf (first page from the right PDF), 0001_B.pdf (first page from the left PDF), 0002_A.pdf (second page from the right PDF), 0002_B.pdf (second page from the left PDF), ...
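
To convince yourself of the sort order without touching any real PDF, you can try it with empty placeholder files. This is just a dry run; the function name demo_interleave_order is made up for this illustration, and it assumes that mktemp is available.

```shell
# Dry run of the interleaving trick with empty placeholder files (no pdftk needed).
demo_interleave_order() {
    tmp=$(mktemp -d) || return 1
    (
        cd "$tmp" || exit 1
        # Names as produced by the burst commands: _A = right pages, _B = left pages.
        touch 0001_B.pdf 0002_B.pdf 0001_A.pdf 0002_A.pdf
        # Pin the collation so that the order of the glob expansion is predictable.
        LC_ALL=C; export LC_ALL
        echo *.pdf
    )
    rm -rf "$tmp"
}

demo_interleave_order    # prints: 0001_A.pdf 0001_B.pdf 0002_A.pdf 0002_B.pdf
```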

The file out.pdf should be what you want, ready for duplex-printing.

Multiple scan PDFs

Begin by cropping the individual PDF files. Usually you will need different CropBox settings for each of them. This results in multiple PDFs containing the left and right pages of the different input files. I assume you called them left1.pdf, right1.pdf, left2.pdf, right2.pdf, ... Note that the suffixes are swapped compared to the single-PDF case: in the commands below the left pages get the A and the right pages the B, because no page is removed before bursting, so within each scan the left page has to come first. After cropping the six scan PDF files, my working directory looks like:
cl@clnb:/tmp/y$ ls -l
total 198424
-rw-rw-r-- 1 cl cl 13306002 Jun 24 14:03 left1.pdf
-rw-rw-r-- 1 cl cl  8775308 Jun 24 15:59 left2.pdf
-rw-rw-r-- 1 cl cl 20898172 Jun 24 16:03 left3.pdf
-rw-rw-r-- 1 cl cl 18837880 Jun 24 16:08 left4.pdf
-rw-rw-r-- 1 cl cl 14141272 Jun 24 16:12 left5.pdf
-rw-rw-r-- 1 cl cl 25616583 Jun 24 16:14 left6.pdf
-rw-rw-r-- 1 cl cl 13306004 Jun 24 14:04 right1.pdf
-rw-rw-r-- 1 cl cl  8775310 Jun 24 15:59 right2.pdf
-rw-rw-r-- 1 cl cl 20898174 Jun 24 16:03 right3.pdf
-rw-rw-r-- 1 cl cl 18837882 Jun 24 16:08 right4.pdf
-rw-rw-r-- 1 cl cl 14141274 Jun 24 16:11 right5.pdf
-rw-rw-r-- 1 cl cl 25616585 Jun 24 16:14 right6.pdf
This time we have to remove the first page of left1.pdf and the last page of right6.pdf without disturbing the order of the other pages. Plus we have to maintain the correct ordering of the pages from all the PDF files!
mkdir x
cd x
pdftk ../left1.pdf burst output 1_%04d_A.pdf
pdftk ../left2.pdf burst output 2_%04d_A.pdf
pdftk ../left3.pdf burst output 3_%04d_A.pdf
pdftk ../left4.pdf burst output 4_%04d_A.pdf
pdftk ../left5.pdf burst output 5_%04d_A.pdf
pdftk ../left6.pdf burst output 6_%04d_A.pdf
pdftk ../right1.pdf burst output 1_%04d_B.pdf
pdftk ../right2.pdf burst output 2_%04d_B.pdf
pdftk ../right3.pdf burst output 3_%04d_B.pdf
pdftk ../right4.pdf burst output 4_%04d_B.pdf
pdftk ../right5.pdf burst output 5_%04d_B.pdf
pdftk ../right6.pdf burst output 6_%04d_B.pdf
The prefixes 1_, 2_, ... make sure that the pages from different PDFs aren't mixed up. Of course, when you have more than 9 scan PDFs you will need prefixes of the type 01_, 02_, ...
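
With many scan PDFs, typing all burst commands by hand gets tedious. The following sketch prints them instead; the function name burst_commands is an assumption for this example, and it relies on the file naming left1.pdf, right1.pdf, ... used above. The prefix is zero-padded automatically once there are 10 or more scan PDFs.

```shell
# Hypothetical helper: print the pdftk burst commands for the scan PDFs 1..n.
# The commands are only printed, so they can be checked before they are run.
burst_commands() {
    n=$1
    width=${#n}          # number of digits of n: 1 for 6, 2 for 12, ...
    i=1
    while [ "$i" -le "$n" ]; do
        prefix=$(printf "%0${width}d" "$i")
        # %%04d puts a literal %04d into the output, which pdftk fills in itself.
        printf 'pdftk ../left%d.pdf burst output %s_%%04d_A.pdf\n' "$i" "$prefix"
        printf 'pdftk ../right%d.pdf burst output %s_%%04d_B.pdf\n' "$i" "$prefix"
        i=$((i + 1))
    done
}
```

Inside the scratch directory, burst_commands 6 prints the twelve commands shown above (left and right alternating, which does not change the resulting files), and burst_commands 6 | sh executes them.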

Delete the first left page (from the first scan PDF) and the last right page (from the last scan PDF):

rm 1_0001_A.pdf 6_0055_B.pdf
In your case the filename of the last page will be different; use ls to find it!
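
Instead of scanning the ls output by eye, the last right page can be picked out with a short pipeline. This is a sketch only; the function name last_right_page is made up, and it assumes the _B.pdf naming from the burst commands above.

```shell
# Hypothetical helper: print the name of the last right page (suffix _B)
# in the given directory (default: the current directory).
last_right_page() {
    ( cd "${1:-.}" && LC_ALL=C ls | grep '_B\.pdf$' | tail -n 1 )
}

# Usage sketch, inside the scratch directory:
#   rm "$(last_right_page)"
```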

What is left is the assembly of the final PDF file:

pdftk *.pdf cat output out.pdf