My PDF Workflow
#378 Henry, Friday, 22 February 2019 11:25 AM (Category: Linux)

I have a workflow with all the paper that comes into my house - I scan it, I file it, I shred it (mostly). I end up with PDFs.

Might seem easy, but there are a lot of steps in the puzzle, a lot of tools, a lot of gadgets.

A few years ago, I read a book called The Ultimate iPad: Your Digital Life At Your Fingertips by James Floyd Kelly. This book got me started.

Scanning

The book recommended a little scanner called the Doxie Go. The link takes you to the Doxie Go SE, as the Doxie Go that I have is not the latest product. Mine cost about $140. It's basically a small portable scanner, runs on rechargeable batteries, scans pages one by one, stores a lot of scans in internal memory, and you can extract those scans in multiple ways. You can use the Doxie software on the iPad or the Mac. With the Doxie software on the iPad, you can pull all the scans via wifi, or with the Mac you can pull them by USB or by wifi. Once you get them on the device, you can correct them, rotate them, staple them together, turn them into PDFs and save them. I save them to Dropbox.

This worked really well for a while. I did it exclusively on the iPad for a while, but eventually ran into problems. The iPad software got updated a few times, and then it started glitching. It would save the original unrotated images into the PDFs, or not save, or other small things. I reported the issues to Doxie at first, and they would be corrected, but it seemed after a while that the iPad software got flaky. I switched to using the Doxie software on the Mac. This was easier, did not get updated as often (what a relief) and worked much better. In addition, I found that the PDFs created with their Mac software were much smaller than those created by the iPad software. So now I use the Mac software only.

There were some other issues. I never got the hang of feeding the pages perfectly straight every single time. Often, they were tilted slightly. And it was page by single page. It got to be a bit of a chore, and I would put off scanning for weeks at a time.

Then I saw a review of a scanner on Cool Tools - the Fujitsu ScanSnap iX500. Turns out this is the second scanner recommended in the Ultimate iPad book. But it costs $419. And there's a newer model, the Fujitsu ScanSnap iX1500. I read a lot of the reviews for the iX1500 and most were people lamenting that they upgraded to the newer model and saying they should have stuck with the older model. I read enough, so I bought the older model from Amazon. The old and the new model were both available for exactly the same price - $419. I bought it, and this tiny little box arrived. I was shocked how small it was.

I installed it and got it working and it's a wonderful neat little scanner. You can put a packet of paper in the hopper, hit the button, and the whole lot will be scanned, both sides scanned simultaneously, blank pages automatically omitted, pages automatically rotated if needed, all merged into a single PDF and saved to Dropbox with one or two clicks. It is amazing. The first time I scanned a batch of pages in and watched the action, I said "Multiple cursewords, look at that". I was shocked, it was such a fast, accurate, beautiful little scanner. It works so well. I totally love it. I got through 18 inches of my immigration documents in about 15 minutes. That's scanning, saving to Dropbox, and filing them. As a comparison, I could do a five page document with the Doxie, scanned, retrieved, combined, saved to Dropbox and then filed in about 4 minutes. Doing the same with the Fujitsu took about 7 seconds. It's a great little scanner. But it's not portable, and does not run off batteries.

I continue to use both scanners. The Doxie is great for flimsy receipts, scans of delicate items, and weird stuff. The Fujitsu is great for regular pages of documents and bills.

Transferring PDFs

Both the Doxie and the Fujitsu let me save to Dropbox. I have a free Dropbox account with about 4 GB of storage. I don't permanently store data there. I use Dropbox like a portal. I put data in at one end, and pull it out at the other. It's perfect for transferring data between systems. I put data in on my iPad, and pull it out on my Linux desktop. I put files in on my Linux desktop and pull them out on my Macs, my iPad, my iPhone, other Linux desktops. It's the most awesome tool. I feel guilty for not paying them any money. I depend on it so much for data transfer that I should pay them something. I store a few small things in Dropbox permanently, but use it mostly as a portal. I could continue as I am now and use it for free, but how long will that continue? Or I could go to Dropbox Plus for $8.25 a month or Professional for $16.58 a month, or go to Dropbox Business Standard for $12.50 a month. All have their advantages in the amount of space you get, and extra features. Yet I don't want to store my data there, and I don't want the extra features, so I don't want these plans. If they had something that let me continue doing what I do, using it as a portal, and charged $10 a year, or even $20 a year, then I would happily pay that. Until that comes along, I'll continue to use it for free.

So generally, I will scan documents and create PDFs on my iPad or Mac, push them into Dropbox, and pull them out on my Linux desktop.

Saving the data

My primary workspace is my Linux desktop. Command line work, storing and filing data in a file hierarchy. Old school. But the data is readily available, everything is consistently named so it's sorted, it's easy to use the regular Unix tools like ls, find and grep.

Once the data is transferred to my Linux desktop, it gets stored on the local hard disk. Occasionally I do some editing or manipulation, but mostly the PDFs are already complete and just need filing.
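
Here's a rough sketch of how that filing step could be scripted. It assumes file names start with a YYYYMMDD date, as in the examples further down. The function names and the one-directory-per-year layout are made up for illustration, not a description of any particular setup.

```shell
#!/bin/sh
# Sketch only: dest_dir_for and file_pdf are hypothetical helper names,
# and the <archive>/<year>/ layout is an assumption.

# Print the destination directory for a file named YYYYMMDD_something.pdf.
# Files without a four-digit prefix go to <archive>/unsorted.
dest_dir_for() {
    archive=$1
    name=$(basename "$2")
    year=$(printf '%s' "$name" | cut -c1-4)
    case $year in
        [0-9][0-9][0-9][0-9]) printf '%s/%s\n' "$archive" "$year" ;;
        *) printf '%s/unsorted\n' "$archive" ;;
    esac
}

# Move one PDF into its directory, creating the directory if needed.
# mv -i asks before overwriting an existing file of the same name.
file_pdf() {
    dir=$(dest_dir_for "$1" "$2")
    mkdir -p "$dir" && mv -i "$2" "$dir/"
}
```

With those defined, something like `for f in ~/Dropbox/*.pdf; do file_pdf ~/documents "$f"; done` would sweep everything out of Dropbox in one go (both paths are placeholders).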

Each night, my local data is automatically backed up to my NAS. The data on my NAS is automatically backed up to an external hard drive. Once a week, the data on my NAS is backed up to a portable external hard drive, which is stored offsite. And there are a few other backups that occur as well. At any given time, I have it on my desktop, on my NAS, and on about four external hard drives in various places. I have a phobia about losing data.

And I don't shred the original documents until the scans are filed and backed up. Then I shred. That's in case I have to go back to the original source and do it all over again.

Editing PDFs

Sometimes I need to edit the PDFs. This is where things get tricky. Some things I can do on my Linux desktop, and this is where it's easiest. In general, I want a command line tool, not a GUI tool.

Joining PDFs

Often I end up with an invoice that I scanned with the Fujitsu scanner because it's so fast and easy, and a credit card receipt that I scanned with the Doxie because the Doxie scans these little things so much easier, and I don't want two PDFs, I want one that contains both the invoice and the receipt. Say for example, I have 20190221_rutledge_henry_invoice.pdf and 20190221_rutledge_henry_receipt.pdf. There are two Linux tools I can use to combine these into one PDF - pdfconcat and pdfunite.

pdfconcat works nicely, gives you lots of data about the process.

$ pdfconcat -o 20190121_rutledge_henry.pdf 20190221_rutledge_henry_invoice.pdf 20190221_rutledge_henry_receipt.pdf
Input PDF (20190221_rutledge_henry_invoice.pdf): filesize=598630, xrefc=16, xreftc=1, catalogofs=597881, #pages=1, is_binary=1
Input PDF (20190221_rutledge_henry_receipt.pdf): filesize=119191, xrefc=16, xreftc=1, catalogofs=118460, #pages=1, is_binary=1
Output PDF (20190121_rutledge_henry.pdf): filesize=717549, xrefc=32, subfiles=2, #pages=2, is_binary=1

The end result is usually a little smaller than the input PDFs put together (717549 bytes here, against 598630 + 119191 = 717821). It's a very tight, clean program.

pdfunite is almost the same, but has no output if things work well.

$ pdfunite 20190221_rutledge_henry_invoice.pdf 20190221_rutledge_henry_receipt.pdf 20190221_rutledge_henry.pdf

Both work about the same, both are fast and smooth and tight, and I use them interchangeably.
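
When there are several invoice/receipt pairs to join, a loop saves typing. This is a sketch, assuming the _invoice.pdf and _receipt.pdf naming used above; merged_name and join_pair are made-up helper names.

```shell
#!/bin/sh
# Sketch only: merged_name and join_pair are hypothetical helpers.

# Derive the merged file name from an invoice name:
# 20190221_rutledge_henry_invoice.pdf -> 20190221_rutledge_henry.pdf
merged_name() {
    printf '%s\n' "${1%_invoice.pdf}.pdf"
}

# Join an invoice with its matching receipt, if the receipt exists.
# pdfunite takes the input files first and the output file last.
join_pair() {
    invoice=$1
    receipt="${invoice%_invoice.pdf}_receipt.pdf"
    if [ -f "$receipt" ]; then
        pdfunite "$invoice" "$receipt" "$(merged_name "$invoice")"
    else
        echo "no receipt for $invoice" >&2
        return 1
    fi
}
```

Then `for f in *_invoice.pdf; do join_pair "$f"; done` joins every pair in the current directory and leaves the originals alone.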

Removing passwords

Occasionally I get financial documents that are password protected. I generally remove the password before filing them. To do this, I use an iPad app from Readdle called PDF Expert. It cost me $9.99 for the basic PDF viewer, and then another $9.99 in-app purchase to get the ability to edit the PDF. This tool will let me view the PDF after entering the password, and then let me change the password or delete the password. It works, it's neat, it was worth it.

I looked at buying Readdle's PDF Expert For Mac, but demurred at the $79 price tag. I've been spoiled by the reduced prices of iOS apps.
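
For what it's worth, the same job can also be done on the Linux command line with qpdf, which is not one of the tools described in this post. A sketch, assuming qpdf is installed and the password is known; unlocked_name and remove_password are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: assumes qpdf is installed; helper names are made up.

# statement.pdf -> statement_unlocked.pdf
unlocked_name() {
    printf '%s\n' "${1%.pdf}_unlocked.pdf"
}

# Write a password-free copy next to the original, which is left untouched.
# Usage: remove_password PASSWORD FILE.pdf
remove_password() {
    qpdf --password="$1" --decrypt "$2" "$(unlocked_name "$2")"
}
```

Bear in mind that a password typed on the command line ends up in the shell history and is visible in the process list while qpdf runs.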

Extracting images

Occasionally I have a PDF that I have scanned and I want to get some of the images out of it so I can do things with them. I might have scanned a whole Christmas card/letter combo which included some photos, and I want to store the photos elsewhere. I use the command line tool pdfimages to extract the scanned images.

$ pdfimages 20190121_rutledge_henry.pdf image
$ ls
image-000.ppm
image-001.ppm

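By default, pdfimages writes each image out as a PPM file (or PBM or PGM for monochrome and grayscale images), regardless of how it was stored in the PDF, which is why I ended up with ppm files in this example. With the -j option, images stored as JPEG are written out as jpg files instead. I can specify the first and last PDF page to extract images from with -f and -l. I can convert the ppm files to another format like jpg with the ImageMagick tools like "convert".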
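
Converting a whole batch of extracted files is easier in a loop. A sketch, assuming ImageMagick's convert is installed; jpg_name and ppm_to_jpg are made-up names.

```shell
#!/bin/sh
# Sketch only: jpg_name and ppm_to_jpg are hypothetical helpers.

# image-000.ppm -> image-000.jpg
jpg_name() {
    printf '%s\n' "${1%.ppm}.jpg"
}

# Convert every PPM file given as an argument to JPEG, keeping the
# originals. Stops at the first file that fails to convert.
ppm_to_jpg() {
    for ppm in "$@"; do
        convert "$ppm" "$(jpg_name "$ppm")" || return 1
    done
}
```

Running `ppm_to_jpg image-*.ppm` after pdfimages gives one jpg per extracted image.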

Splitting a PDF up

Sometimes I make mistakes and end up with a PDF that is missing a page, or has blank pages that need to be removed, or a page upside-down, or any other sort of problem. There are GUI editors on other systems that let me fix these. But I still prefer to do the editing under Linux. The first step is to break the PDF into individual pages. I use pdfseparate for this.

$ pdfseparate 20190221_rutledge_henry.pdf rhpage_%d.pdf
$ ls
rhpage_1.pdf
rhpage_2.pdf

Now I can put them back together using pdfconcat or pdfunite, but with extras. If I have scanned an extra page that needs to be included, I can do it like this:

$ pdfconcat -o new.pdf rhpage_1.pdf extrapage.pdf rhpage_2.pdf

and I will end up with a three page PDF with the extra page in the middle.

Or I could replace page 2 with the extra page like this:

$ pdfconcat -o new.pdf rhpage_1.pdf extrapage.pdf

If a page is upside down and needs to be fixed, there are a number of ways I can modify it. The simplest way is to split the PDF into individual PDF pages, convert the offending page to an image, edit the image, convert the edited image back to PDF, then create a single PDF from the individual PDFs. For example:

$ pdfseparate 20190221_rutledge_henry.pdf rhpage_%d.pdf
$ pdfimages rhpage_2.pdf page2image
$ mogrify -rotate 180 page2image-000.ppm
$ convert page2image-000.ppm newpage2.pdf
$ pdfconcat -o merged.pdf rhpage_1.pdf newpage2.pdf
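
Another way, shown here only as a sketch, is to rotate the page inside the PDF without going through an image at all. This assumes qpdf is installed, which is a tool this post does not otherwise use; rotate_arg and rotate_pages are made-up helper names.

```shell
#!/bin/sh
# Sketch only: assumes qpdf is installed; helper names are made up.

# Build qpdf's rotate option: rotate_arg 180 2 -> --rotate=+180:2
# ("+" means rotate relative to the page's current orientation).
rotate_arg() {
    printf '%s\n' "--rotate=+$1:$2"
}

# Rotate the given pages of IN.pdf and write the result to OUT.pdf.
# Usage: rotate_pages ANGLE PAGES IN.pdf OUT.pdf
rotate_pages() {
    qpdf "$(rotate_arg "$1" "$2")" "$3" "$4"
}
```

For the example above, `rotate_pages 180 2 20190221_rutledge_henry.pdf fixed.pdf` would turn page 2 the right way up and leave page 1 as it is.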

There are lots of ways of doing this on the command line. I'm happy there, so I do this and get the precision I want. You might be happier paying money for a GUI tool that makes it easier.

Constructing a PDF

Viewing PDFs

On my Linux desktop, I used to use the Adobe acroread tool to view and print PDFs. But I recently moved all my desktops and servers to 64 bit Slackware, and acroread no longer works. I had to find alternatives.

xpdf is old and venerable, but it stays up to date and it works with almost all PDFs.

gv is part of the GNU suite. It works with a lot of PDFs and has a beautiful old-world feel to it. I don't like the navigation though.

okular is a modern part of KDE, and it is quite nice. It handles a lot more file types than just PDF. It doesn't display some of my bank's PDFs though.

mupdf is a very chopped-down PDF viewer: very fast, no menu bars or other frills, just the PDF, and it uses Vi-style keystrokes for navigation. I like it. It also handles epubs, so when I want rapid Vi-style navigation, I use mupdf for epub too. I use it on the command line a lot, and it throws off a lot of errors and warnings, so I've wrapped it in a script that sends all that stuff to /dev/null.
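
A wrapper like that amounts to very little. This is a sketch of the idea rather than the actual script; quiet_view and the VIEWER variable are names made up for the example.

```shell
#!/bin/sh
# Sketch only: quiet_view and VIEWER are hypothetical names.

# Run the viewer (mupdf unless VIEWER says otherwise) with both its
# normal output and its errors/warnings thrown away.
quiet_view() {
    "${VIEWER:-mupdf}" "$@" >/dev/null 2>&1
}
```

Saved as a script with `quiet_view "$@"` as its last line, it can be called in place of mupdf itself.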

In the end, I use mupdf for quick and dirty viewing of PDFs, and Okular for printing PDFs. So mupdf is in my Midnight Commander mc.ext extension handler, and in my .mailcap file, and I will manually start Okular on the command line. I'm always on the command line anyway, so it's no big deal.

Shredding

I used to keep all documents after I scanned them. No more. I keep some that I need to keep, like tax and immigration documents. But everything else gets shredded once scanned and saved.

We are on our fourth shredder. I have burnt out three previous shredders. We used to get good shredders from local stores like OfficeMax, but after the last one, we just gave in and ordered an Amazon Basics Shredder for about $85 and it's been excellent. It does a serious cross-cut shred on everything.
