My PDF Workflow
#378 Henry, Friday, 22 February 2019 11:25 AM (Category: Linux)
(Tags: pdf scanning dropbox owncloud nextcloud readdle)

I have a workflow with all the paper that comes into my house - I scan it, I file it, I shred it (mostly). I end up with PDFs.

Might seem easy, but there are a lot of pieces in the puzzle: a lot of tools, a lot of gadgets.

A few years ago, I read a book called The Ultimate iPad: Your Digital Life At Your Fingertips by James Floyd Kelly. This book got me started.


The book recommended a little scanner called the Doxie Go. The current model is the Doxie Go SE; the Doxie Go that I have is no longer the latest product. Mine cost about $140. It's basically a small portable scanner that runs on rechargeable batteries, scans pages one by one, and stores a lot of scans in internal memory. You can extract those scans in multiple ways, using the Doxie software on the iPad or the Mac. With the Doxie software on the iPad, you can pull all the scans via wifi, and with the Mac you can pull them by USB or by wifi. Once you get them on the device, you can correct them, rotate them, staple them together, turn them into PDFs and save them. I save them to Dropbox.

Alternatively, I can mount the Doxie on my Linux box and extract the PDFs directly.

This worked really well for a while. I did it exclusively on the iPad at first, but eventually ran into problems. The iPad software got updated a few times, and then it started glitching. It would save the original unrotated images into the PDFs, or not save, or other small things. I reported the issues to Doxie at first, and they would be corrected, but it seemed after a while that the iPad software got flaky. I switched to using the Doxie software on the Mac. This was easier, did not get updated as often (what a relief) and worked much better. In addition, I found that the PDFs created with their Mac software were much smaller than those created by the iPad software. So now I use the Mac software only.

There were some other issues. I never got the hang of feeding the pages perfectly straight every single time. Often, they were tilted slightly. And it was page by single page. It got to be a bit of a chore, and I would put off scanning for weeks at a time.

Then I saw a review of a scanner on Cool Tools - the Fujitsu ScanSnap iX500. Turns out this is the second scanner recommended in the Ultimate iPad book. But it costs $419. And there's a newer model - the Fujitsu ScanSnap iX1500. I read a lot of the reviews for the iX1500 and most were people lamenting that they upgraded to the newer model and saying they should have stuck with the older model. I read enough, so I bought the older model from Amazon. The old and the new model were both available for exactly the same price - $419. I bought it, and this tiny little box arrived. I was shocked how small it was.

I installed it and got it working and it's a wonderful neat little scanner. You can put a packet of paper in the hopper, hit the button, and the whole lot will be scanned, both sides scanned simultaneously, blank pages automatically omitted, pages automatically rotated if needed, all merged into a single PDF and saved to Dropbox with one or two clicks. It is amazing. The first time I scanned a batch of pages in and watched the action, I said "Multiple cursewords, look at that". I was shocked, it was such a fast, accurate, beautiful little scanner. It works so well. I totally love it. I got through 18 inches of my immigration documents in about 15 minutes. That's scanning, saving to Dropbox, and filing them. As a comparison, I could do a five page document with the Doxie, scanned, retrieved, combined, saved to Dropbox and then filed in about 4 minutes. Doing the same with the Fujitsu took about 7 seconds. It's a great little scanner. But it's not portable, and does not run off batteries.

I continue to use both scanners. The Doxie is great for flimsy receipts, scans of delicate items, and weird stuff. The Fujitsu is great for regular pages of documents and bills.

Very occasionally, I use Anne's flatbed scanner which is part of her all-in-one printer. Books, passports, things that won't go through a page scanner. So I'm using three scanners.

Transferring PDFs

Both the Doxie and the Fujitsu let me save to Dropbox. I have a free Dropbox account with about 4 gig of storage. I don't permanently store data there. I use Dropbox like a portal. I put data in at one end, and pull it out at the other. It's perfect for transferring data between systems. I put data in on my iPad, and pull it out on my Linux desktop. I put files in on my Linux desktop and pull them out on my Macs, my iPad, my iPhone, other Linux desktops. It's the most awesome tool.

Except Dropbox has been getting restrictive. I understand, they have to make money, and I'm on the free plan. I also feel a bit twitchy about storing financial documents there. So I created my own ownCloud server, and started using that. It works like Dropbox, except I control the data repository on my own servers. There are iOS apps, Mac programs, and Linux programs for ownCloud. I use it now. I scan directly to the local ownCloud directory, and the ownCloud software distributes it to my devices.

So generally, I will scan documents and create PDFs on my iPad or Mac, push them into ownCloud, and pull them out on my Linux desktop.

But the original creator of ownCloud has abandoned it, and started a newer one called Nextcloud. I should be looking at that too, except it requires a newer version of PHP to run, and I must wait till the next version of Slackware comes out with that version of PHP. When it does, I will surely investigate Nextcloud.

Saving the data

My primary workspace is my Linux desktop. Command line work, storing and filing data in a file hierarchy. Old school. But the data is readily available, everything is consistently named so it's sorted, it's easy to use the regular Unix tools like ls, find and grep.
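The consistent naming is what makes ls, find and grep so effective: with the date written as YYYYMMDD at the front of the name, alphabetical order is also chronological order. A minimal Python sketch of that idea, assuming the date_name_name_kind.pdf pattern used in the examples later in this post; the helper name `filing_name` is hypothetical and not part of any tool mentioned here.

```python
from datetime import date


def filing_name(day: date, *parts: str, ext: str = "pdf") -> str:
    """Build a name like 20190221_rutledge_henry_invoice.pdf."""
    # Lowercase and replace spaces so names stay easy to type in a shell.
    cleaned = [p.strip().lower().replace(" ", "_") for p in parts if p.strip()]
    return "_".join([day.strftime("%Y%m%d"), *cleaned]) + "." + ext


names = [
    filing_name(date(2019, 2, 21), "rutledge", "henry", "invoice"),
    filing_name(date(2018, 12, 3), "rutledge", "henry", "receipt"),
    filing_name(date(2019, 1, 21), "rutledge", "henry", "invoice"),
]

# A plain string sort puts the files in date order, oldest first.
chronological = sorted(names)
```

Because the date is zero-padded and comes first, no date parsing is needed to list a year's worth of documents in order.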

Once the data is transferred to my Linux desktop, it gets stored on the local hard disk. Occasionally I do some editing or manipulation, but mostly the PDFs are already complete and just need filing.

Each night, my local data is automatically backed up to my NAS. The data on my NAS is automatically backed up to an external hard drive. Once a week, the data on my NAS is backed up to a portable external hard drive, which is stored offsite. And there are a few other backups that occur as well. At any given time, I have the data on my desktop, on my NAS, and on about four external hard drives in various places. I have a phobia about losing data.

And I don't shred the original documents until the scans are filed and backed up. Then I shred. That's in case I have to go back to the original source and do it all over again.
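The "filed and backed up before shredding" rule can be checked mechanically. The sketch below is only one possible way to do it, not the setup described above: it assumes each backup is a mounted directory that mirrors the local file hierarchy, and the function names and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_shred(filed: Path, local_root: Path, backup_roots: list) -> bool:
    """True only if every backup root holds an identical copy of the filed PDF.

    Assumption: each backup mirrors the layout under local_root.
    """
    if not filed.is_file() or not backup_roots:
        return False
    wanted = sha256_of(filed)
    relative = filed.relative_to(local_root)
    for root in backup_roots:
        copy = Path(root) / relative
        if not copy.is_file() or sha256_of(copy) != wanted:
            return False
    return True
```

Comparing checksums rather than just file names catches a backup that holds a truncated or stale copy.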

Editing PDFs

Sometimes I need to edit the PDFs. This is where things get tricky. Some things I can do on my Linux desktop, some things are better done with a GUI tool on the Mac.

Joining PDFs

Often I end up with an invoice that I scanned with the Fujitsu because it's so fast and easy, and a credit card receipt that I scanned with the Doxie because it handles these little things so much more easily. I don't want two PDFs; I want one that contains both the invoice and the receipt. Say, for example, I have 20190221_rutledge_henry_invoice.pdf and 20190221_rutledge_henry_receipt.pdf. There are two Linux tools I can use to combine these into one PDF - pdfconcat and pdfunite.

pdfconcat works nicely, gives you lots of data about the process.

$ pdfconcat -o 20190121_rutledge_henry.pdf 20190221_rutledge_henry_invoice.pdf 20190221_rutledge_henry_receipt.pdf
Input PDF (20190221_rutledge_henry_invoice.pdf): filesize=598630, xrefc=16, xreftc=1, catalogofs=597881, #pages=1, is_binary=1
Input PDF (20190221_rutledge_henry_receipt.pdf): filesize=119191, xrefc=16, xreftc=1, catalogofs=118460, #pages=1, is_binary=1
Output PDF (20190121_rutledge_henry.pdf): filesize=717549, xrefc=32, subfiles=2, #pages=2, is_binary=1

The end result is usually smaller than the individual PDFs. It's a very tight, clean program.

pdfunite is almost the same, but prints nothing if things work well. Note that the output file comes last on the command line.

$ pdfunite 20190221_rutledge_henry_invoice.pdf 20190221_rutledge_henry_receipt.pdf 20190221_rutledge_henry.pdf

Both work about the same, both are fast and smooth and tight, and I use them interchangeably.
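Since pdfunite takes its output file as the last argument and stays silent on success, it is easy to clobber an input by mistake. A small Python wrapper can guard against that. This is a sketch under assumptions: the function names are hypothetical, and the "at least two inputs" rule is a choice made here, not a pdfunite requirement.

```python
import shutil
import subprocess


def unite_command(inputs, output):
    """Return the pdfunite argument list: inputs first, output last."""
    if len(inputs) < 2:
        raise ValueError("need at least two PDFs to join")
    if output in inputs:
        raise ValueError("output would overwrite one of the inputs")
    return ["pdfunite", *inputs, output]


def unite(inputs, output):
    """Run pdfunite, failing loudly if it is missing or returns an error."""
    if shutil.which("pdfunite") is None:
        raise RuntimeError("pdfunite not found on PATH")
    subprocess.run(unite_command(inputs, output), check=True)
```

For the example above, `unite(["20190221_rutledge_henry_invoice.pdf", "20190221_rutledge_henry_receipt.pdf"], "20190221_rutledge_henry.pdf")` would run the same command as typed by hand.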

Removing passwords

Occasionally I get financial documents that are password protected. I generally remove the password before filing them. To do this, I use an iPad app from Readdle called PDF Expert. It cost me $9.99 for the basic PDF viewer, and then another $9.99 in-app purchase to get the ability to edit the PDF. This tool will let me view the PDF after entering the password, and then let me change the password or delete the password. It works, it's neat, it was worth it.

I also bought Readdle's PDF Expert For Mac. The price tag was high, $79. I bought it at that price. The new version 7 was released mid-2019, and it is a subscription model, with free download for basic functions and annual subscription fees for the specialist functions. I am hoping my current abilities have been grandfathered in.

I have been using the Mac version of this program a lot. I scan my documents into my ownCloud directory on my Mac, then edit them immediately with PDF Expert. I remove empty pages, rotate some pages, and remove passwords. I didn't expect to use it so much, but I find I use it a lot. It's a good program and does a good job.

Extracting images

Occasionally I have a PDF that I have scanned and I want to get some of the images out of it so I can do things with them. I might have scanned a whole Christmas card/letter combo which included some photos, and I want to store the photos elsewhere. I use the command line tool pdfimages to extract the scanned images.

$ pdfimages 20190121_rutledge_henry.pdf image
$ ls

This will extract images in the original format. In this example, the scanner created ppm files and they were extracted as is. I can specify the first and last PDF page to extract images from with -f and -l. I can convert the ppm files to another format like jpg with the ImageMagick tools like "convert".
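The page range options and the conversion step can be sketched in Python as plain command builders. The `-f` and `-l` flags are the pdfimages options mentioned above and `convert` is the ImageMagick tool; the function names and the file name cards.pdf are made up for illustration, and nothing is executed until the lists are passed to something like `subprocess.run`.

```python
from pathlib import Path


def extract_command(pdf, prefix, first=None, last=None):
    """Build a pdfimages call, optionally limited to pages first..last."""
    cmd = ["pdfimages"]
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    return cmd + [pdf, prefix]


def convert_commands(ppm_files, ext="jpg"):
    """Build one ImageMagick convert call per extracted image."""
    return [
        ["convert", name, str(Path(name).with_suffix("." + ext))]
        for name in ppm_files
    ]


# Images from pages 2 and 3 only, written with the prefix "image".
pages_2_to_3 = extract_command("cards.pdf", "image", first=2, last=3)
```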

Splitting a PDF up

Sometimes I make mistakes and end up with a PDF that is missing a page, or has blank pages that need to be removed, or a page upside-down, or any other sort of problem. I now use Readdle's PDF Expert on Mac to edit these. It's easier to do it with the GUI. But sometimes I want to do the editing under Linux. I don't do it often, so I forget the commands, so I have included them here for the sake of my memory.

The first step is to break the PDF into individual pages. I use pdfseparate for this.

$ pdfseparate 20190221_rutledge_henry.pdf rhpage_%d.pdf
$ ls

Now I can put them back together using pdfconcat or pdfunite, but with extras. If I have scanned an extra page that needs to be included, I can do it like this:

$ pdfconcat -o new.pdf rhpage_1.pdf extrapage.pdf rhpage_2.pdf

and I will end up with a three page PDF with the extra page in the middle.

Or I could replace page 2 with the extra page like this:

$ pdfconcat -o new.pdf rhpage_1.pdf extrapage.pdf
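
For longer documents, working out the page list by hand gets error-prone. A minimal Python sketch of the same insert and replace operations on a list of page files, assuming the rhpage_%d.pdf pattern given to pdfseparate above (which numbers pages from 1); all function names are hypothetical.

```python
def page_files(count, pattern="rhpage_%d.pdf"):
    """Names pdfseparate writes for a PDF with `count` pages."""
    return [pattern % n for n in range(1, count + 1)]


def insert_after(pages, page_number, extra):
    """Return a new list with `extra` placed after the given 1-based page."""
    if not 0 <= page_number <= len(pages):
        raise IndexError("page number out of range")
    return pages[:page_number] + [extra] + pages[page_number:]


def replace_page(pages, page_number, extra):
    """Return a new list with the given 1-based page swapped for `extra`."""
    if not 1 <= page_number <= len(pages):
        raise IndexError("page number out of range")
    return pages[:page_number - 1] + [extra] + pages[page_number:]


def concat_command(pages, output):
    """Build the pdfconcat call that stitches the pages back together."""
    return ["pdfconcat", "-o", output, *pages]
```

With a two page document, `insert_after(page_files(2), 1, "extrapage.pdf")` gives the three page order from the first command, and `replace_page(page_files(2), 2, "extrapage.pdf")` gives the order from the second.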

If a page is upside down and needs to be fixed, there are a number of ways I can modify it. The simplest way is to split the PDF into individual PDF pages, convert the offending page to an image, edit the image, convert the edited image back to PDF, then create a single PDF from the individual PDFs. For example:

$ pdfseparate 20190221_rutledge_henry.pdf rhpage_%d.pdf
$ pdfimages rhpage_2.pdf page2image
$ mogrify -rotate 180 page2image-000.ppm
$ convert page2image-000.ppm newpage2.pdf
$ pdfconcat -o merged.pdf rhpage_1.pdf newpage2.pdf

There are lots of ways of doing this on the command line. This is detailed finicky work, doing it on the command line. That's why I started using Readdle's PDF Expert, just to get the job done quickly.
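Since the commands are easy to forget, one option is to generate the whole rotate-a-page sequence from a few parameters. The Python sketch below only builds the command lists; it does not run them. It assumes each scanned page holds exactly one image that pdfimages writes as a .ppm file (so the extracted file ends in -000.ppm), and it uses the rhpage_ names that pdfseparate produces. The function name is hypothetical.

```python
def rotate_page_commands(pdf, page, total_pages, output, degrees=180):
    """Commands to rotate one page of a scanned PDF and rebuild the document."""
    if not 1 <= page <= total_pages:
        raise ValueError("page number out of range")
    prefix = f"page{page}image"
    image = f"{prefix}-000.ppm"  # assumption: one ppm image per scanned page
    fixed = f"newpage{page}.pdf"
    # Start from the files pdfseparate writes, then swap in the fixed page.
    pages = [f"rhpage_{n}.pdf" for n in range(1, total_pages + 1)]
    pages[page - 1] = fixed
    return [
        ["pdfseparate", pdf, "rhpage_%d.pdf"],
        ["pdfimages", f"rhpage_{page}.pdf", prefix],
        ["mogrify", "-rotate", str(degrees), image],
        ["convert", image, fixed],
        ["pdfconcat", "-o", output, *pages],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` in order, or printed first as a dry run.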

Constructing a PDF

I don't often do this with documents. Sometimes I want to create a PDF from scratch, like a poster or a game sheet or a timesheet. I use a Python library called ReportLab, and write Python that generates the PDF. I've done this a few times and it's very satisfying. The PDFs are tiny in size, but crisp and very nice. Even though I am writing code to generate the PDF, it takes far less time than creating a form in LibreOffice and generating a PDF from that.
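As an illustration of the approach, here is a minimal sketch of a ruled timesheet page. It is not one of the actual scripts mentioned above; the layout numbers, function names and title are invented. It uses ReportLab's `pdfgen.canvas` module, imported inside the drawing function so the layout helper works even where ReportLab is not installed.

```python
def row_positions(rows, top=700, row_height=24):
    """Y coordinates (in points) of the rules for a sheet with `rows` rows."""
    return [top - i * row_height for i in range(rows + 1)]


def draw_timesheet(path, title="Timesheet", rows=20):
    """Write a one page PDF with a title and evenly spaced horizontal rules."""
    # Requires the third-party reportlab package.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    width, _height = letter          # 612 x 792 points
    left, right = 72, width - 72     # one inch margins
    sheet = canvas.Canvas(path, pagesize=letter)
    sheet.setFont("Helvetica-Bold", 16)
    sheet.drawString(left, 740, title)
    for y in row_positions(rows):
        sheet.line(left, y, right, y)
    sheet.showPage()
    sheet.save()
```

Calling `draw_timesheet("timesheet.pdf")` writes the file. Because the page contains only vector lines and one built-in font, the result stays very small.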

Viewing PDFs

On my Linux desktop, I used to use the Adobe acroread tool to view and print PDFs. But I recently moved all my desktops and servers to 64 bit Slackware, and acroread no longer works. I had to find alternatives.

xpdf is old and venerable, but it stays up to date and it works with almost all PDFs.

gv is part of the GNU suite. It works with a lot of PDFs and has a beautiful old-world feel to it. I don't like the navigation though.

okular is a modern part of KDE, and it is quite nice. It handles a lot more file types than just PDF. It doesn't display some of my bank's PDFs though.

mupdf is a very chopped down PDF viewer: very fast, no menu bars, no frills, just the PDF, and it uses Vi style keystrokes for navigation. It's fast and I like it. It also handles epubs, so when I want rapid Vi-style navigation, I use mupdf for epubs too. I use it on the command line a lot, and it throws off a lot of errors and warnings, so I've wrapped it in a script that sends all that stuff to /dev/null.
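The wrapper idea is simple enough to sketch. The actual script is not shown here and may well be a shell one-liner; this Python version is only an equivalent illustration, with hypothetical function names, that discards everything the viewer prints.

```python
import subprocess


def run_quietly(command):
    """Run a command, sending stdout and stderr to the null device.

    Returns the exit code so callers can still tell if the viewer failed.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def view(path, viewer="mupdf"):
    """Open a PDF or epub in the viewer without console noise."""
    return run_quietly([viewer, path])
```

Keeping the exit code means a genuinely missing or unreadable file can still be noticed even though the warnings are thrown away.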

In the end, I use mupdf for quick and dirty viewing of PDFs, and Okular for printing PDFs. So mupdf is in my Midnight Commander mc.ext extension handler, and in my .mailcap file, and I will manually start Okular on the command line. I'm always on the command line anyway, so it's no big deal.


I used to keep all documents after I scanned them. No more. I keep some that I need to keep, like tax and immigration documents. But everything else gets shredded once scanned and saved.

We are on our fourth shredder. I have burnt out three previous shredders. We used to get good shredders from local stores like OfficeMax, but after the last one, we just gave in and ordered an Amazon Basics Shredder for about $85 and it's been excellent. It does a serious cross-cut shred on everything.