How to Use Tesseract OCR to Convert PDFs to Text

I have some PDFs that I need converted into editable text. I decided to go with Tesseract OCR because it seemed like the best tool for the job. Here are the steps for using Tesseract OCR to convert PDFs to text.

Installation

First things first, get the Tesseract CLI installed. Follow the instructions here, which are linked from the official Tesseract docs.

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get update
sudo apt install tesseract-ocr tesseract-ocr-eng

Note: the package didn’t properly place the eng.traineddata file for me. If you get an error about this refer to the troubleshooting steps at the bottom of this article.
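Before moving on, it can help to confirm that the install worked and that the English data is visible to Tesseract. The snippet below is a minimal sketch: has_lang is a hypothetical helper name of my own, and the check relies on tesseract --list-langs printing one language code per line.

```shell
# has_lang CODE
# Succeeds if CODE appears as a whole line in the text read from stdin.
has_lang() {
  grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
  # Print the installed version.
  tesseract --version
  # List the installed languages and look for "eng".
  if tesseract --list-langs 2>&1 | has_lang eng; then
    echo "eng training data found"
  else
    echo "eng training data missing"
  fi
else
  echo "tesseract is not on PATH"
fi
```

If the last line reports that eng is missing, the troubleshooting section covers the fix.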

Usage

In the CLI, cd into the directory with the images or PDFs you want to convert.

Remember, Tesseract cannot read PDF files directly. First convert the PDF to a .tiff file with ImageMagick's convert tool, then run Tesseract on the .tiff to produce the text.

# Convert the PDF to a .tiff file; replace the file names at the end of this command with your own
# Note: if you get an error about the security policy, check the troubleshooting section below
convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte -density 300 Loring-Lombard-Autobiogrphy-Pages1-10.pdf Loring-Lombard-Autobiogrphy-Pages1-10.tiff

# Tesseract will add .txt to the end of the new file name
tesseract Loring-Lombard-Autobiogrphy-Pages1-10.tiff Loring-Lombard-Autobiogrphy-Pages1-10

The convert step printed some errors for me, which I was able to safely ignore. Once the PDF to .tiff conversion finished, I ran the tesseract command to create the text file.

You should now have a text file. It really is that easy to use Tesseract OCR to convert PDFs to text files.
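If you have more than one PDF, the two commands above can be wrapped in a small shell function. This is only a sketch: pdf_to_text is a hypothetical name of my own, and it assumes the input file name ends in .pdf and reuses the same convert options as above.

```shell
# pdf_to_text FILE.pdf
# Rasterize FILE.pdf to FILE.tiff with ImageMagick, then OCR it into FILE.txt.
pdf_to_text() {
  pdf_in="$1"
  out_base="${pdf_in%.pdf}"   # file name without the .pdf extension

  # Same convert options as the single-file command in this article.
  convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte \
    -density 300 "$pdf_in" "$out_base.tiff" || return 1

  # Tesseract appends .txt to the output base name.
  tesseract "$out_base.tiff" "$out_base"
}

# Usage, converting every PDF in the current directory:
#   for f in *.pdf; do [ -e "$f" ] && pdf_to_text "$f"; done
```

The function stops before the OCR step if convert fails, so a security policy error does not leave you running Tesseract on a missing .tiff file.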

Troubleshooting

Missing Language Training Data

If you see something like the error message below, the English training data is not installed where Tesseract expects it.

Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Simply install the tesseract-ocr-eng package with the below command:

sudo apt install tesseract-ocr-eng

If this doesn’t fix it then check out this GitHub issue for more troubleshooting steps.
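If the package is installed but the error persists, the data file may simply be in a different directory than the one Tesseract is checking. The sketch below searches for it and sets TESSDATA_PREFIX, the variable named in the error message. find_tessdata_dir is a hypothetical helper of my own, and searching under /usr/share is an assumption about where the package puts its files; the error above says the variable should point at the "tessdata" directory itself, but other Tesseract versions may expect something different.

```shell
# find_tessdata_dir ROOT LANG
# Print the directory holding LANG.traineddata under ROOT (first match), or fail.
find_tessdata_dir() {
  match=$(find "$1" -name "$2.traineddata" 2>/dev/null | head -n 1)
  [ -n "$match" ] || return 1
  dirname "$match"
}

# Assumption: the training data lives somewhere under /usr/share.
if dir=$(find_tessdata_dir /usr/share eng); then
  export TESSDATA_PREFIX="$dir"
  echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
else
  echo "eng.traineddata not found under /usr/share"
fi
```

The export only lasts for the current shell session, so run the tesseract command from the same terminal.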

Convert Tool Security Policy Error

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
convert-im6.q16: no images defined `converted-pdf.tiff' @ error/convert.c/ConvertImageCommand/3229.

To fix the above error you need to edit or get rid of the ImageMagick security policy. The simplest solution is to temporarily rename the security policy file, but this may be dangerous if you forget to put it back. Instead, I recommend editing the policy file and removing the offending rule.

sudo sed -i 's/^.*policy.*coder.*none.*PDF.*//' /etc/ImageMagick-6/policy.xml
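If you would like to see what that sed expression does before touching the system file, you can try it on a scratch copy. This is a sketch under assumptions: remove_pdf_policy is a hypothetical helper name, the sample policy lines are my own illustration of what the rule typically looks like, and the .bak suffix on -i is what keeps a backup next to the edited file.

```shell
# remove_pdf_policy FILE
# Blank out the rule that denies the PDF coder, keeping the original as FILE.bak.
remove_pdf_policy() {
  sed -i.bak 's/^.*policy.*coder.*none.*PDF.*//' "$1"
}

# Scratch file standing in for policy.xml (illustrative content only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<policymap>
  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="none" pattern="PDF" />
</policymap>
EOF

remove_pdf_policy "$sample"
cat "$sample"          # the PDF rule is now an empty line; the PS rule is untouched
```

The same -i.bak form works with the sudo command above, which leaves a policy.xml.bak you can restore from if needed.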

Check out this Stack Overflow post for more details on working around this error.
