I have some PDFs that I need converted into editable text. I decided to go with Tesseract OCR, as it seems to be the best tool for the job. Here are the steps for using Tesseract OCR to convert PDFs to text.
First things first, get the Tesseract CLI installed. Follow the instructions here, which are linked from the official Tesseract docs.
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get update
sudo apt install tesseract-ocr tesseract-ocr-eng
Note: the package didn’t properly place the eng.traineddata file for me. If you get an error about this refer to the troubleshooting steps at the bottom of this article.
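Before moving on, it can help to confirm that the install worked and that the English data is visible to Tesseract. The sketch below relies on the tesseract --list-langs flag; the function names lang_listed and check_tesseract_lang are made up for this example, and the redirect of stderr is there because some Tesseract versions print the language list to stderr rather than stdout.

```shell
# Succeed if language code $1 appears as a whole line on stdin.
# (Hypothetical helper name, not part of Tesseract.)
lang_listed() {
  grep -qxF -- "$1"
}

# Succeed if tesseract is on PATH and reports language $1 as installed.
check_tesseract_lang() {
  command -v tesseract >/dev/null 2>&1 || {
    echo "tesseract not found on PATH" >&2
    return 1
  }
  # Merge stderr into stdout so the list is captured either way.
  tesseract --list-langs 2>&1 | lang_listed "$1"
}
```

For example, check_tesseract_lang eng && echo "English data found" should print the message once the tesseract-ocr-eng package is installed correctly.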
In the CLI, cd into the directory with the images or PDFs you want to convert.
Tesseract cannot read PDFs directly, so we first convert the PDF to a .tiff file with ImageMagick's convert tool, then run Tesseract on the .tiff to get the text.
#Convert the PDF to a .tiff file, change out the file names at the end of this command to your own
#Note: If you get an error about security policy check the troubleshooting section below
convert -fill white -draw 'rectangle 10,10 20,20' -background white +matte -density 300 Loring-Lombard-Autobiogrphy-Pages1-10.pdf Loring-Lombard-Autobiogrphy-Pages1-10.tiff
#Tesseract will add .txt to the end of the new file name
tesseract Loring-Lombard-Autobiogrphy-Pages1-10.tiff Loring-Lombard-Autobiogrphy-Pages1-10
You should now have a text file next to the .tiff. It really is as easy as that to use Tesseract OCR to convert PDFs to text files.
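If you have a whole folder of PDFs, the two commands above can be wrapped in a loop. This is only a sketch: the function names strip_pdf_ext, ocr_pdf and ocr_all_pdfs are invented for this example, and the convert options are simply copied from the command above.

```shell
# Print the file name without its .pdf extension, e.g. "scan.pdf" -> "scan".
strip_pdf_ext() {
  printf '%s\n' "${1%.pdf}"
}

# Convert one PDF to a .tiff, then OCR the .tiff into <name>.txt.
ocr_pdf() {
  local pdf="$1"
  local base
  base="$(strip_pdf_ext "$pdf")"
  # Same ImageMagick options as the single-file command in this article.
  convert -fill white -draw 'rectangle 10,10 20,20' -background white \
    +matte -density 300 "$pdf" "$base.tiff" || return 1
  # Tesseract appends .txt to the output base name itself.
  tesseract "$base.tiff" "$base"
}

# Run ocr_pdf on every PDF in the current directory.
ocr_all_pdfs() {
  local pdf
  for pdf in ./*.pdf; do
    [ -e "$pdf" ] || continue   # nothing matched the glob
    ocr_pdf "$pdf" || echo "failed: $pdf" >&2
  done
}
```

After pasting these functions into your shell, cd into the folder and run ocr_all_pdfs. Files that fail are reported on stderr and the loop carries on with the next one.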
Missing Language Training Data
If you see something like the error message below, it means the English training data is not installed.
Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Simply install the tesseract-ocr-eng package with the command below:
sudo apt install tesseract-ocr-eng
If this doesn’t fix it then check out this GitHub issue for more troubleshooting steps.
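If the package is installed and the error persists, one thing worth checking is whether eng.traineddata exists anywhere on disk, and whether it sits in the directory named in the error message. Here is a rough sketch; find_traineddata is a made-up helper name, and the default search root of /usr/share is an assumption based on the path shown in the error above.

```shell
# Print the path of every <lang>.traineddata file found under a directory.
# $1 = language code (e.g. eng), $2 = directory to search (default /usr/share).
find_traineddata() {
  local lang="$1"
  local root="${2:-/usr/share}"
  find "$root" -type f -name "$lang.traineddata" 2>/dev/null
}
```

Running find_traineddata eng lists any copies it finds. If the file turns up somewhere other than the path in the error, you could try pointing the TESSDATA_PREFIX environment variable at its directory, as the error message suggests. I have not verified that on every Tesseract version, so treat it as something to try rather than a guaranteed fix.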
Convert Tool Security Policy Error
convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
convert-im6.q16: no images defined `converted-pdf.tiff' @ error/convert.c/ConvertImageCommand/3229.
To fix the above error, you need to edit or remove the ImageMagick security policy. The simplest solution is to temporarily rename the security policy file, but this may be dangerous if you forget to put it back. Instead, I recommend editing the policy file and removing only the offending rule.
sudo sed -i 's/^.*policy.*coder.*none.*PDF.*//' /etc/ImageMagick-6/policy.xml
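If you would rather keep a backup of the policy file before changing it, a variation along these lines should work. It is a sketch, not the command I originally ran: allow_pdf_coder is a made-up function name, the .bak suffix tells sed to save the original file next to the edited one, and the d command deletes the matching line outright instead of leaving a blank line behind.

```shell
# Remove the rule that denies the PDF coder from an ImageMagick policy file.
# The untouched original is kept alongside it as <file>.bak.
allow_pdf_coder() {
  local policy="$1"
  sed -i.bak '/policy.*coder.*none.*PDF/d' "$policy"
}
```

Shell functions are not visible to sudo, so for the real file under /etc you would run the sed line directly, for example sudo sed -i.bak '/policy.*coder.*none.*PDF/d' /etc/ImageMagick-6/policy.xml. To undo the change, copy policy.xml.bak back over policy.xml.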
Check out this StackOverflow post for more details on working around this error.