One of the most common problems for people who work with PDF documents is being unable to copy the text those files contain.
Often this is not because the file is protected but, more simply, because the PDF document is made up of one or more images.
How do I convert images to editable text in Word or any other text editor?
PDF is one of the most common formats for sharing content with other people. By exporting your documents to PDF you can be sure that third parties see the content in the same way it is shown on your computer.
With other formats, missing fonts are the main cause of incorrect layout, along with the use of software (word processors) and file formats that are not very interoperable.
Our suggestion, therefore, is to create PDF/A files wherever possible: PDF/A makes the document self-contained and independent of the software configuration in use (see How to create PDF files with Windows 10 and without).
PDF is also the most widely used format for saving and distributing scanned pages. However, those pages have rarely been processed with OCR, so the text they contain cannot be copied.
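Before running OCR it can be useful to check whether a PDF already carries a text layer. The following is a rough sketch in Python, not a full PDF parser: it inspects the file's streams and looks for the text operators (BT ... Tj/TJ ... ET) that a page with real text would contain. The function names are our own, only uncompressed and Flate-compressed streams are handled, and encrypted files or unusual encodings will give wrong answers, so treat the result as a hint.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_OBJECT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield each stream body, Flate-decompressed when possible."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing or slightly truncated data.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (for example JPEG image data): inspect as-is.
            yield raw


def has_text_layer(pdf_bytes):
    """Heuristic: True if any stream appears to draw text."""
    return any(TEXT_OBJECT_RE.search(body) for body in iter_streams(pdf_bytes))
```

A possible use, with a hypothetical file name: `has_text_layer(open("scan.pdf", "rb").read())`. A result of `False` suggests that the pages are plain images and need OCR.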
To turn the images that make up a PDF document (for example a series of scanned pages) into text that can be copied elsewhere, you can use OCR (optical character recognition) software.
Our advice is to use PDF-XChange Viewer, an application that includes an OCR module capable of converting images into text and creating a new version of the PDF document in which the page images carry text that can be selected, searched and copied elsewhere.
The procedure for converting the images in a PDF file into text is very simple:
- Download and install PDF-XChange Viewer
- Install the Italian language pack for OCR by downloading this file and then double-clicking the executable it contains.
- Start PDF-XChange Viewer and open the PDF document of interest.
- Open the Document menu, choose Scan text pages with OCR and set the options displayed in the window as follows:
- When OCR is complete, choose File, Save As and save the PDF file under a different name.
- Open the PDF document with PDF-XChange Viewer, Adobe Reader or any other PDF reader. Select the text you are interested in, then copy it (CTRL+C) and paste it elsewhere (CTRL+V).
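For those who prefer a script to a graphical program, the same steps (run OCR in Italian, then save the result under a different name) can be sketched in Python. Note that this does not use PDF-XChange Viewer: it assumes that the separate open source command-line tool OCRmyPDF is installed together with the Italian language data for Tesseract. The function names and file names are purely illustrative.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(src: str, dst: str, language: str = "ita") -> List[str]:
    """Return the ocrmypdf command line that adds a text layer to src."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code; "ita" is Italian
        "--skip-text",           # leave pages that already contain text alone
        src,
        dst,                     # new file; the source PDF is not modified
    ]


def add_text_layer(src: str, dst: str, language: str = "ita") -> None:
    """Run OCR on src and save the searchable copy as dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf reports a failure.
    subprocess.run(build_ocr_command(src, dst, language), check=True)
```

Calling `add_text_layer("scan.pdf", "scan-ocr.pdf")` would leave the original scan untouched and write a second file whose text can be selected and copied.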
Another tool worth considering is CopyFish, an extension for Google Chrome that converts images into text by examining what is shown in the currently open browser tab.
CopyFish also handles multi-column layouts well, managing to extract all the text as a single column, one block after another.
The PDF document from which you want to extract text can be opened in Google Chrome and then handed over to CopyFish by clicking the icon shown next to the browser's URL bar.
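To see why multi-column pages are tricky for OCR, consider the order in which recognized words are put back together. The sketch below is only an illustration of the idea and says nothing about how CopyFish works internally: it assumes each word comes with the x and y coordinates of its top-left corner, that the page has exactly two columns, and that the position of the gap between them (`gutter_x`) is known. All names are hypothetical.

```python
from typing import List, Tuple

# (left x, top y, text) of one recognized word, in page coordinates.
Word = Tuple[float, float, str]


def reading_order(words: List[Word], gutter_x: float) -> str:
    """Join words column by column: left column first, then the right one."""
    left = [w for w in words if w[0] < gutter_x]
    right = [w for w in words if w[0] >= gutter_x]
    ordered = []
    for column in (left, right):
        # Inside a column read top to bottom, then left to right.
        # Assumes words on the same line share the same top y value.
        for _, _, text in sorted(column, key=lambda w: (w[1], w[0])):
            ordered.append(text)
    return " ".join(ordered)
```

Sorting the whole page from top to bottom would instead interleave the lines of the two columns, which is the typical symptom of an OCR tool that does not recognize the layout.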
Office Online, Google Keep and OneNote integrate an OCR tool that can recognize text in images, including images contained in PDF documents.
The conversion, in this case, takes place online – on Microsoft and Google servers – but the result is similar.
To acquire a paper document directly as a PDF and run OCR on it, you do not need a scanner or ad hoc software.
Using the camera on your mobile device, you can scan pages instantly and have them processed with OCR.
Tiny Scanner, also for Android, is really practical to use and, above all, very effective.
Tiny Scanner can create multipage PDFs by capturing images one after the other with the camera of your mobile device.
Scans can be organized into folders, shared via email or through Dropbox, Evernote, Google Drive and Box, or saved directly to your computer over WiFi.
Tiny Scanner does not include OCR but, because it can create multipage PDFs, it can be used to quickly produce documents ready to be passed to the tools mentioned above.