Question

Reading image data in PDFs

  • 4 December 2023
  • 6 replies
  • 63 views

Badge +1

Hello, I hope you are well.

As part of my job, I need to compare many documents with each other, and I use some online comparison sites to do so. However, some PDFs contain data in the form of images, and I cannot compare that data. How can I convert the data given as an image in a PDF to text? I have tried extracting text from images and PDFs with the OCR package, but this is not exactly what I need. I want a readable version of the data that is given as an image, and I want to see it in the same PDF alongside the rest of the information. If there is a way to do this, could you please help me?

Thanks in advance!


6 replies

Userlevel 3
Badge +5

There are a number of ways of doing this, but before we start, we need to find out how the PDF was created. At a high level, there are two kinds: a PDF created by scanning a physical document, and a PDF created by print-to-file or export. You can usually tell the difference by opening the PDF and trying to highlight the text. If you cannot highlight the text, it’s likely a scanned PDF. If you can, it is likely a print-to-file/export PDF.

With print-to-file/export PDFs, you can use the PDF package and the Export as Text action. This will create a text file with or without formatting (e.g., extra spaces to simulate the position of the text within the PDF). No OCR is needed.

With scanned PDFs, OCR or Document Automation is your only choice.
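The highlight test above can be roughly automated. Below is a minimal Python sketch, assuming you have already pulled the text of each page into a list of strings with whatever PDF tool you prefer; the function names and the `min_chars` threshold are hypothetical, chosen only for illustration.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image-only' from its extracted text.

    page_texts: one string per page, produced by any PDF text extractor
    (assumed input, not a specific library's API).
    min_chars: assumed threshold below which a page is treated as having
    no usable text layer.
    """
    labels = []
    for text in page_texts:
        # Count only non-whitespace characters so layout padding does not
        # make an image-only page look like a text page.
        visible = sum(1 for ch in text if not ch.isspace())
        labels.append("text" if visible >= min_chars else "image-only")
    return labels


def classify_pdf(page_texts, min_chars=20):
    """Return 'exported', 'scanned', 'mixed', or 'unknown' for a document."""
    labels = classify_pages(page_texts, min_chars)
    if not labels:
        return "unknown"
    if all(label == "text" for label in labels):
        return "exported"
    if all(label == "image-only" for label in labels):
        return "scanned"
    return "mixed"


print(classify_pdf(["A full page of selectable text here.", ""]))
```

Note that this only looks at whole pages. A page that has selectable text plus a table embedded as an image would still be labeled "text", so it is worth opening the PDF and checking by eye as well.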

Badge +1

I am sharing part of the PDF. As you can see, I can highlight the text and extract it, but I cannot extract the data in the image. The decimal degree data on the right is in image format, so I cannot extract it along with the text in the PDF. I need to get this decimal degree data as text.

Userlevel 3
Badge +5

This PDF is a hybrid because the table of data is an image. OCR is your only choice here. Be cautious: OCR accuracy is not 100% with any company’s OCR. The darker backgrounds will also make the OCR’s job even more difficult.

You are not likely to get a satisfactory extraction using OCR. Document Automation may be able to do better than the OCR functionality, but since Document Automation also uses OCR, it would have similar issues.

I would contact the source of this information and see if they can make the data available in a different format.

As far as converting the values into decimal degrees, I might suggest a Python function like what is found here: https://stackoverflow.com/questions/33997361/how-to-convert-degree-minute-second-to-degree-decimal
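For reference, here is a minimal sketch of that kind of conversion. It assumes the OCR output arrives as strings such as `36° 30' 0" N`; the actual layout of the values in the PDF may differ, and the names `dms_to_decimal` and `parse_dms` are made up for this example rather than taken from the linked answer.

```python
import re


def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, negative=False):
    """Convert non-negative degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value


# Assumed input shape: optional sign, degrees, optional minutes and seconds,
# optional hemisphere letter. Adjust to match what the OCR actually returns.
_DMS = re.compile(
    r"""^\s*(?P<sign>[+-]?)\s*
        (?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW]?)\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_dms(text):
    """Parse a DMS string and return decimal degrees (south/west negative)."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"not a recognised DMS string: {text!r}")
    # The sign is read from the text rather than from the number, so that
    # "-0° 30'" is still treated as negative.
    negative = match["sign"] == "-" or match["hem"].upper() in ("S", "W")
    return dms_to_decimal(
        float(match["deg"]),
        float(match["min"] or 0),
        float(match["sec"] or 0),
        negative,
    )


print(parse_dms("36° 30' 0\" N"))
```

Strings that do not fit the pattern raise `ValueError` instead of returning a guess, which is useful here because a misread character from OCR should be caught rather than silently converted.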

Userlevel 3
Badge +7

@ilben.isidi ,

I think Document Automation will definitely do the job for you. You can select Google Vision as the OCR engine, which should extract this data fine. If this is the consistent quality of your documents, I don’t see any issue.

Userlevel 3
Badge +5

Just be careful. I just ran it through Google Vision and the result was NOT 100% accurate. I used a direct call to Google Vision’s API to perform this analysis, not through Document Automation.
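Because of that, it may be worth running a basic sanity check over whatever the OCR returns before trusting it. The sketch below is one possible approach in Python, assuming the extracted values are decimal degrees held as strings; the function name and the default range are assumptions for illustration, and it will not catch a misread digit that still produces a plausible number.

```python
def find_suspect_coordinates(values, low=-180.0, high=180.0):
    """Return (index, raw_value, reason) for values that need manual review.

    values: OCR output strings expected to hold decimal degrees (assumed).
    low/high: assumed valid range; use -90..90 for latitude columns.
    """
    suspects = []
    for index, raw in enumerate(values):
        try:
            number = float(raw.strip())
        except ValueError:
            # Typical of OCR confusions such as a letter read in place of a digit.
            suspects.append((index, raw, "not a number"))
            continue
        if not low <= number <= high:
            suspects.append((index, raw, "out of range"))
    return suspects


for item in find_suspect_coordinates(["36.5", "3G.5", "365.0", " -30.26 "]):
    print(item)
```

Anything this flags would still need to be compared against the original image by hand.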

Userlevel 3
Badge +7

You can also try enhancing the image with the enhance image command before running OCR.
