Question

Document Automation Extraction consistently getting wrong text values

  • March 4, 2026
  • 14 replies

I am getting a supplier’s invoice document consistently extracted incorrectly.

 

The invoice PO number is listed as 20493307.

The extracted text is always 00439307.

The 2 becomes a 0 and the 93 becomes 39.

The rest of the text in the description field is correct. Only the PO has this issue.
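
To pin down exactly which characters differ, a position-by-position comparison helps. The following is a minimal Python sketch using the two values from this post; the function names are made up for illustration and are not part of Document Automation.

```python
def digit_mismatches(expected: str, extracted: str) -> list[tuple[int, str, str]]:
    """Return (index, expected_char, extracted_char) for every differing position."""
    return [
        (i, e, x)
        for i, (e, x) in enumerate(zip(expected, extracted))
        if e != x
    ]


def swapped_pairs(expected: str, extracted: str) -> list[int]:
    """Return the start index of each pair of adjacent characters that was transposed."""
    return [
        i
        for i in range(min(len(expected), len(extracted)) - 1)
        if expected[i] != expected[i + 1]
        and expected[i] == extracted[i + 1]
        and expected[i + 1] == extracted[i]
    ]


expected_po = "20493307"   # value printed on the invoice
extracted_po = "00439307"  # value returned by the extraction

print(digit_mismatches(expected_po, extracted_po))  # [(0, '2', '0'), (3, '9', '3'), (4, '3', '9')]
print(swapped_pairs(expected_po, extracted_po))     # [3]
```

This separates the single substituted digit (index 0) from the transposed pair (indexes 3 and 4), which makes it easier to describe the failure precisely in a support ticket.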

 

When I view this in validator, the validation recognizes the correct value for that field.

Any tricks I can try for fixing this?

I have tried swapping the OCR engine in my learning instance, but it doesn’t appear to have any effect.

Aaron.Gleason
Automation Anywhere Team
  • March 4, 2026

@sleemand Is this PDF a machine-generated file or a scan? For example, if you open this PDF in Acrobat Reader (or your browser or whatever), can you highlight the text or is it one solid graphic?

If it is machine-generated (where you can highlight the text), you may be able to bypass OCR and read the text directly, eliminating OCR issues.
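
If you need to make that check for many files rather than by hand, a rough script can approximate it. The sketch below is only a heuristic and relies on an assumption: a PDF with selectable text normally references `/Font` resources, while a pure scan usually contains only images. It uses the Python standard library, does not handle encrypted files or stream filters other than Flate, and `probably_has_text_layer` is a made-up name.

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" in a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def probably_has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: report True if the file appears to reference any font resource."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs may keep their dictionaries inside Flate-compressed streams.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (for example, a JPEG image)
        if b"/Font" in inflated:
            return True
    return False


# Usage (path is a placeholder):
#   from pathlib import Path
#   probably_has_text_layer(Path("invoice.pdf").read_bytes())
```

A scan that already carries an OCR text layer will also report True, so treat the result as a hint and confirm by highlighting text in a viewer.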


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

These particular invoices are machine generated.

Is there an easy way to do this?

My current process for AP invoices:

  1. Download all pdf attachments from our AP email mailbox
  2. Send them through document classifier
  3. Create document automation requests for all the “invoice” classified documents
  4. That request then flows through a document workspace process for extraction/validation and download

I could maybe route these through a different path from the email download step, but I was sold on Document Automation as a simple solution compared to IQ Bot, where every vendor invoice format has to be trained manually.


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand The problem here is the OCR, not Document Automation. My suggestion of bypassing OCR means you will get a perfect extraction of data from the document, but it only works with machine-generated PDFs.

For these kinds of PDFs, you can create a separate learning instance that uses direct text extraction from the document. Otherwise, even for a machine-generated PDF, DA converts the document to an image and runs OCR on it.

When creating your learning instance, choose the Direct PDF Extractor option to do direct text extraction.
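
If the split is made in the email download step, the routing decision could look roughly like this. It is a sketch only: the learning instance names are hypothetical placeholders, and the classification label and text-layer flag are assumed to come from your existing classifier step and from whatever check you use for selectable text.

```python
from typing import Optional

# Hypothetical learning instance names; substitute your own.
OCR_INSTANCE = "AP-Invoices-OCR"
DIRECT_TEXT_INSTANCE = "AP-Invoices-DirectPDF"


def choose_learning_instance(classification: str, has_text_layer: bool) -> Optional[str]:
    """Pick the learning instance for a document, or None if it is not an invoice."""
    if classification.strip().lower() != "invoice":
        return None
    # Machine-generated PDFs go to the direct text instance; scans still need OCR.
    return DIRECT_TEXT_INSTANCE if has_text_layer else OCR_INSTANCE
```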

 


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

So I am trying to do this now and copy my learning instance to change the OCR settings.

So far it has given me an error saying it couldn’t copy the instance, and now I can’t access any of my learning instances. I keep getting a generic server error.

 


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand Try building a separate learning instance from scratch for now. I would also recommend sending an email to support@automationanywhere.com about the generic server error, as this isn’t something we can fix here.


  • Author
  • Navigator | Tier 3
  • March 4, 2026

(My generic server error resolved itself after a long wait)

Unfortunately, the digital PDF option doesn’t seem to fix the issue here. 

The description’s PO is still being extracted incorrectly, now as 004393307 for some reason. (It has an extra 3 compared with the last extraction.)

With this method, the validator now doesn’t detect the PO text at all.

 


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand If you copy and paste the text from the PDF (outside of Automation Anywhere) and put it into Notepad, are you getting the correct PO number? Try highlighting the Shipment Date to the Customer PO lines and pasting that into Notepad.
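
The same check can be scripted once the pasted text is saved. This is a small sketch under assumptions: the label is taken to read "Customer PO" followed by the number, and the sample string is invented for illustration because the real invoice layout is not shown in this thread.

```python
import re
from typing import Optional

# Assumes the number follows the "Customer PO" label within a few non-digit characters.
PO_PATTERN = re.compile(r"Customer\s+PO\D{0,20}(\d+)", re.IGNORECASE)


def find_customer_po(pasted_text: str) -> Optional[str]:
    """Return the first run of digits after the 'Customer PO' label, if any."""
    match = PO_PATTERN.search(pasted_text)
    return match.group(1) if match else None


# Invented sample standing in for text pasted from the PDF.
sample = "Shipment Date: (date as shown on the invoice)\nCustomer PO: 20493307\n"
print(find_customer_po(sample))  # 20493307
```

If this prints the value shown on the invoice, the text layer itself is fine and the problem sits later in the extraction.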


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

Yeah, copy/paste looks to be fine.

I tried multiple times and the value is exactly as it displays on the PDF.

Highlight when opening the PDF file in Edge browser.

Paste into Notepad++

 


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

I’m doing some other config testing with the learning instance.

I switched from OpenAI to Anthropic for the genAI extraction and that seems to have worked.

I’m not sure whether I should keep this change in production and possibly affect other invoice formats, or try to get this specific format working in OpenAI.


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand Very interesting. Give it a try with your other learning instance and see if that improves things. If not, email me a PDF at aaron.gleason@automationanywhere.com and I’ll give it a look too.


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

Is AI heuristic feedback tied to the model chosen here?

I have already split out a different supplier’s invoices to a new learning instance using Anthropic which instantly solved my problems from my OpenAI instance with that supplier.

 

I’m wondering whether switching model providers in my testing reset all the feedback, and whether the 35k documents we’ve processed were negatively influencing this behavior.


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand I certainly hope not, but that is a possibility. 🤔

Either way, the direct text extraction should have eliminated the problem, unless the AI goes in afterwards and tries to “autocorrect” the entries...


  • Author
  • Navigator | Tier 3
  • March 4, 2026

@Aaron.Gleason 

That was my suspicion. It is almost like it is hallucinating the value.

I re-ran that file with trace logging and pulled up the debug files.

 

From what I think I understand in these files, the OCR detects it properly.

But the extracted data is incorrect.
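
One way to catch this kind of mismatch automatically is to check that each extracted value appears verbatim in the text the OCR or direct extraction produced. The sketch below is a generic post-processing idea, not a Document Automation feature; the function and field names are hypothetical, and the source text is a stand-in for what the debug files contain.

```python
def is_grounded(extracted_value: str, source_text: str) -> bool:
    """True if the extracted value appears verbatim in the source text."""
    value = extracted_value.strip()
    return bool(value) and value in source_text


def ungrounded_fields(fields: dict[str, str], source_text: str) -> list[str]:
    """Return the names of fields whose values cannot be found in the source text."""
    return [name for name, value in fields.items() if not is_grounded(value, source_text)]


source_text = "Customer PO: 20493307"  # stand-in for the text seen in the debug files
fields = {"po_number": "00439307"}     # value the extraction returned

print(ungrounded_fields(fields, source_text))  # ['po_number']
```

Fields flagged this way could be routed to the validator instead of being passed downstream. Very short values can match by coincidence, so the check is most useful for longer identifiers such as PO numbers.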

 


Aaron.Gleason
  • Automation Anywhere Team
  • March 4, 2026

@sleemand I think you’ve hit the nail on the head. I checked with our engineering team and they said the “GenAIVision” tag, if used for that field, could cause this issue.

You could switch to a newer AI model like Gemini. There will also be some enhancements in the near future, so watch the release notes on our docs site.

https://docs.automationanywhere.com/bundle/enterprise-v2019/page/enterprise-cloud/topics/release-notes/cloud-release-notes.html