
Good day AA community team! 

I’m hoping to get some recommendations for a roadblock I’m currently facing during my validation process in Document Automation. 

Our business requirement is to extract a group of texts from a structured PDF file, but we’re unable to do so successfully. The group of texts is spread across multiple separate field regions, which I think is what causes the failure in the validation process. 

Due to privacy policies, I’m unable to share the exact documents we’re processing, so I’ll share a screenshot of one of the sample invoices from the AAU to give a clearer picture of what I mean by “multiple separate field regions”.

 

 
For example, we need to extract the entire “Automation Anywhere” section, which is not a single field region (as the address is) but two separate ones, “Automation” and “Anywhere”. 

What I have tried so far, without success: 

  • Created a custom field region by highlighting the entire section “Automation Anywhere”
  • Created a regex pattern to try to extract the complete section 
    • Failed: it only extracts the part of the text that lies within a single field region 
  • Added multiple aliases
  • Tried both the ABBYY and Google Vision OCR engines. The Digital Extractor OCR is incompatible with our PDFs.
  • Lowered the confidence level 
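
To illustrate what I think is happening with the regex attempt, here is a rough Python sketch. It is not how Document Automation works internally; the `regions` list and the pattern are made up for illustration, and I am only assuming that the pattern is applied to each field region's text separately.

```python
import re

# Hypothetical OCR output: each field region carries its own text.
regions = ["Automation", "Anywhere"]

# Pattern that expects the whole phrase in one piece of text.
pattern = re.compile(r"Automation\s+Anywhere")

# Applied region by region, the pattern never sees the full phrase.
per_region_matches = [text for text in regions if pattern.search(text)]

# Applied to the regions joined in reading order, it matches.
joined = " ".join(regions)
joined_match = pattern.search(joined)

print(per_region_matches)
print(joined_match.group(0) if joined_match else None)
```

If that assumption is right, the pattern itself is fine and what I am missing is a way to have it evaluated against the combined text of both regions.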


Fingers crossed this has been encountered and resolved by others before. I would highly appreciate any recommendations that can be provided :) 

Stay safe and healthy! 
