Question

IQ Bot OCR mistaking English/Latin characters for Cyrillic

  • 10 April 2024


I’ve run into an odd issue that was causing problems with our IQ Bot document extraction.

An invoice contains a string like “PO 123456 - foo bar baz WO 987654”, which is extracted in full into the PO number field.

I have some logic on this field to extract the numeric portion of the PO, but I discovered the OCR was storing the characters “PO” as Unicode code points 1056 (U+0420, Cyrillic capital letter Er, “Р”) and 1054 (U+041E, Cyrillic capital letter O, “О”), which look identical to the Latin “P” and “O”.

Because of this, my Python regex does not match:

import re

# Capture the 6-8 digit number that follows a Latin "PO ".
po_match = re.compile(r'PO (\d{6,8})')
match = po_match.search(field_value)
if match:
    field_value = match[1]
else:
    field_value = 'NO MATCH'
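
For anyone wanting to confirm which characters the OCR actually returned, a minimal sketch like the one below lists every non-ASCII character in a string with its code point and Unicode name. The helper name describe_chars and the sample string are illustrative assumptions, not anything provided by IQ Bot.

```python
import unicodedata

def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# Hypothetical sample: Cyrillic look-alikes followed by a PO number.
sample = chr(1056) + chr(1054) + " 123456"
for ch, code, name in describe_chars(sample):
    print(ch, code, name)
# prints:
# Р 1056 CYRILLIC CAPITAL LETTER ER
# О 1054 CYRILLIC CAPITAL LETTER O
```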

I’ve worked around this using some replacements specifically related to this issue.

field_value = field_value.replace(chr(1056),"P")
field_value = field_value.replace(chr(1054),"O")
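
If other letters turn out to be affected as well, the same workaround could be generalized with a translation table. The following is only a sketch: the HOMOGLYPHS table and the extract_po function are hypothetical names, and the mapping is a partial, hand-picked list of Cyrillic capitals that resemble Latin ones, not an exhaustive set.

```python
import re

# Assumed, non-exhaustive map of Cyrillic capitals to their Latin look-alikes.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A",
    "\u0412": "B",
    "\u0415": "E",
    "\u041a": "K",
    "\u041c": "M",
    "\u041d": "H",
    "\u041e": "O",
    "\u0420": "P",
    "\u0421": "C",
    "\u0422": "T",
    "\u0425": "X",
})

PO_PATTERN = re.compile(r'PO (\d{6,8})')

def extract_po(field_value):
    """Normalize look-alike characters, then pull out the PO number."""
    normalized = field_value.translate(HOMOGLYPHS)
    match = PO_PATTERN.search(normalized)
    return match[1] if match else 'NO MATCH'

print(extract_po(chr(1056) + chr(1054) + " 123456 - foo bar baz WO 987654"))
```

Note that this rewrites every mapped character in the field, so it would only be appropriate for fields that are never expected to contain genuine Cyrillic text.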

I’m wondering whether anything can be done on the front end, within IQ Bot itself, to prevent this.

My learning instance is configured as English, so I’m unsure why this even occurred in the first place.

