Our products include the ability to scan images using Optical Character Recognition (OCR) on a variety of image file formats.
This article covers how OCR is used, how it works, and what it requires.
Limitations & accuracy of OCR
As with all OCR software, accuracy will never be as high as when scanning raw text data. The process relies on a complex algorithm that reads the image file and compares each marking within the image to an alphanumeric table to determine whether a probable match exists.
OCR accuracy is affected by several factors, including:
- The quality of the image being scanned - 300dpi or higher is recommended
- Abnormally styled fonts that may not be clear or consistent
- Any noise in the image such as scanner marks, lines or soft colour tones
- The font size - text smaller than 10pt will not be detected reliably
- The format of the image - some image formats will result in better detection rates than others (lossless > lossy)
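The resolution and font size factors above interact: a character's height in pixels depends on both. As a rough illustration only, the sketch below converts a point size and a DPI value into a nominal pixel height (1 point = 1/72 inch) and checks an image against the 10pt / 300dpi guidance given in this article. The function names are hypothetical and this is not how the products evaluate images internally.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def font_height_px(point_size: float, dpi: float) -> float:
    """Nominal height in pixels of text set at point_size and rasterized at dpi."""
    return point_size * dpi / POINTS_PER_INCH


def meets_ocr_guidance(point_size: float, dpi: float,
                       min_point_size: float = 10,
                       min_dpi: float = 300) -> bool:
    """Check against this article's guidance (assumed thresholds: 10pt, 300dpi)."""
    return point_size >= min_point_size and dpi >= min_dpi


if __name__ == "__main__":
    for size in (6, 8, 10, 12, 14):
        height = font_height_px(size, 300)
        ok = meets_ocr_guidance(size, 300)
        print(f"{size}pt at 300dpi is about {height:.1f}px tall, meets guidance: {ok}")
```

Under these assumptions, the same 10pt text scanned at 150dpi would be half as tall in pixels as at 300dpi, which is one way to see why both values matter.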
The latest version of Enterprise Recon has the ability to cache the result of OCR on an image within a scan. This result is reused if the same image is encountered in multiple locations within the scan, improving performance significantly.
An example of this is scanning data sources like email in which identical images frequently occur in different email messages.
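The caching behaviour described above can be pictured with a minimal sketch. This article does not say how Enterprise Recon decides that two images are "the same", so the example assumes a SHA-256 hash of the raw image bytes as the cache key. The class name `OcrCache` and the `ocr_func` callable are hypothetical stand-ins, not part of any product API.

```python
import hashlib
from typing import Callable, Dict


class OcrCache:
    """Reuse OCR output for byte-identical images within one scan (illustrative)."""

    def __init__(self, ocr_func: Callable[[bytes], str]) -> None:
        self._ocr_func = ocr_func            # hypothetical: any function that OCRs image bytes
        self._results: Dict[str, str] = {}   # content hash -> extracted text
        self.engine_calls = 0                # how often the OCR function actually ran

    def text_for(self, image_bytes: bytes) -> str:
        # Assumption: identical images are recognised by hashing their contents.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._results:
            self.engine_calls += 1
            self._results[key] = self._ocr_func(image_bytes)
        return self._results[key]
```

With a cache like this, an image attached to many email messages would be passed to the OCR function once, and every later occurrence would be served from the stored result.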
To show in practice how Card/Data Recon sees images during the OCR process, a sample TIFF image file has been created below showing a test card in 8 different sizes, ranging from 6 point to 20 point font.
This image is attached at the bottom of this article for your reference and internal testing purposes.
The OCR engine's interpretation of the image and text extraction generated the following results:
6 point font
- 12 of the 16 characters were individually detected with poor accuracy
- The "6pt font" label was not detected
- The number presented to the scanning engine was not a valid card and it was therefore not detected
8 point font
- The "6" was consistently interpreted by the OCR engine as a "5", so the resulting number was not a valid test card and was not detected by the scanning engine
- The label was interpreted completely differently from what was written
10 point font
- The CHD (cardholder data) was detected correctly; however, the OCR engine was unable to extract the label "10pt font" exactly as written
12 point font
- The CHD was detected correctly, but the OCR engine was unable to extract "t font" from the label "12pt font"
14 point font and larger
- All detected by the OCR engine exactly as written
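The results above show that a single misread digit is enough for a number to stop being recognised as a valid card. This article does not state which validity check the scanning engine applies. As an assumption for illustration, the sketch below uses the Luhn checksum, a widely used check for card numbers that catches any single-digit error. The number shown is an illustrative one that passes the Luhn check, not the test card from the sample image.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    as_written = "6011111111111117"   # illustrative number that passes the check
    as_misread = "5011111111111117"   # same number with the leading 6 read as a 5
    print(luhn_valid(as_written))     # True
    print(luhn_valid(as_misread))     # False
```

Under this assumption, the 8 point result is unsurprising: once the "6" is read as a "5", the checksum no longer balances and the number is not reported.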
The following requirements must be met for OCR to work:
- Enterprise Recon or Card/Data Recon Advanced
- Characters within scanned images must be at least 10pt in size, and images must be at least 300dpi
- Supported formats:
- PDF containing any of the above
Enabling OCR (CR/DR)
Follow the steps below in Card/Data Recon to enable OCR functionality:
- Launch Card/Data Recon Advanced
- Click on the 'Custom Rules' button shown below
- Then click the 'Add' button and select 'Enable OCR'
- That's it! You may now go back to the main page to configure/start your scan.
Enabling OCR (ER)
Follow the steps below in Enterprise Recon to enable OCR functionality:
- Start a new search
- Select your Scan Target and proceed to the next step
- When choosing your Data Type Profile, hover over the chosen profile, click on the settings icon on the right, then select 'Edit New Version'
- Under the 'Advanced Features' tab, flip the 'Enable OCR' switch to 'On', then click 'Ok' once you're done
- You may now proceed with your scan with OCR enabled
All information in this article is accurate and true as of the last edited date.