Card Recon, Data Recon, and Enterprise Recon can be used to scan for credit card numbers. This article explains the steps we take to reduce the number of false positives returned.
We use four main methods to eliminate false positives:
- MOD10 verification
Almost all current credit card numbers end in a check digit calculated with the Luhn algorithm (also known as MOD10). Where appropriate, we validate each candidate sequence of digits against this check, which quickly eliminates the majority of false positives.
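As an illustration of the MOD10 check, the following is a minimal sketch of the standard Luhn algorithm in Python. The function name `luhn_valid` is hypothetical and chosen for this example; it is not taken from the scanning engine itself.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn (MOD10) check.

    Spaces, hyphens and any other non-digit characters are ignored.
    """
    digits = [int(ch) for ch in candidate if ch in "0123456789"]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit (the check digit) to the left,
    # doubling every second digit along the way.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit

    # A valid number produces a total that is a multiple of 10.
    return total % 10 == 0
```

For example, the widely published test number `4111 1111 1111 1111` passes this check, while changing its last digit to `2` makes it fail.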
- Length/Prefix checks
The major card brands publish prefix lists for valid cards. Some examples can be found on Wikipedia. Our scanning engine checks each candidate for a valid prefix and the appropriate number of digits. This check eliminates 80% of the false positives that pass through the MOD10 check. (Please note that we do not use BIN lists, as these are only available to financial institutions and scheme members. Unofficial lists found online are never 100% accurate and are therefore unsuitable for use in commercial-grade software.)
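A length/prefix check of this kind could be sketched as follows. The table below is only a small illustrative subset of publicly documented prefix ranges (Visa, American Express and Mastercard), not the list used by the scanning engine, and the names `PREFIX_RULES` and `match_brand` are hypothetical.

```python
# Illustrative subset of publicly documented prefix ranges.
# Each rule: (brand, lowest prefix, highest prefix, allowed total lengths).
# This is an assumption for the example, not the engine's actual table.
PREFIX_RULES = [
    ("Visa", 4, 4, {13, 16, 19}),
    ("American Express", 34, 34, {15}),
    ("American Express", 37, 37, {15}),
    ("Mastercard", 51, 55, {16}),
    ("Mastercard", 2221, 2720, {16}),
]


def match_brand(number: str):
    """Return the brand whose prefix and length rules match, or None."""
    if not (number.isascii() and number.isdigit()):
        return None
    for brand, low, high, lengths in PREFIX_RULES:
        width = len(str(low))  # number of leading digits to compare
        if len(number) in lengths and low <= int(number[:width]) <= high:
            return brand
    return None
```

With these rules, `match_brand("4111111111111111")` returns `"Visa"`, while a 16-digit string starting with `1` returns `None` because no rule covers that prefix.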
- Native format decoding
Our scanning engine natively understands a wide variety of file formats where sensitive data can be found, including simple types (such as Text, CSV, XML, and Microsoft Office), complex types (including Microsoft Outlook PST and OST), and database formats (including Microsoft Access, Microsoft SQL Server LDF/MDF, and many more). Our ability to decode the underlying data structures allows sensitive information to be identified in clear, unobstructed form, which dramatically reduces the likelihood of false positives.
The decoding engine never skips a file, even if the format is unrecognised. Instead, it uses a fall-back decoding method that we refer to as generic binary decoding. This method is only ever used as a last resort. It enables the decoding engine to strip out the binary data that normally causes high false positive levels and scan only the remaining clear text.
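The internals of generic binary decoding are not described here, but the general idea of discarding binary data and keeping only clear text can be sketched as below. This example assumes that "clear text" means runs of printable ASCII bytes of a minimum length, which is similar to what the Unix `strings` utility extracts; the function name `extract_clear_text` and the `min_run` parameter are hypothetical.

```python
import re
from typing import List


def extract_clear_text(data: bytes, min_run: int = 4) -> List[str]:
    """Return runs of at least `min_run` printable ASCII bytes found in `data`.

    Assumption for this sketch: everything outside the range 0x20-0x7E
    (space through tilde) is treated as binary noise and dropped.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

Only the extracted runs would then be passed to the pattern matcher, so digit-like byte sequences buried in binary noise are never considered.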
- Contextual data and statistical analysis
In the early stages of developing our scanning engine, Ground Labs' engineering team spent considerable time analysing sets of both genuine matches and false positives to determine the characteristics of each, and found that the majority of false positives fall into very specific contextual patterns.
For example, we often identify false positives in uncompressed bitmap data (images, sound, icons, etc.). Even if the file type cannot be identified automatically, bitmap data can be identified with near certainty by examining approximately 300 bytes of data before and after the match and applying a series of algorithms to determine its true context.
Similar methods can be used for other data types, for example to determine whether a 16-digit string is really a PAN or a website cookie token. Analysis of the overall findings for a given file also provides further verification of the likelihood of any given finding.
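The algorithms applied to the surrounding data are not detailed in this article. Purely as an illustration of the idea of examining roughly 300 bytes on either side of a match, the sketch below uses one simple, hypothetical heuristic: the share of printable bytes in the surrounding window. The function names, the `radius` default and the `threshold` value are all assumptions made for this example.

```python
def context_window(data: bytes, start: int, end: int, radius: int = 300) -> bytes:
    """Return up to `radius` bytes before and after the match at data[start:end]."""
    before = data[max(0, start - radius):start]
    after = data[end:end + radius]
    return before + after


def printable_ratio(window: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, or line breaks."""
    if not window:
        return 1.0  # no context available, so nothing suggests binary data
    printable = sum(1 for b in window if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(window)


def looks_like_binary_context(data: bytes, start: int, end: int,
                              radius: int = 300, threshold: float = 0.7) -> bool:
    """Hypothetical heuristic: flag a match whose surroundings are mostly non-text."""
    window = context_window(data, start, end, radius)
    return printable_ratio(window) < threshold
```

A match embedded in ordinary text would score close to 1.0 and be kept, whereas a digit sequence that happens to appear inside raw image or audio bytes would typically score far lower. A production engine would combine several such signals rather than rely on a single ratio.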
Our software is designed to err on the side of caution. To that end, if we are not completely certain that a match is a false positive, we report it as a PAN (Primary Account Number) and let the end user determine the accuracy of the match.
All information in this article is accurate and true as of the last edited date.