originally published August 2012, http://www.EndecaCommunity.com
The adage “Garbage In, Garbage Out” is familiar, but the risk is also described as “Garbage In, Gospel Out”, a distinction that may resonate with many BI practitioners. Particularly when dealing with aggregated data, the cost of invalid business decisions driven by bad data can be high and hard to detect; for one client I saw this translate into millions in estimated losses. The priority around data quality can’t be overstated, and once users lose trust in data accuracy it can be challenging to win back.
Determining how to validate data starts with identifying what constitutes invalid data. Relational database engines can enforce much of our domain and referential integrity through column data types and table constraints, but they don’t address every possible check, and as data flows through multiple systems they aren’t always applicable. In building validation solutions I start by focusing on five areas:
- Data Formats – The simplest validation is conformance to a data type: a number must be a number; a date must be a date. These checks can be easy to enforce, yet many systems and data sources treat all data as generic text. A .csv file, for example, applies no constraints, but it is a standard medium for data exchange.
- Range Checks – Having satisfied data type constraints, we may still have values that make no business sense. A birth date 50 years in the future is technically a valid date, but clearly impossible. Many values have boundaries that can be identified and enforced.
- Lookups – Some lookups are referential (uniqueness of a value, valid foreign key references) and can be checked dynamically. For static reference values, exposing the lists so end users can maintain them helps the rules evolve with the business: ISO publishes country codes, but countries still come and go.
- Synchronicity – Often data will pass the previous validations but won’t make sense with respect to other data: for example, a student with a graduation date 4 years before their enrollment, or an enrollment in a year that program wasn’t offered.
- Trends – This last category first considers what a reasonable change in a value would be: for example, a customer’s water bill that jumped from a residential average to a level akin to a nuclear reactor. More insidious to detect are slowly changing aggregates. A few refund transactions aren’t unusual, but when a store’s refund percentage climbs from 2% to 40%, that may suggest a data corruption issue.
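The five categories above can be sketched as simple checks over a record. This is a minimal illustration in Python; the field names, the student/refund scenario, and the 5x trend threshold are all hypothetical choices, not a prescription:

```python
from datetime import date

def validate_record(rec, valid_countries, prior_refund_rate):
    """Run the five categories of checks against one hypothetical
    record; returns a list of the issues found."""
    issues = []

    # 1. Data formats: a .csv field arrives as text, so coerce explicitly.
    try:
        enrolled = date.fromisoformat(rec["enrolled"])
        graduated = date.fromisoformat(rec["graduated"])
    except ValueError:
        return ["format: enrolled/graduated are not valid ISO dates"]

    # 2. Range checks: a date in the far future parses fine
    # but makes no business sense.
    if enrolled > date.today():
        issues.append("range: enrollment date is in the future")

    # 3. Lookups: the country code must exist in a maintained
    # reference list that end users can update.
    if rec["country"] not in valid_countries:
        issues.append(f"lookup: unknown country code {rec['country']!r}")

    # 4. Synchronicity: fields must make sense relative to each other.
    if graduated < enrolled:
        issues.append("synchronicity: graduation precedes enrollment")

    # 5. Trends: flag an aggregate that moves far beyond its history,
    # e.g. a refund rate jumping from ~2% toward 40%.
    if rec["refund_rate"] > prior_refund_rate * 5:
        issues.append("trend: refund rate jumped more than 5x")

    return issues
```

A record failing several checks at once, such as an unknown country plus a graduation date before enrollment, would come back with one issue per category, which makes the report easy to route to whoever owns each rule.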
The second critical question is who will take responsibility for monitoring and dealing with validation issues. There may be false positives that should be reviewed and accepted so they don’t keep getting reported; in some cases common data issues can be auto-corrected or replaced with a default value. The validation rules themselves can also become outdated, or even start to introduce data corruption of their own.
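That triage can be sketched as a small handler. This is a hypothetical pattern, assuming issues are identified by rule name; the rule names and default values shown are invented for illustration:

```python
# Hypothetical triage of validation issues: auto-correct known benign
# cases, suppress issues already reviewed and accepted as false
# positives, and queue everything else for a data steward.
DEFAULTS = {"missing_country": ("country", "UNKNOWN")}  # assumed defaults
ACCEPTED = {"legacy_date_format"}                       # accepted false positives

def triage(record, issues):
    review_queue = []
    for issue in issues:
        if issue in ACCEPTED:
            continue                       # reviewed once; stop reporting it
        elif issue in DEFAULTS:
            field, default = DEFAULTS[issue]
            record[field] = default        # auto-correct with a default value
        else:
            review_queue.append(issue)     # needs a human decision
    return record, review_queue
```

Whatever the mechanics, the accepted-issues and defaults lists need an owner, since they encode business decisions, not technical ones.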
The bottom line is that a complete solution needs a plan for monitoring and resolving any issues that come up. The business has to commit to, or assign responsibility for, data stewardship, or, like a garden, its data will start to grow weeds. Validating data is an ongoing challenge.