Data quality routines should not “whitewash” bad data. In other words, don’t simply replace bad data with good data. That would be like putting a fresh coat of paint over rotting wood. Even replacing the rotting wood isn’t necessarily sufficient; you need to find the cause of the problem and address it. The same is true with bad data. There are plenty of techniques to clean bad data. For example, missing numerical data could be replaced with the previous value collected, the average of the values collected, the minimum or the maximum, zero, or an interpolation between the previous and next values. Whatever correction is made, the original data should be retained. That’s part of the value of a data lake: storing and retaining the original raw data for various types of analysis. Data warehouses, on the other hand, tend to offer only clean data, often summarized to the point that it’s impossible to identify anomalies in the source systems and data.
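To make those options concrete, here is a minimal sketch in Python using pandas; the DataFrame and column names are hypothetical. The point is that the corrected values sit alongside the retained original values rather than overwriting them.

```python
# Minimal sketch of the imputation options described above (hypothetical data).
# The original values are retained alongside the corrections, never overwritten.
import pandas as pd

raw = pd.DataFrame({
    "reading": [10.0, 12.0, None, 11.0, None, 13.0]  # numeric data with gaps
})

corrected = raw.copy()
corrected["reading_original"] = raw["reading"]      # keep the raw values as-is
corrected["was_missing"] = raw["reading"].isna()    # flag which rows were bad

# Several of the correction strategies mentioned above:
corrected["reading_ffill"] = raw["reading"].ffill()                        # previous value collected
corrected["reading_mean"] = raw["reading"].fillna(raw["reading"].mean())   # average of values collected
corrected["reading_zero"] = raw["reading"].fillna(0.0)                     # zero
corrected["reading_interp"] = raw["reading"].interpolate()                 # previous/next interpolation

print(corrected)
```

Whichever strategy is chosen, the original values and the missing-value flag preserve what actually happened, which is exactly what the data lake should keep.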
Knowing what data actually arrived is the first clue in identifying and solving the problem. Let’s look at the types of bad data that might occur.
Seeing the history of specific data problems is also useful. Data quality monitoring should include tracking the frequency of issues over time. The remedies applied to data quality issues should themselves become data that is retained in the data lake. Machine learning algorithms can then examine the quality issues and their associated resolutions to create models that help automate the correction of errors. As I’ve written previously, you want to automate your data operations as much as possible so your organization can be as agile as possible.
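As one illustration of that kind of tracking, the sketch below uses hypothetical field names and records. It retains each quality issue together with its resolution and rolls up issue frequency by month; this is the history that a scorecard or an automated-correction model would draw on.

```python
# Sketch of retaining data quality issues and their resolutions (hypothetical schema).
# In practice these records would be stored in the data lake alongside the raw data.
import pandas as pd

issues = pd.DataFrame([
    {"detected_at": "2024-01-03", "issue_type": "missing_value", "resolution": "interpolated"},
    {"detected_at": "2024-01-17", "issue_type": "out_of_range",  "resolution": "clamped_to_max"},
    {"detected_at": "2024-02-02", "issue_type": "missing_value", "resolution": "forward_filled"},
    {"detected_at": "2024-02-20", "issue_type": "missing_value", "resolution": "forward_filled"},
])
issues["detected_at"] = pd.to_datetime(issues["detected_at"])

# Frequency of each issue type per month: the trend a monitoring dashboard would show.
monthly = (
    issues
    .groupby([issues["detected_at"].dt.to_period("M"), "issue_type"])
    .size()
    .rename("count")
)
print(monthly)
```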
The bottom line is to make sure your information architecture includes plans to capture and retain the original data, good or bad. Make sure you clearly identify which data is original and which has been corrected, so those who want to examine the raw values can do so. Use this information to create data quality scorecards and set goals for improving or maintaining data quality. Most importantly, identify and resolve the source of data quality problems so your organization can operate with the most accurate data possible.
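As a final, hedged illustration, a scorecard can start as a handful of metrics compared against targets; the completeness measure and the 95% goal below are illustrative assumptions, not a prescribed standard.

```python
# Illustrative data quality scorecard metric (assumed data and threshold).
import pandas as pd

raw = pd.DataFrame({"reading": [10.0, 12.0, None, 11.0, None, 13.0]})

completeness = 1.0 - raw["reading"].isna().mean()   # share of rows with usable values
goal = 0.95                                         # example target to improve toward

status = "meets goal" if completeness >= goal else "below goal"
print(f"completeness: {completeness:.1%} (goal {goal:.0%}), {status}")
```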
Regards,
David Menninger