Data quality routines should not “whitewash” bad data. In other words, don’t simply replace bad data with good data. That would be like putting a fresh coat of paint over rotting wood. Even replacing the rotting wood isn’t necessarily sufficient; you need to find the cause of the problem and address it. The same is true with bad data. There are plenty of techniques to clean bad data. For example, missing numerical data could be replaced with the previous value collected, the average of the values collected, the minimum or the maximum, zero, or an interpolation between the previous and next values. Whatever correction is made, the original data should be retained. That’s part of the value of a data lake: storing and retaining the original raw data for various types of analysis. Data warehouses, on the other hand, tend to offer only clean data, often summarized to the point that it’s impossible to identify anomalies in the source systems and data.
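To make those options concrete, here is a minimal sketch in Python using pandas; the DataFrame and column names are hypothetical. The point is that the corrected values sit alongside the retained original values rather than overwriting them.

```python
# Minimal sketch of the imputation options described above (hypothetical data).
# The original values are retained alongside the corrections, never overwritten.
import pandas as pd

raw = pd.DataFrame({
    "reading": [10.0, 12.0, None, 11.0, None, 13.0]  # numeric data with gaps
})

corrected = raw.copy()
corrected["reading_original"] = raw["reading"]      # keep the raw values as-is
corrected["was_missing"] = raw["reading"].isna()    # flag which rows were bad

# Several of the correction strategies mentioned above:
corrected["reading_ffill"] = raw["reading"].ffill()                        # previous value collected
corrected["reading_mean"] = raw["reading"].fillna(raw["reading"].mean())   # average of values collected
corrected["reading_zero"] = raw["reading"].fillna(0.0)                     # zero
corrected["reading_interp"] = raw["reading"].interpolate()                 # previous/next interpolation

print(corrected)
```

Whichever strategy is chosen, the original values and the missing-value flag preserve what actually happened, which is exactly what the data lake should keep.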
Knowing what data actually arrived is the first clue in identifying and solving the problem. Let’s look at the types of bad data that might occur.
Seeing the history of specific data problems is also useful. Data quality monitoring should include tracking the frequency of issues over time. The remedies applied to data quality issues should themselves become data that is retained in the data lake. Machine learning algorithms can then examine the quality issues and their associated resolutions to create models that help automate the correction of errors. As I’ve written previously, you want to automate your data operations as much as possible so your organization can be as agile as possible.
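As one illustration of that kind of tracking, the sketch below uses hypothetical field names and records. It retains each quality issue together with its resolution and rolls up issue frequency by month; this is the history that a scorecard or an automated-correction model would draw on.

```python
# Sketch of retaining data quality issues and their resolutions (hypothetical schema).
# In practice these records would be stored in the data lake alongside the raw data.
import pandas as pd

issues = pd.DataFrame([
    {"detected_at": "2024-01-03", "issue_type": "missing_value", "resolution": "interpolated"},
    {"detected_at": "2024-01-17", "issue_type": "out_of_range",  "resolution": "clamped_to_max"},
    {"detected_at": "2024-02-02", "issue_type": "missing_value", "resolution": "forward_filled"},
    {"detected_at": "2024-02-20", "issue_type": "missing_value", "resolution": "forward_filled"},
])
issues["detected_at"] = pd.to_datetime(issues["detected_at"])

# Frequency of each issue type per month: the trend a monitoring dashboard would show.
monthly = (
    issues
    .groupby([issues["detected_at"].dt.to_period("M"), "issue_type"])
    .size()
    .rename("count")
)
print(monthly)
```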
The bottom line is to make sure your information architecture includes plans to capture and retain the original data, good or bad. Make sure you clearly identify which data is original and which has been corrected, so those who want to examine the raw values can do so. Use this information to create data quality scorecards and set goals for improving or maintaining data quality. Most importantly, identify and resolve the source of data quality problems so your organization can operate with the most accurate data possible.
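As a final, hedged illustration, a scorecard can start as a handful of metrics compared against targets; the completeness measure and the 95% goal below are illustrative assumptions, not a prescribed standard.

```python
# Illustrative data quality scorecard metric (assumed data and threshold).
import pandas as pd

raw = pd.DataFrame({"reading": [10.0, 12.0, None, 11.0, None, 13.0]})

completeness = 1.0 - raw["reading"].isna().mean()   # share of rows with usable values
goal = 0.95                                         # example target to improve toward

status = "meets goal" if completeness >= goal else "below goal"
print(f"completeness: {completeness:.1%} (goal {goal:.0%}), {status}")
```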
Regards,
David Menninger