What is today's data panacea? In my many decades of work in the data disciplines, dozens of technologies have emerged claiming they will "fix" bad data. Creating a data solution that works for everyone is incredibly difficult. The trouble comes, of course, from having to think about data both tactically and strategically: it's hard to create corporate or master data strategies when every user group you talk to has a different idea of what "good" should look like from their perspective.
Photo Credit: https://xkcd.com/2494/
Attempting to satisfy many diverse expectations often produces a solution that pleases no one, confuses everyone, and leads to poor decision-making and increased risk.
Decades ago, relational databases did a great job of keeping data structures aligned, and they remain a solid workhorse for data management, even if newcomers dismiss them as "grandma's technology". A mere four decades ago, we (and "we" here means pretty much everyone) thought that if we all shared the same container design, all of our data woes would vanish. Not so, alas.
From relational containers, we leapt with increasing vigor (desperation?) to NoSQL and NewSQL technologies. The problem remained the same; if anything, data quality and integrity got worse. Will a rinse and repeat with JSON, YAML, Avro, Parquet, Optimized Row Columnar (ORC), and the rest solve the problem? Technology is, of course, not magic, so success hinges on other factors.
Shared container designs expose, but don't resolve, differences in semantics or content. Through the years, we discovered (but sadly didn't learn the lesson) that data problems are not solved by a "standard format", whatever technology that format is built on.
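To make that concrete, here is a minimal, purely illustrative sketch (the well identifier, field names, and values are invented): two records share exactly the same JSON container layout, yet the numbers do not mean the same thing.

```python
import json

# Two records in the *same* container format with the same field layout.
# The structure matches perfectly; the meaning does not.
record_a = json.loads('{"well_id": "W-1001", "total_depth": 2500, "datum": "KB"}')
record_b = json.loads('{"well_id": "W-1001", "total_depth": 8202, "datum": "GL"}')

# A structural check happily passes: same keys, same types.
assert record_a.keys() == record_b.keys()

# But the content conflicts: source A reports depth in metres from the kelly
# bushing, source B in feet from ground level. The shared format exposed the
# difference (the fields line up side by side) but did nothing to resolve it.
for key in record_a:
    if record_a[key] != record_b[key]:
        print(f"conflict in '{key}': {record_a[key]!r} vs {record_b[key]!r}")
```

The format lines the fields up neatly; only agreed semantics (units, reference datums, definitions) can make the two records comparable.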
An assumption of success grounded in the use of data container technology is pure fallacy. "We" know that, yet we keep trying again and again, with pretty much the same result.
"We" have tried methodologies. Data warehouses, data lakes, data lake houses, and more have been tried. All have the potential to provide significant benefits, but only when commensurate attention is spent on the data being inserted into the system.
The trouble, of course, is that data requires constant attention to stay good, something budgets and budgeters don't like to see as a repeating line item year after year. An assumption of success grounded in methodology alone is a fallacy. Methodology is part of the equation, but it can't stand alone.
We tried data mastering, data governance, and data quality. These are all extremely powerful methods. However, the majority of implementations leaned on software tools (technology) in the hope that tooling alone would make the data "good", and many struggled with the people, culture, and process aspects of these methods. It is people (hard work) and process (solid thinking, planning, and execution) that spell success, not software.
Data exchange has been (correctly) thought to be a significant part of the data problem. Indeed, since most of our data comes from external sources, any differences between those sources (the PPDM Association calls this Data Dissonance) create conflicts that need to be resolved. Despite many determined efforts with data exchange formats (EDI, CSV, XML, etc.), data exchanges have not solved our data problems, because the formats carry no guarantee of conformity in content. More commonly, they make data problems even more difficult to resolve.
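As a hypothetical illustration of that dissonance (all identifiers, dates, and names below are invented), here is the same well arriving from two partners in two exchange formats. Both payloads parse cleanly, so the exchange "works", yet the content still conflicts:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Partner 1 delivers CSV; partner 2 delivers XML. Both payloads parse without
# error, so the exchange itself succeeds. (All values are invented.)
csv_payload = "uwi,spud_date,operator\n100-01-01-001-01W4-00,2024/03/15,ACME ENERGY\n"
xml_payload = (
    "<well><uwi>100010100101W400</uwi>"
    "<spud_date>15-Mar-2024</spud_date>"
    "<operator>Acme Energy Ltd.</operator></well>"
)

csv_row = next(csv.DictReader(io.StringIO(csv_payload)))
xml_row = {child.tag: child.text for child in ET.fromstring(xml_payload)}

# The formats moved the bytes faithfully, but the content does not conform:
# different UWI punctuation, different date conventions, different operator
# spellings. Resolving that dissonance is still human work.
for key in csv_row:
    if csv_row[key] != xml_row.get(key):
        print(f"dissonance in '{key}': CSV={csv_row[key]!r} XML={xml_row.get(key)!r}")
```

The exchange format did its job; deciding which identifier, date convention, and operator name is authoritative remains a people-and-process job.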
Today, I hear a lot about how AI is going to solve our data problems. I'm astonished at our persistent belief that the right technology will do that work for us. After so many decades, data professionals know that technology is a tool, not a panacea.
At the PPDM Association 2024 Houston Convention, we heard from many AI experts across many industry sectors. Guess what every one of them said? Successful deployment and use of AI depends on a framework of good, well-managed data; without that, it will fail. Dismally.
Anyone who thinks AI will magically produce sound data needs to promote logical thinking by eating more chocolate (dark, of course). It won't.
But it can help! PPDM members are developing a technology-neutral framework that describes what good data should look like and how it should behave. A standard framework of PPDM data objects will help train people and technology and inform data processes. We can get data "right" from the start, retain data integrity, prevent data attenuation, and eliminate data dissonance, so that data becomes, and stays, trusted. And that's a win for everyone.
Continue to follow our socials to find out about our upcoming events, workshops, and projects.