On June 16, 2021, Mr. Christopher Williamson, a resident of Georgia, USA, woke up, checked his Coinbase cryptocurrency trading account and found (to his surprise) that the balance exceeded $1 trillion. He had unexpectedly become the first trillionaire in the history of humanity. It was the Information Age equivalent of finding a pot of gold at the end of the rainbow — raised by many, many orders of magnitude.
While we don’t know the myriad of thoughts that must have run through Mr. Williamson’s mind that day, it is highly probable one of them was:
“How did it get there?”
In asking this question, Mr. Williamson is far from alone. It’s what is asked every day by countless business users in enterprises across the planet as they are confronted by strange “blips” of data, startling inconsistencies, or numbers that make no sense in the reports and dashboards they rely on to do their jobs. These kinds of issues are rarely caused by the reports themselves suddenly breaking. They are caused by something going wrong with the data, and everyone intuitively understands that.
We can rephrase this natural question as: “Where did the data come from?” This makes it much clearer that when business users see these issues, they want to know what data came from where, what route it took to get into the report, and what happened to it on the way. In other words, they want to know the data lineage.
Can data lineage be known? The answer is that it can be, but what is often presented as “data lineage” falls far short of what is truly needed.
Business users must understand key concepts about data lineage so they can participate effectively if they become involved in the acquisition of data lineage technologies. If business users do not understand these concepts, they will not be able to communicate their requirements effectively, and may get saddled with a technology that does not give them what they need.
What is Data Lineage?
Typically, a business user starts from an endpoint, such as a production report, and wants to look back through the data pipeline to understand how the data got into that report. That is clear enough, but consultants and vendors in the data industry use the term “data lineage” in many different ways, which often causes confusion.
The three main meanings of “data lineage” are:
Data Relationships: The visualization of any relationship in the data is sometimes branded as “data lineage.” For instance, consider the relationship where one Customer can have many Accounts. This can be represented in a diagram generated on a screen by a box that appears for a Customer, some boxes for Accounts, and lines joining the Customer box to the Account boxes. This is described as a “data lineage diagram.” However, it is not. There is no flow of data in these types of diagrams. They are structural, logical diagrams. They certainly have a lot of value, but they are not data lineage, and they will not help you find out how strange numbers got into a report.
Data Roadmaps: We’ve probably all used Google Maps, Waze or some equivalent of GPS navigation, and we know how incredibly useful they are. If one of us invites a guest to our house, we just need to give the guest our address, and they can use one of these tools to figure out how to get there. There is a type of data lineage that is like this, too. It is essentially a map that shows the route or routes that data can take as it flows through data pipelines and gets to the report where we found the problem we are trying to deal with. (We will come back to data roadmaps in a moment.)
Data Provenance: In the world of fine art, notorious for forgery, theft and ownership disputes, “provenance” is a key concept. It is the chain of custody of legal ownership, back to the original artist, which must be documented, warranted and provable. In data lineage, this is being able to prove how a particular piece of data (a data value) got to an endpoint, such as a report where a business user suspects there is a problem. It is very similar to provenance in fine art: it is the chain of custody of an individual data value.
What Business Users Need to Know
In today’s data industry, there is a plethora of tools and methodologies that describe themselves as data lineage solutions. But as we have just seen, the term “data lineage” can describe different kinds of things. Very often, business users get brought into data lineage discussions by their technical partners in IT, and it is crucial for business users to recognize, at a high level, what is being presented to them. Otherwise, they risk being railroaded into providing support for a decision that may not provide what they need.
As we have seen, anything that visualizes general “static” or logical relations among data and has nothing to do with data flow is not data lineage. No matter how useful it is for other use cases and how impressive the visualization looks, it should not be considered any further by business users who need to know data lineage.
Data roadmaps are more useful, but only up to a point. If a business user wants to know how a value of $1 trillion ended up in a client’s statement, these roadmaps will only show the possible routes data could take to reach the endpoint.
When we use GPS navigation to plan a trip, we are presented with options, such as “avoid toll roads” or “where to stop for gas on the way” or perhaps scenic waypoints. Also, the alternatives differ depending on whether we want to drive, take a bus, or walk.
The situation with data is analogous. There are myriad ways data can travel from point A to point B. Just by looking at the map — however beautiful the visualization — we cannot tell which route was actually taken.
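To make the limitation concrete, here is a minimal sketch in Python of a data roadmap as a simple map of which data store feeds which. The store names and the `possible_routes` helper are hypothetical, invented purely for illustration. The sketch lists every route from a source system to a client statement, and that is all a roadmap can do.

```python
from typing import Dict, List, Optional

# Hypothetical data roadmap: each data store lists the stores it feeds.
PIPELINE: Dict[str, List[str]] = {
    "trading_system": ["staging", "data_lake"],
    "staging": ["warehouse"],
    "data_lake": ["warehouse", "reporting_mart"],
    "warehouse": ["reporting_mart"],
    "reporting_mart": ["client_statement"],
    "client_statement": [],
}


def possible_routes(
    graph: Dict[str, List[str]],
    start: str,
    end: str,
    path: Optional[List[str]] = None,
) -> List[List[str]]:
    """Return every route from start to end without revisiting a store."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    routes: List[List[str]] = []
    for next_store in graph.get(start, []):
        if next_store not in path:  # skip stores already on this route
            routes.extend(possible_routes(graph, next_store, end, path))
    return routes


routes = possible_routes(PIPELINE, "trading_system", "client_statement")
for route in routes:
    print(" -> ".join(route))
```

With this made-up map, the sketch finds three distinct routes to the statement. Nothing in the map records which of them a given value actually followed.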
Data provenance is the most precise form of data lineage. A strange data value can be searched for in every upstream data store to see where it appears. Data generally stays in data stores for a long time, so this is a practical option. It should be remembered that data does not really “travel.” A copy of a data value is made and put into a data store further down the data pipeline, and this is repeated over and over again, giving the impression that data is being moved. In reality, it is a kind of cascading copy. As a result, data provenance is achievable since a value can be traced back through the pipeline. This is what business users need to get real answers to the complex problem of why a strange data value showed up in one of their reports.
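The tracing idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the store names, the account key, the figures, and the `trace_value` helper are all invented, and the single straight-line pipeline is a deliberate simplification. Real pipelines branch and often transform values along the way, so a real search would not be a simple equality check.

```python
from typing import Dict, List, Tuple

# Hypothetical data stores, listed from the report back upstream.
# Each store holds its own copy of the account balance (the cascading copy).
STORES_UPSTREAM_ORDER: List[Tuple[str, Dict[str, int]]] = [
    ("client_statement", {"acct-42": 1_000_000_000_000}),
    ("reporting_mart", {"acct-42": 1_000_000_000_000}),
    ("warehouse", {"acct-42": 1_000_000_000_000}),
    ("staging", {"acct-42": 1_000}),
    ("trading_system", {"acct-42": 1_000}),
]


def trace_value(
    stores: List[Tuple[str, Dict[str, int]]],
    key: str,
    suspect_value: int,
) -> List[str]:
    """Walk upstream and collect the stores that hold the suspect value.

    Stops at the first store where the value is absent or different.
    """
    custody: List[str] = []
    for name, records in stores:
        if records.get(key) != suspect_value:
            break
        custody.append(name)
    return custody


chain = trace_value(STORES_UPSTREAM_ORDER, "acct-42", 1_000_000_000_000)
origin = chain[-1] if chain else None
print("Chain of custody:", " <- ".join(chain))
print("Furthest upstream store holding the value:", origin)
```

In this invented data, the suspect value is present in the statement, the reporting mart, and the warehouse, but not in staging. That points the investigation at the copy step between staging and the warehouse.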
Understand Data Lineage When Selecting Tools
Business users are increasingly voicing their need for data lineage solutions. Getting new enterprise-class software is nearly always the responsibility of the IT department, so business users will have to partner with IT for any data lineage tool acquisition. This raises the risk of business users being confronted by a dizzying array of technical features and undoubtedly beautiful visualizations that may or may not be useful. It is vital that business users understand the concepts we have described to get to a solution that will truly satisfy their needs.