The Quest for a Pragmatic Approach to Automate Data Lineage

In this post, we introduce the first of an ongoing series of data management topics: Data Lineage.

February 21, 2020

Categorized in: Data Lineage, Compliance, Data Quality, Automation

As Socrates is reputed to have said, “The beginning of wisdom is the definition of terms.” The area of data management, like many other areas in business technology, is rife with jargon, buzzwords, and terms whose meanings are vague, overlapping, or redundant. Definitions are necessary to cut through the confusion, and on this blog we will make a point of both minimizing unnecessary jargon and defining the key terms we do use.

As background for defining data lineage, consider that every bit of enterprise data has a point of origin — which may be either external or internal to the enterprise — and a path that it takes as it flows through the systems that make up your enterprise data infrastructure. This flow path may have branches, and applications may augment or transform the data along the way.

Data lineage is the systematic tracing of the paths taken by data through your enterprise, from its points of origin to the various endpoints where it is consumed by business users, reviewed by outside agencies (compliance), or archived for posterity (data warehouse).

The systematic tracing of the flow of enterprise data has become a necessary function of enterprise data management. Consider some of the key business functions that depend on it:

  • Compliance. When a regulator questions a value in a report, what they want to know is: “Where did this value come from? Is it a computed value? What values did the computation depend on, and where did they come from?” Data lineage directly answers this question of data provenance.
  • Data quality. Where are the critical points in the flow of each type of business data where the quality of the data should be subject to monitoring and control? Data lineage maps the existing flow of each business object, thus enabling the most effective placement of data quality checks and controls.
  • Planning and impact analysis of large-scale changes to data infrastructure. Large-scale changes include rationalization and streamlining business processes involving multiple systems, removing redundant applications and data, retiring legacy applications and systems, locating and locking down sensitive data, and many others. An accurate knowledge of the existing flow of data is necessary for rational planning as well as accurate prediction of the impact of any change on particular business functions.
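The provenance and impact questions above can both be read as traversals of a directed flow graph: upstream for "where did this value come from?" and downstream for "what does this change affect?". The following is a minimal sketch under that assumption. The node names, the `EDGES` list, and the `provenance` and `impact` helpers are all hypothetical, invented for illustration rather than taken from any particular lineage tool.

```python
from collections import defaultdict, deque

# Hypothetical flow edges (source -> target). Every name here is invented.
EDGES = [
    ("crm.customer", "staging.customer"),
    ("web.orders", "staging.orders"),
    ("staging.customer", "warehouse.customer_dim"),
    ("staging.orders", "warehouse.order_fact"),
    ("warehouse.customer_dim", "report.revenue_by_region"),
    ("warehouse.order_fact", "report.revenue_by_region"),
    ("warehouse.order_fact", "report.daily_orders"),
]


def build_adjacency(edges, reverse=False):
    """Map each node to its direct successors (or predecessors if reverse)."""
    adj = defaultdict(set)
    for src, dst in edges:
        if reverse:
            adj[dst].add(src)
        else:
            adj[src].add(dst)
    return adj


def reachable(start, adj):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def provenance(node, edges=EDGES):
    """Upstream trace: everything the given node depends on."""
    return reachable(node, build_adjacency(edges, reverse=True))


def impact(node, edges=EDGES):
    """Downstream trace: everything that a change to the given node could affect."""
    return reachable(node, build_adjacency(edges))


print(sorted(provenance("report.daily_orders")))
print(sorted(impact("crm.customer")))
```

In this toy graph, the compliance question about `report.daily_orders` is answered by the upstream trace, while the impact of retiring `crm.customer` is answered by the downstream trace. Data quality checks would plausibly be placed at nodes that many such paths pass through.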

Data lineage, however, is as challenging as it is necessary.

Manual approaches have typically been used for tracing data flow through the enterprise. But even in enterprises that are not large, human resources are inadequate to trace data flow comprehensively, let alone at a level of granularity sufficient to establish, for example, the provenance of particular data values.

Manual approaches also lack the objective verification that automation could provide, that regulators increasingly demand, and that adds rigor and rationality to project planning and decision-making.

Automated approaches to data lineage, however, face their own hurdles. The computational cost of exhaustively scanning the data environment to detect the flow of particular business objects between applications and databases is enormous and increasingly prohibitive, as data environments and the number of possible flow points grow.
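To give a rough sense of scale, suppose (purely as an assumption for illustration) that an exhaustive scan compared every physical field against every other field to look for a possible flow. The number of unordered pairs then grows roughly with the square of the number of fields. The `candidate_pairs` function below is a hypothetical helper for that arithmetic, not a description of how any real scanner works.

```python
def candidate_pairs(num_fields: int) -> int:
    """Number of unordered field pairs under a naive all-pairs comparison."""
    return num_fields * (num_fields - 1) // 2


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} fields -> {candidate_pairs(n):,} unordered pairs")
```

Under this assumption, a thousandfold increase in fields means roughly a millionfold increase in candidate comparisons, which is why exhaustive scanning becomes impractical as environments grow.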

Further, the data flows we wish to trace are precisely the flows of key business objects (Customer, Product, Order, Account, etc.) through the environment. That is, we want to trace the flow of objects of a particular logical type, regardless of the physical form they may take in this database table or that. Given the great multitude of ways in which the same type of object is physically represented across systems, there is a logical-to-physical gap that must be bridged, and this requires advanced algorithms for the logical classification of hundreds of thousands, millions, or more physical database fields.
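To make the logical-to-physical gap concrete, here is a deliberately naive sketch that assigns logical types to physical field names by keyword matching. It is not the kind of advanced classification algorithm referred to above; the hint table `LOGICAL_TYPE_HINTS`, the `classify` function, and the sample field names are all hypothetical. A realistic classifier would likely also need to look at data values, formats, and relationships between fields.

```python
import re

# Hypothetical keyword hints for each logical business object type.
LOGICAL_TYPE_HINTS = {
    "Customer": {"customer", "cust", "client"},
    "Product": {"product", "prod", "sku", "item"},
    "Order": {"order", "ord"},
    "Account": {"account", "acct"},
}


def tokenize(field_name):
    """Split a physical field name into lowercase tokens.

    Handles camelCase boundaries as well as separators such as '_' and '.'.
    """
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", field_name)
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]


def classify(field_name):
    """Return the sorted list of logical types whose hints appear in the name."""
    tokens = set(tokenize(field_name))
    return sorted(
        logical_type
        for logical_type, hints in LOGICAL_TYPE_HINTS.items()
        if tokens & hints
    )


for name in ("CUST_ID", "clientName", "ORD_HDR.ACCT_NO", "zip_code"):
    print(name, "->", classify(name))
```

Even this toy version shows the difficulty: `CUST_ID` and `clientName` look nothing alike physically yet belong to the same logical type, one field can match several types, and a name such as `zip_code` carries no hint at all.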

We are left, then, with the question: What would a truly effective and pragmatic approach to data lineage for today’s enterprise data environments look like? We will explore the answer to this question in future posts.