The Intelligence at the Heart of Data Lineage

Data lineage helps address business problems and regulatory demands that are posed in logical business terms. In order to make sense of the physical flow of data in logical terms, automated classification is a must.

March 18, 2020

Categorized in: Data Lineage, Data Classification



In our post on the quest for a pragmatic approach to data lineage, we defined data lineage as “the systematic tracing of the paths taken by data through the enterprise”. We defended the necessity of data lineage for enabling key business functions including compliance, data quality, and ecosystem rationalization, and also highlighted the challenging nature of a truly systematic and comprehensive tracing of the flow of information through the enterprise.

As we noted, the manual approach to data lineage cannot achieve a sufficient level of granularity, and is humanly unsustainable in any case. Automation must therefore do much of the work of gaining control of the lineage problem at the level of physical data and systems. However, as we further pointed out, there is a gap to be bridged between the physical representations of data, which vary across thousands of systems and millions (or more) of physical columns, on the one hand, and the logical meanings they embody on the other.

The automation of data lineage must, then, be able to reliably connect the vast “forest” of varying physical representations with the much smaller number of logical “tree types” expressed in definitions of key business objects such as Customers, Products, Orders, Accounts, Trades, and so forth. This is precisely the function provided by data classification, which we described as a foundational requirement for enterprise data management. Data classification unifies “the many” in terms of patterns or concepts that are bearers of logical meaning. In other words, it answers the question: what kind of relationship does this bit of data, and every other bit of data, have to the business?
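To make the idea of mapping “the many” onto a few logical types concrete, here is a deliberately simple sketch. It uses hand-written name patterns as a stand-in for classification; this is not the machine-learning approach argued for in this post. The pattern table, the function name `classify_column`, and the sample column names are all hypothetical.

```python
import re

# Hypothetical name patterns for a few logical business objects.
# A real classifier would also look at data values, profiles, and context.
LOGICAL_PATTERNS = {
    "Customer": re.compile(r"cust|client", re.IGNORECASE),
    "Order": re.compile(r"order|ord_", re.IGNORECASE),
    "Account": re.compile(r"acct|account", re.IGNORECASE),
}


def classify_column(physical_name: str) -> str:
    """Return the first logical type whose pattern matches, else 'Unclassified'."""
    for logical_type, pattern in LOGICAL_PATTERNS.items():
        if pattern.search(physical_name):
            return logical_type
    return "Unclassified"


if __name__ == "__main__":
    # Made-up physical column names in system.table.column form.
    for name in ["CRM.CUST_MASTER.CUST_ID", "erp.sales.ORDER_NO",
                 "gl.ledger.acct_num", "tmp.x1.col7"]:
        print(name, "->", classify_column(name))
```

Even this toy shows why rules alone do not scale: every new naming convention needs a new pattern, and ambiguous names need tie-breaking. Those are the kinds of problems automated, intelligent classification is meant to absorb.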

There has been (and still is) a lot of hype surrounding “AI”. But there is no getting around the fact that automated data classification is a function provided by machine intelligence, and that it must be provided in this setting. Classification is a function of intelligence, but human intelligence cannot cope with the classification problem at the scale of enterprise data. Machine intelligence in the form of automated data classification is thus a crucial component of enterprise data management, as we have argued, and must be at the heart of data lineage as well.

The bridging of the logical-to-physical gap through automated classification of objects and their attributes is absolutely essential for data lineage to deliver an ongoing, enterprise-wide view of the flow of every logical business object through the entire ecosystem. It is not enough to visualize data flows in physical terms. In order for the comprehensive mapping of data flows to be actionable for such business functions as detailed decision-making and objectively verified compliance, these flows must be logically framed in terms of the core business objects that are in motion.
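As a rough illustration of what “logically framed” flows might look like, the sketch below takes a set of physical column-to-column lineage edges and regroups them by the business object each source column carries. It assumes classification has already been done; the `CLASSIFICATION` table, the edge list, and the helper names are invented for the example.

```python
from collections import defaultdict

# Assumed output of a classification step: physical column -> logical business object.
CLASSIFICATION = {
    "crm.cust_master.cust_id": "Customer",
    "dw.dim_customer.customer_key": "Customer",
    "erp.sales.order_no": "Order",
    "dw.fact_orders.order_key": "Order",
}

# Assumed physical lineage edges: (source column, target column).
PHYSICAL_EDGES = [
    ("crm.cust_master.cust_id", "dw.dim_customer.customer_key"),
    ("dw.dim_customer.customer_key", "mart.churn.customer_id"),
    ("erp.sales.order_no", "dw.fact_orders.order_key"),
]


def system_of(column: str) -> str:
    """Take the first dotted segment of a column name as its system."""
    return column.split(".", 1)[0]


def logical_flows(edges, classification):
    """Group system-to-system hops by the logical object of the source column."""
    flows = defaultdict(set)
    for source, target in edges:
        logical = classification.get(source, "Unclassified")
        flows[logical].add((system_of(source), system_of(target)))
    return {obj: sorted(hops) for obj, hops in flows.items()}


if __name__ == "__main__":
    for obj, hops in logical_flows(PHYSICAL_EDGES, CLASSIFICATION).items():
        print(obj, hops)
```

The physical edges answer “which column feeds which column”; the regrouped view answers “where does Customer data travel”, which is the form in which business and compliance questions are usually asked.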

For that is how business problems and regulatory demands are posed: in logical business terms. The lineage solution that helps to address them must therefore link the physical, in all its seemingly intractable vastness, to the logical. This linkage — which is the function of data classification within data lineage — must be automated to the maximum degree possible. At the same time, it must be intelligent.