As a concept, Data Lineage seems universal: whatever the industry, any stakeholder in a data-driven organization needs to know the origin (upstream lineage) and the destination (downstream lineage) of the data they are handling or interpreting. And this need is driven by important underlying motives.
For a Data Catalog vendor, the ability to manage Data Lineage is crucial to its offering. As is often the case, however, behind a simple and universal question lies a world of complexity that is difficult to grasp. This complexity is partly linked to the heterogeneity of the answers, which vary from one stakeholder to another within the company.
In this article, we will explain our approach to breaking down data lineage according to the nature of the information sought and its granularity.
The typology of Data Lineage: seeking the origin of data
There are many possible answers as to the origin of any given piece of data. Some will want to know the exact formula or semantics behind it. Others will want to know which system(s), application(s), machine(s), or factory it comes from. Some will be interested in the business or operational processes that produced it. Others still will be interested in the entire upstream and downstream technical processing chain. It is difficult to sort through this maze of considerations!
A layer approach
To structure lineage information, we suggest emulating what is practiced in the field of geo-mapping by distinguishing several superimposable layers. We can identify three:
- The physical layer, which includes the objects of the information system – applications, systems, databases, data sets, integration or transformation programs, etc.
- The business layer, which contains the organizational elements – domains, business processes or activities, entities, managers, controls, committees, etc.
- The semantic layer, which deals with the meaning of the data – calculation formulas, definitions, ontologies, etc.
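To make this layer model concrete, here is a minimal sketch of how the three layers could be represented, with the physical layer as the canvas and the other two superimposed on it. All names (PhysicalNode, BusinessContext, SemanticContext) are hypothetical illustrations, not a reference to any particular catalog's API:

```python
from dataclasses import dataclass, field

# Minimal, hypothetical model of the three superimposable layers.

@dataclass
class PhysicalNode:
    """An object of the information system (table, file, ETL job, ...)."""
    name: str
    kind: str                      # e.g. "table", "file", "etl_job"
    upstream: list["PhysicalNode"] = field(default_factory=list)

@dataclass
class BusinessContext:
    """Organizational elements (business layer)."""
    domain: str
    process: str
    owner: str

@dataclass
class SemanticContext:
    """Meaning of the data (semantic layer)."""
    definition: str
    formula: str | None = None

# The business and semantic layers are superimposed on physical nodes,
# keyed here by the node's name.
raw = PhysicalNode("sales_raw", "file")
curated = PhysicalNode("sales_curated", "table", upstream=[raw])
business = {"sales_curated": BusinessContext("Sales", "Monthly reporting", "Jane Doe")}
semantics = {"sales_curated": SemanticContext("Net sales after returns")}
```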
A focus on the physical layer
The physical layer is the basic canvas on which all the other layers can be anchored. This approach is again similar to what is practiced in geo-mapping: above the physical map, it is possible to superimpose other layers carrying specific information.
The physical layer represents the technical dimension of the lineage; it is materialized by tangible technical artifacts – databases, file systems, integration middleware, BI tools, scripts and programs, etc. In theory, the structure of the physical lineage can be extracted from these systems, so its construction can be largely automated, which is generally not the case for the other layers.
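To illustrate why this automation is possible, here is a deliberately naive sketch: a SQL statement, a typical physical artifact, carries enough structure for a program to extract a lineage edge from it. The table names are invented, and real extractors rely on full SQL parsers rather than regular expressions:

```python
import re

# The physical artifact itself (here, a SQL script) contains the
# information needed to reconstruct a source -> target lineage edge.
sql = "INSERT INTO mart.daily_sales SELECT order_id, amount FROM staging.orders"

match = re.search(r"INSERT\s+INTO\s+(\S+).*?\bFROM\s+(\S+)", sql, re.IGNORECASE)
if match:
    target, source = match.groups()
    print(f"physical lineage edge: {source} -> {target}")
    # physical lineage edge: staging.orders -> mart.daily_sales
```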
The following point seems fundamental: for this bottom-up approach to work, the physical lineage must be complete.
This does not mean that the lineage of all physical objects must be available, but for the objects that do have lineage, that lineage must be complete. There are two reasons for this. The first is that a partial (and therefore false) lineage risks misleading the person who consults it, jeopardizing the adoption of the catalog. The second is that the physical layer serves as an anchor for the other layers, which means any shortcomings in its lineage will propagate to them.
In addition to this layer-by-layer representation, let’s address another fundamental aspect of lineage: its granularity.
Granularity in Data Lineage
When it comes to lineage granularity, we identify four distinct levels: values, fields (or columns), datasets, and applications.
The value level can be addressed quickly. Its purpose is to track all the steps taken to calculate any particular value (we are referring to specific values, not the definition of the data). For mark-to-model pricing applications, for example, the lineage of a price must include all the raw data (timestamp, vendor, value), the values derived from this raw data, as well as the versions of all the algorithms used in the calculation.
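As an illustration, here is a hypothetical record structure for such value-level lineage in a mark-to-model context; the type and field names are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical value-level lineage record: each computed price keeps
# its raw inputs and the exact algorithm version used to derive it.

@dataclass(frozen=True)
class RawQuote:
    timestamp: datetime
    vendor: str
    value: float

@dataclass(frozen=True)
class PricedValue:
    value: float
    inputs: tuple[RawQuote, ...]   # raw data the price was derived from
    algorithm: str                 # e.g. "black_scholes" (illustrative)
    algorithm_version: str         # version actually used for this value

quote = RawQuote(datetime(2024, 1, 2, 9, 30), "VendorA", 101.25)
price = PricedValue(100.97, inputs=(quote,), algorithm="black_scholes",
                    algorithm_version="2.3.1")
```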
Regulatory requirements for this level exist in many fields (banking, finance, insurance, healthcare, pharmaceuticals, IoT, etc.), but usually in a very localized way. They are clearly out of reach of a data catalog, in which it is difficult to imagine managing every data value! Meeting these requirements calls for either a specialized software package or a custom development.
The other three levels deal with metadata, and are clearly in the remit of a data catalog. Let’s detail them quickly.
The field level is the most detailed level. It consists of tracing all the steps (at the physical, business, or semantic level) that enable a given field to be populated in a dataset (table or file), a report, a dashboard, etc.
At the dataset level, the lineage is no longer defined for each field but at the level of the field container, which can be a table in a database, a file in a data lake, an API, etc. At this level, we represent the steps that populate the dataset as a whole, typically from other datasets (this level also includes other artifacts such as reports, dashboards, ML models, or even algorithms).
Finally, the application level enables the lineage to be documented macroscopically, focusing on high-level logical elements of the information system. The term “application” is used here in a generic way to designate a functional grouping of several datasets.
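The three metadata levels can be pictured as nested containers: an application groups datasets, and a dataset groups fields. A minimal sketch, with purely illustrative names:

```python
from dataclasses import dataclass, field

# Hypothetical containment model for the three metadata granularity
# levels: field within dataset within application.

@dataclass
class Field:
    name: str

@dataclass
class Dataset:
    name: str
    fields: list[Field] = field(default_factory=list)

@dataclass
class Application:
    name: str
    datasets: list[Dataset] = field(default_factory=list)

crm = Application("CRM", datasets=[
    Dataset("customers", fields=[Field("id"), Field("region")]),
    Dataset("orders",    fields=[Field("id"), Field("amount")]),
])
```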
It is of course possible to imagine other levels beyond these three (grouping applications into business domains, for example), but adding such levels is more a matter of flow mapping than of lineage.
Finally, it is important to keep in mind that each level is intertwined with the level above it. This means the lineage at a higher level can be worked out from the lineage at the lower level (if I know the lineage of all the fields of a dataset, then I can infer the lineage of that dataset), as the sketch below illustrates.
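Here is a minimal sketch of that inference, under the simplifying assumption that field-level lineage is stored as a mapping from "dataset.field" to its source fields (all names are illustrative):

```python
# Field-level lineage: "dataset.field" -> source "dataset.field" entries.
field_lineage = {
    "sales_report.revenue": ["orders.amount", "orders.currency"],
    "sales_report.region":  ["customers.region"],
}

def dataset_lineage(dataset: str) -> set[str]:
    """Infer a dataset's upstream datasets from field-level lineage."""
    upstream: set[str] = set()
    for target, sources in field_lineage.items():
        if target.split(".")[0] == dataset:
            upstream.update(src.split(".")[0] for src in sources)
    return upstream

print(dataset_lineage("sales_report"))
# {'orders', 'customers'}  (dataset-level lineage; set order may vary)
```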
We hope that this breakdown of data lineage will help you better understand how to approach it in your organization. In a future article, we will share our approach so that each business can derive maximum value from lineage, thanks to our typology / granularity / business matrix.