In our previous article, we broke down Data Lineage by presenting the different lineage typologies (physical layer, business layer and semantic layer) and the different levels of granularity (values, fields, datasets, application).
In this article, we will present our matrix to help you concentrate your efforts and resources where the value of Data Lineage is strongest for your different teams.
Our business-centered matrix
To fully understand Zeenea’s approach to data lineage, which is centered on the business teams within the company, please read our article on the breakdown of data lineage.
The different business profiles in the organization
We have categorized the populations who wish to leverage the value of Data Lineage in an organization into 4 broad categories:
- IT: The engineers and architects responsible for developing and maintaining the infrastructure, flows and data applications.
- Analytics: The teams in charge of analyzing data, building indicators, dashboards, reports, etc.
- Business: All the people in charge of conceiving and working on the uses and functional applications around the data – project managers, product managers, business analysts, etc.
- Compliance: The teams responsible for regulatory compliance, security, internal control, etc.
Added value of Data Lineage according to the business profile
The following matrix summarizes the value added of Data Lineage for the different combinations of typology, granularity and business profile.
Bearing in mind that lineage at a higher level can be deduced from lineage at the level below it, this matrix could tempt one to make field-level lineage the objective: it is at this level that the benefits are the most obvious, and from this level that lineage can be produced automatically for the levels above.
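The deduction of higher-level lineage from field-level lineage can be sketched in a few lines. This is a minimal illustration, not a description of any product: the field names, the "dataset.field" naming convention and the `roll_up` helper are all assumptions made for the example.

```python
# Hypothetical field-level lineage: (source field, target field) pairs.
# Each field is written "<dataset>.<field>"; all names are illustrative.
field_edges = [
    ("crm.customers.email", "dwh.dim_customer.email"),
    ("crm.customers.id", "dwh.dim_customer.customer_id"),
    ("dwh.dim_customer.customer_id", "mart.sales_report.customer_id"),
    ("erp.orders.amount", "mart.sales_report.revenue"),
]


def dataset_of(field: str) -> str:
    """Drop the last path segment to get the dataset that owns the field."""
    return field.rsplit(".", 1)[0]


def roll_up(edges):
    """Derive dataset-level lineage from field-level lineage."""
    result = set()
    for source, target in edges:
        source_ds, target_ds = dataset_of(source), dataset_of(target)
        if source_ds != target_ds:  # ignore links internal to one dataset
            result.add((source_ds, target_ds))
    return sorted(result)


dataset_edges = roll_up(field_edges)
for source_ds, target_ds in dataset_edges:
    print(f"{source_ds} -> {target_ds}")
```

Four field-level links collapse into three dataset-level links here. The same grouping step, applied to a dataset-to-application mapping, would yield application-level lineage.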
Of course, things are not that simple!
While there are many benefits to field-to-field lineage, it has one major drawback: its cost. Whatever the lineage layer being considered, the production and maintenance cost will depend mainly on two variables: the volume (number of objects taken into account and number of links between them) and the ability to automate the retrieval and updating of this information.
On both these aspects, field-to-field lineage clearly presents the most unfavorable profile…
The limits of field-to-field lineage: huge volumes of information
Concerning the volume, it is easy to understand that the number of materialized fields in an information system, even a modest-sized one, easily reaches tens of thousands, if not hundreds of thousands or even millions. Maintaining the lineage information manually on such a volume of objects is not feasible. The only feasible solution is therefore automation on a large scale.
Limited automation capabilities
In theory, field-to-field technical lineage can be automated by inspecting the different processing stages, from the initial capture of the data to its final uses. In practice, this automation runs up against the sheer heterogeneity of data integration and processing solutions. Some vendors offer solutions to perform these operations.
We confess: we don’t believe in those solutions, for two reasons. First, reverse engineering is a delicate operation and its reliability cannot be 100% guaranteed. Second, the range of solutions and languages used in data pipelines is too vast, and the constant innovation in this field makes it difficult for a commercial solution to guarantee full coverage of all the technologies implemented in a given environment.
Field-by-field granularity is attractive, but out of reach in practice.
Our approach for optimized Data Lineage
The pivot: the physical layer at the dataset level
If we go back to the matrix presented above, it appears that the value of lineage at the dataset level is very close to that of field-to-field lineage.
For IT, business and analytics profiles, the value is in most cases very similar. The main difference arises with compliance. For most standards, the lineage documentation requirement relates to fields. But compliance does not apply to all data in the organization, only to the data considered critical data elements (CDEs).
There are different types of CDEs – personal data, sensitive data, risk data, etc. But they have the advantage of constituting only a minute percentage of all the data, often a few dozen or a few hundred fields whose downstream or upstream lineage must be provided.
Going forward, here is the general approach we favor for the physical layer:
- Focus the effort on the lineage at the dataset level and strive for the most advanced automation possible.
- Associate datasets (and other physical objects on the same level) with the applications to which they are attached. This operation is generally easy to automate, globally stable over time, and can, at worst, be managed manually in the catalog.
- Fill in locally with field-to-field lineage focusing on the CDEs – this can be automated (if possible), but can also rely on periodic review processes which are commonplace in regulatory frameworks.
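The three points above can be sketched as three small structures: a dataset-level graph, a dataset-to-application mapping, and a field-level graph restricted to CDEs. This is a minimal sketch under stated assumptions; every dataset, field and application name, as well as the `upstream` and `upstream_applications` helpers, is hypothetical.

```python
from collections import deque

# 1. Dataset-level lineage (the pivot): dataset -> direct upstream datasets.
dataset_upstream = {
    "dwh.dim_customer": {"crm.customers"},
    "mart.sales_report": {"dwh.dim_customer", "erp.orders"},
}

# 2. Each dataset is attached to an application (hypothetical names).
dataset_app = {
    "crm.customers": "CRM",
    "erp.orders": "ERP",
    "dwh.dim_customer": "Data Warehouse",
    "mart.sales_report": "Reporting",
}

# 3. Field-level lineage is maintained only for CDEs:
#    field -> direct upstream fields.
cde_field_upstream = {
    "mart.sales_report.customer_email": {"dwh.dim_customer.email"},
    "dwh.dim_customer.email": {"crm.customers.email"},
}


def upstream(node, graph):
    """Return every transitive upstream node, using a breadth-first walk."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in graph.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


def upstream_applications(dataset):
    """Applications that feed a dataset, deduced from dataset-level lineage."""
    return {dataset_app[d] for d in upstream(dataset, dataset_upstream)
            if d in dataset_app}


print(upstream("mart.sales_report", dataset_upstream))
print(upstream_applications("mart.sales_report"))
print(upstream("mart.sales_report.customer_email", cde_field_upstream))
```

The same traversal serves both graphs. Only the amount of information to maintain differs: every dataset appears in the first graph, whereas the field-level graph holds the few CDEs that compliance requires.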
Business and semantic layers of lineage
As for the other layers (business and semantic), the approach is significantly different. Indeed, in this area, automation is hardly possible. Business lineage and semantic lineage will therefore probably have to be managed manually.
For business lineage, we propose a top-down approach. This means that the first task should be devoted to defining the business lineage at the application level. The datasets and fields contained in the applications will inherit this business lineage. We should also be able to define the business lineage at a finer level, but only when a use case justifies it.
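The inheritance rule described here can be illustrated with a simple lookup that checks the most specific level first. This is only a sketch: business lineage is reduced to a single label per object, and the process names, the `overrides` table and the `business_lineage` function are assumptions made for the example.

```python
from typing import Optional

# Business lineage defined manually at the application level
# (labels are hypothetical business processes).
app_business_lineage = {
    "CRM": "Customer onboarding",
    "ERP": "Order-to-cash",
}

# Datasets attached to their application.
dataset_app = {
    "crm.customers": "CRM",
    "erp.orders": "ERP",
}

# Finer-grained definitions, added only when a use case justifies them.
overrides = {
    "erp.orders.refund_amount": "Returns handling",
}


def business_lineage(obj: str) -> Optional[str]:
    """Resolve business lineage for a field or dataset, top-down.

    The most specific definition wins: field override, then dataset
    override, then the value inherited from the application.
    """
    if obj in overrides:
        return overrides[obj]
    # A known dataset is used as-is; otherwise obj is assumed to be a field.
    dataset = obj if obj in dataset_app else obj.rsplit(".", 1)[0]
    if dataset in overrides:
        return overrides[dataset]
    app = dataset_app.get(dataset)
    return app_business_lineage.get(app)


print(business_lineage("crm.customers.email"))
print(business_lineage("erp.orders"))
print(business_lineage("erp.orders.refund_amount"))
```

With this layout, only the application-level table has to be filled in up front. The overrides table stays empty until a specific use case calls for more detail.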
For the semantic layer, things are a little different. Indeed, a specific effort is necessary to build the glossary. This (modeling) effort will be more or less important depending on the size of your data landscape, and the prior existence of models that can be imported or integrated into the catalog.
The natural anchor point of the semantic model on the physical layer of the lineage is at the field level. But again, automation is impractical – you probably don’t have a system that systematically references the meaning of each field in all your systems.
The association between the fields of the physical layer and the definitions of the semantic layer will therefore have to be done manually, which again represents a time-consuming task if you want to do it thoroughly.
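Because this association is manual, it can help to track how much of it has been done. The sketch below assumes a hand-maintained mapping from fields to glossary terms and reports the fields that still lack a definition. The glossary entries, field names and variable names are purely illustrative.

```python
# Hypothetical glossary: term identifier -> business definition.
glossary = {
    "customer_email": "Email address used to contact the customer.",
    "order_amount": "Total amount of an order, taxes included.",
}

# Manually maintained association between physical fields and glossary terms.
field_to_term = {
    "crm.customers.email": "customer_email",
    "dwh.dim_customer.email": "customer_email",
}

# Fields known to the catalog (illustrative inventory).
all_fields = [
    "crm.customers.email",
    "crm.customers.id",
    "dwh.dim_customer.email",
    "erp.orders.amount",
]

# Fields that still need to be linked to a definition by hand.
unmapped = [f for f in all_fields if f not in field_to_term]

# Share of fields already linked to a glossary term.
coverage = 1 - len(unmapped) / len(all_fields)

print(f"Coverage: {coverage:.0%}")
for field in unmapped:
    print(f"Missing definition: {field}")
```

A list like `unmapped` can then be prioritized, for instance by starting with the fields identified as CDEs.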
Data Lineage is a complex concept, which can be broken down into several layers (physical, business and semantic) and several levels of granularity (value, field, dataset, application).
The value of the lineage can also be represented in the form of a matrix that is very dependent on use cases, and the populations that exploit it. The cost of production and maintenance of lineage information is a function of the automation capacity and the volume of objects at the level considered.