A data catalog harnesses enormous amounts of highly diverse information, and that volume will grow exponentially. This raises two major challenges:
- How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
- How to find the most relevant datasets for any specific use case?
At Zeenea, we think that a data catalog should be Smart in order to answer these two questions, with smart technological and conceptual features that go beyond the mere integration of AI algorithms.
In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:
- Metamodeling
- The data inventory
- Metadata management
- The search engine
- User experience
It is in the field of metadata management that the notion of the Smart Data Catalog is most commonly associated with algorithms, machine learning, and AI.
How is metadata management automated?
Metadata management is the discipline of populating the metamodel attributes for the inventoried assets. The workload is usually proportional to the number of attributes in the metamodel multiplied by the number of assets in the catalog.
The role of the Smart Data Catalog is to automate this activity as much as possible, or at the very least to help the human operators (Data Stewards) do so in order to ensure greater productivity and reliability.
As seen in our last article, a smart connectivity layer can automate the collection of part of the metadata, but this automation is restricted to a limited subset of the metamodel, mostly technical metadata. A complete metamodel, even a modest one, also has dozens of metadata attributes that cannot be extracted from the source systems' registries (because they are not there to begin with).
To solve this problem, several approaches are possible.
The most direct approach is to identify patterns in the catalog in order to suggest metadata values for new assets.
Put simply, a pattern will include all the metadata of an asset and the metadata of its relations with other assets or other catalog entities. Pattern recognition is typically done with the help of machine learning algorithms.
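To make the idea concrete, here is a minimal sketch of pattern-based suggestion, assuming each asset has already been reduced to a numerical vector. It uses a simple nearest-neighbor vote rather than any specific Zeenea algorithm; the `suggest_value` function, the `vector` and `metadata` fields, and the sample assets are all hypothetical.

```python
from collections import Counter
from math import dist

def suggest_value(new_vector, catalog, attribute, k=3):
    """Suggest a value for `attribute` by majority vote among the k nearest assets.

    Returns (value, share_of_votes), or None when no documented asset exists.
    """
    # Keep only assets that already carry a value for the attribute.
    labeled = [
        (dist(new_vector, asset["vector"]), asset["metadata"][attribute])
        for asset in catalog
        if attribute in asset["metadata"]
    ]
    if not labeled:
        return None
    # The smallest Euclidean distances are the closest matches.
    nearest = sorted(labeled, key=lambda pair: pair[0])[:k]
    votes = Counter(value for _, value in nearest)
    value, count = votes.most_common(1)[0]
    return value, count / len(nearest)

# Hypothetical catalog content, for illustration only.
catalog = [
    {"name": "invoices_2020", "vector": [0.9, 0.1, 0.0], "metadata": {"domain": "Finance"}},
    {"name": "invoices_2021", "vector": [0.8, 0.2, 0.1], "metadata": {"domain": "Finance"}},
    {"name": "web_clicks", "vector": [0.0, 0.1, 0.9], "metadata": {"domain": "Marketing"}},
]

suggestion = suggest_value([0.85, 0.15, 0.05], catalog, "domain", k=2)
```

With this sample data, the two nearest assets are both invoicing datasets, so the suggested domain is "Finance". The share of votes can serve as a rough confidence indicator for the Data Steward, who remains free to accept or reject the suggestion.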
The difficulty in implementing this approach lies precisely in describing the information assets in numerical form, so as to feed the algorithms and select the relevant patterns. A simple structural analysis is not enough: two datasets can contain identical data in different structures. Relying on identical data values does not work either: two datasets can contain the same kind of information with different values, for example 2020 client invoicing in one dataset and 2021 client invoicing in the other.
In order to solve this problem, Zeenea relies on a technology called fingerprinting. To build the fingerprint, we extract two types of features from our clients’ data:
- A group of features adapted to numerical data (mostly statistical indicators)
- Features derived from word embedding models (word vectorization) for textual data.
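The two feature types above can be sketched as follows. This is an illustration under stated assumptions, not Zeenea's implementation: the choice of statistical indicators is arbitrary, and `text_features` uses a hashed bag of words as a toy stand-in for a real word embedding model. All function names are hypothetical.

```python
import hashlib
import math
import statistics

def numeric_features(values):
    """Statistical indicators summarizing a numeric column (illustrative choice)."""
    return [
        statistics.fmean(values),
        statistics.pstdev(values),
        statistics.median(values),
        float(min(values)),
        float(max(values)),
    ]

def text_features(strings, dim=8):
    """Toy stand-in for a word embedding: hashed bag of words, L2-normalized."""
    vector = [0.0] * dim
    for text in strings:
        for token in text.lower().split():
            # A stable hash keeps the bucket of each token identical across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

def fingerprint(column):
    """Build a fingerprint for one column, choosing features by data type."""
    is_numeric = bool(column) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )
    if is_numeric:
        return {"kind": "numeric", "vector": numeric_features(column)}
    return {"kind": "text", "vector": text_features([str(v) for v in column])}

# Hypothetical columns: the raw values differ, but the summaries stay close.
amounts_2020 = fingerprint([120.0, 80.5, 99.9, 150.0])
amounts_2021 = fingerprint([118.0, 83.0, 101.5, 149.0])
statuses = fingerprint(["Paid invoice", "Pending invoice"])
```

The point of such a representation is that two invoicing columns from different years produce nearby vectors even though no individual value matches, which is what a comparison of raw data would miss.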
Fingerprinting is at the heart of our intelligent algorithms.
The other approaches embedded in the suggestion engine
While pattern recognition is indeed an efficient approach for suggesting the metadata of a new asset in a catalog, it rests on an important prerequisite: in order to recognize a pattern, there has to be one to recognize. In other words, it only works if the catalog already contains a sufficient number of documented assets (which is obviously not the case at the start of a project).
And it’s precisely in these initial phases of a catalog project that the metadata management load is the highest. It is, therefore, crucial to include other approaches likely to help the Data Stewards in these initial phases, when a catalog is more or less empty…
The Zeenea suggestion engine, which provides intelligent algorithms to assist with metadata management, also offers other approaches (which we enrich regularly).
Here are some of these approaches:
- Structural similarity detection
- Fingerprint similarity detection
- Name approximation
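Each of the three approaches listed above can be read as a similarity score between a new asset and an existing one. The sketch below shows one plausible way to compute each score with the Python standard library. The specific measures (Jaccard, cosine, `difflib` ratio) are assumptions chosen for illustration, not a description of what the Zeenea engine actually uses.

```python
from difflib import SequenceMatcher
from math import sqrt

def structural_similarity(schema_a, schema_b):
    """Jaccard similarity between two sets of (column name, type) pairs."""
    a, b = set(schema_a), set(schema_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fingerprint_similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprint vectors of equal length."""
    dot = sum(x * y for x, y in zip(fp_a, fp_b))
    norm_a = sqrt(sum(x * x for x in fp_a))
    norm_b = sqrt(sum(y * y for y in fp_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def name_similarity(name_a, name_b):
    """Approximate, case-insensitive name matching."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

# Hypothetical schemas and names, for illustration only.
schema_2020 = [("id", "int"), ("amount", "float")]
schema_2021 = [("id", "int"), ("amount", "float"), ("due_date", "date")]

structure_score = structural_similarity(schema_2020, schema_2021)
name_score = name_similarity("customer_id", "cust_id")
```

None of these scores requires a large, well-documented catalog: a handful of existing assets is enough to compare against, which is why such approaches help in the early phases of a project.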
This suggestion engine, which analyzes the catalog content in order to determine the probable metadata values of newly integrated assets, is a permanent subject of experimentation. We regularly add new approaches, sometimes very simple and sometimes much more sophisticated. In our architecture, it is a dedicated service whose performance improves as the catalog grows and as we enrich our algorithms.
Zeenea has chosen lead time as its main metric for measuring the productivity of the Data Stewards (which is the ultimate objective of smart metadata management). Lead time is a notion that stems from lean management and which measures, in a data catalog context, the time elapsed between the moment an asset is inventoried and the moment all its metadata has been populated.
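As an illustration of this definition, lead time can be computed from two timestamps per asset. The sketch below assumes hypothetical `inventoried_at` and `completed_at` fields and reports the median over completed assets; how Zeenea aggregates the metric in practice is not specified here.

```python
from datetime import datetime
from statistics import median

def lead_time_days(inventoried_at, completed_at):
    """Time elapsed, in days, between inventory and full documentation."""
    return (completed_at - inventoried_at).total_seconds() / 86400

def median_lead_time_days(assets):
    """Median lead time over the assets whose metadata is complete."""
    durations = [
        lead_time_days(asset["inventoried_at"], asset["completed_at"])
        for asset in assets
        if asset["completed_at"] is not None
    ]
    return median(durations) if durations else None

# Hypothetical assets; the last one is still being documented.
assets = [
    {"inventoried_at": datetime(2021, 1, 1), "completed_at": datetime(2021, 1, 4)},
    {"inventoried_at": datetime(2021, 1, 2), "completed_at": datetime(2021, 1, 3)},
    {"inventoried_at": datetime(2021, 1, 5), "completed_at": None},
]

result = median_lead_time_days(assets)
```

Assets that are not yet fully documented are excluded from the calculation, so the backlog should be tracked separately to avoid an overly optimistic reading.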