What makes a data catalog “smart”? #3 – Metadata Management

February 16, 2022

A data catalog harnesses enormous amounts of very diverse information, and its volume will grow exponentially. This raises two major challenges:

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be Smart in order to answer these two questions, with technological and conceptual features that go beyond simply integrating AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

It is in the field of metadata management that the notion of the Smart Data Catalog is most commonly associated with algorithms, machine learning, and AI.

How is metadata management automated?

Metadata management is the discipline of assigning values to the metamodel attributes of the inventoried assets. The workload required is usually proportional to the number of attributes in the metamodel and to the number of assets in the catalog.

The role of the Smart Data Catalog is to automate this activity as much as possible, or at the very least to help the human operators (Data Stewards) carry it out with greater productivity and reliability.

As seen in our last article, a smart connectivity layer automates the collection of part of the metadata, but this automation is restricted to a limited subset of the metamodel, mostly technical metadata. A complete metamodel, even a modest one, also has dozens of metadata attributes that cannot be extracted from the source systems' registries (because they are not there to begin with).

To solve this equation, several approaches are possible:

Pattern recognition

The most direct approach is to identify patterns in the catalog in order to suggest metadata values for new assets.

Put simply, a pattern will include all the metadata of an asset and the metadata of its relations with other assets or other catalog entities. Pattern recognition is typically done with the help of machine learning algorithms.
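To make the idea concrete, here is a minimal sketch of pattern-based suggestion. It is not Zeenea's implementation: it assumes each asset is reduced to a set of descriptive features (the `col:` and `source:` labels, the `catalog` structure, and the `suggest_value` function are all hypothetical) and it swaps a real machine learning model for a simple nearest-neighbour vote.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_value(new_features, catalog, attribute, k=3):
    """Suggest a value for `attribute` by majority vote among the k most
    similar assets that already have this attribute documented."""
    scored = [
        (jaccard(new_features, asset["features"]), asset["metadata"][attribute])
        for asset in catalog
        if attribute in asset["metadata"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Assets with no overlap at all are not allowed to vote.
    votes = Counter(value for score, value in scored[:k] if score > 0)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical catalog content: features mix column names and relations.
catalog = [
    {"features": {"col:invoice_id", "col:amount", "source:erp"},
     "metadata": {"domain": "Finance"}},
    {"features": {"col:invoice_id", "col:due_date", "source:erp"},
     "metadata": {"domain": "Finance"}},
    {"features": {"col:employee_id", "col:salary", "source:hris"},
     "metadata": {"domain": "HR"}},
]

new_asset = {"col:invoice_id", "col:amount", "col:due_date", "source:erp"}
suggestion = suggest_value(new_asset, catalog, "domain")
print(suggestion)
```

The vote returns nothing when no documented asset overlaps with the new one, which is the situation of an empty or nearly empty catalog.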

The difficulty in implementing this approach lies in describing the information assets in numerical form, so that they can feed the algorithms and bring out the relevant patterns. A simple structural analysis is not enough: two datasets can contain identical data in different structures. Relying on identical data values does not work either: two datasets can hold the same kind of information with different values, for example 2020 client invoicing in one dataset and 2021 client invoicing in the other.

To solve this problem, Zeenea relies on a technology called fingerprinting. To build the fingerprint, we extract two types of features from our clients' data:

  • A group of features adapted to numerical data (mostly statistical indicators)
  • Features derived from word embedding models (word vectorization) for textual data.

Fingerprinting is at the heart of our intelligent algorithms.
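The article does not detail which features make up a fingerprint, so the following is only an illustrative sketch of the two feature families. The statistical indicators chosen here are assumptions, and the text part uses a hashed bag-of-words vector as a crude, dependency-free stand-in for a real word embedding model. All function names are hypothetical.

```python
import hashlib
import math
import statistics


def numeric_fingerprint(values):
    """Summarize a numeric column with a few statistical indicators."""
    return [
        statistics.fmean(values),   # mean
        statistics.pstdev(values),  # population standard deviation
        min(values),
        max(values),
    ]


def text_fingerprint(values, dims=16):
    """Hashed bag-of-words vector, normalized to unit length.
    Assumption: this replaces a word embedding model for the sake of the example."""
    vector = [0.0] * dims
    for value in values:
        for token in str(value).lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def fingerprint(values):
    """Dispatch on content: statistics for numbers, hashed tokens for text."""
    if not values:
        raise ValueError("cannot fingerprint an empty column")
    if all(isinstance(v, (int, float)) for v in values):
        return numeric_fingerprint(values)
    return text_fingerprint(values)


amounts_fp = fingerprint([120.0, 80.0, 100.0])
names_fp = fingerprint(["Acme Corp", "Globex"])
print(amounts_fp)
print(names_fp)
```

Because the fingerprint describes the shape of the data rather than its exact values or layout, two columns holding the same kind of information can end up close to each other even when their contents differ.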

Other approaches embedded in a suggestion engine

While pattern recognition is indeed an efficient approach for suggesting the metadata of a new asset in a catalog, it rests on an important prerequisite: in order to recognize a pattern, there has to be one to recognize. In other words, it only works if the catalog already contains a sufficient number of documented assets (which is obviously not the case at the start of a project).

And it’s precisely in these initial phases of a catalog project that the metadata management load is the highest. It is, therefore, crucial to include other approaches likely to help the Data Stewards in these initial phases, when a catalog is more or less empty…

The Zeenea suggestion engine, which provides intelligent algorithms to assist the management of the metadata, also provides other approaches (which we enrich regularly). 

Here are some of these approaches:

  • Structural similarity detection 
  • Fingerprint similarity detection
  • Name approximation 
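The three approaches listed above can each be approximated with a simple similarity score. The sketch below is an assumption about how such scores might look, not a description of the Zeenea engine: structural similarity is reduced to a Jaccard overlap of (column name, type) pairs, fingerprint similarity to a cosine similarity between vectors, and name approximation to the ratio returned by `difflib.SequenceMatcher` from the Python standard library. Function names and sample schemas are hypothetical.

```python
import math
from difflib import SequenceMatcher


def structural_similarity(schema_a, schema_b):
    """Jaccard overlap of (column name, type) pairs between two schemas."""
    a = {(name.lower(), dtype.lower()) for name, dtype in schema_a}
    b = {(name.lower(), dtype.lower()) for name, dtype in schema_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fingerprint_similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprint vectors of equal length."""
    dot = sum(x * y for x, y in zip(fp_a, fp_b))
    norm = math.sqrt(sum(x * x for x in fp_a)) * math.sqrt(sum(y * y for y in fp_b))
    return dot / norm if norm else 0.0


def _normalize(name):
    """Lowercase a name and treat underscores and hyphens as spaces."""
    return name.lower().replace("_", " ").replace("-", " ")


def name_similarity(name_a, name_b):
    """Approximate match between two asset names, from 0.0 to 1.0."""
    return SequenceMatcher(None, _normalize(name_a), _normalize(name_b)).ratio()


# Hypothetical schemas for two invoicing datasets.
invoices_2020 = [("invoice_id", "int"), ("amount", "decimal"), ("customer", "string")]
invoices_2021 = invoices_2020 + [("currency", "string")]

print(structural_similarity(invoices_2020, invoices_2021))
print(fingerprint_similarity([1.0, 0.0], [1.0, 0.0]))
print(name_similarity("client_invoices_2020", "Client-Invoices-2021"))
```

None of these scores needs a large, well-documented catalog to be useful, which is why measures of this kind can help in the early phases of a project, when there are few patterns to learn from.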

This suggestion engine, which analyzes the catalog content in order to determine the probable metadata values for the assets that have been integrated, is a permanent subject of experimentation. We regularly add new approaches, sometimes very simple and sometimes much more sophisticated. In our architecture, it is a dedicated service whose performance improves as the catalog grows and as we enrich our algorithms.

Zeenea has chosen lead time as its main metric for the productivity of the Data Stewards (improving that productivity is the ultimate objective of smart metadata management). Lead time is a notion that stems from lean management. In a data catalog context, it measures the time elapsed between the moment an asset is inventoried and the moment all of its metadata has been filled in.
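As an illustration of the metric, lead time can be computed from two timestamps per asset. The sketch below assumes hypothetical field names (`inventoried_at`, `completed_at`) and reports the median over completed assets. The choice of the median is an assumption made for the example, not something the article specifies.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(inventoried_at, completed_at):
    """Time elapsed between inventory and full documentation of one asset."""
    if completed_at < inventoried_at:
        raise ValueError("completion cannot precede inventory")
    return completed_at - inventoried_at


def median_lead_time(assets):
    """Median lead time over assets whose metadata is complete, or None if there are none."""
    seconds = [
        lead_time(a["inventoried_at"], a["completed_at"]).total_seconds()
        for a in assets
        if a.get("completed_at") is not None
    ]
    return timedelta(seconds=median(seconds)) if seconds else None


# Hypothetical assets: two fully documented, one still in progress.
assets = [
    {"inventoried_at": datetime(2022, 2, 1), "completed_at": datetime(2022, 2, 3)},
    {"inventoried_at": datetime(2022, 2, 1), "completed_at": datetime(2022, 2, 5)},
    {"inventoried_at": datetime(2022, 2, 10), "completed_at": None},
]

print(median_lead_time(assets))
```

Assets that are still incomplete are left out of the calculation here. Tracking how many of them remain open would be a natural complement to the metric.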


