The 7 lies of Data Catalog Providers – #6 A Data Catalog must rely on automation!

July 9, 2021

The Data Catalog market has developed rapidly, and a Data Catalog is now deemed essential when deploying a data-driven strategy. A victim of its own success, this market has attracted a number of players from adjacent markets.

These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that these companies are relatively weak on data catalog functionality itself. They nonetheless attempt to convince buyers, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams but an integrated solution that can address a host of other topics.

A Data Catalog must rely on automation!

Some Data Catalog vendors, who hail from the world of data mapping, have developed the rhetoric that automation is a secondary topic that can be addressed at a later stage.

They will tell you that a few manual file imports suffice, along with a generous user community collaborating on their tool to feed and use the catalog. A little arithmetic is enough to understand why this approach is doomed to failure in a data-centric organization.

An active Data Lake, even a modest one, quickly hoovers up hundreds or even thousands of datasets across its different layers. To these must be added the datasets from other systems (application databases, various APIs, CRMs, ERPs, NoSQL stores, etc.) that we usually want to integrate into the catalog.

The orders of magnitude quickly go beyond thousands, sometimes tens of thousands, of datasets. Each dataset contains dozens of fields. Datasets and fields alone represent several hundred thousand objects (we could also include other assets: ML models, dashboards, reports, etc.). For the catalog to be useful, inventorying those objects isn’t enough.

You also need to attach to them all the properties (metadata) that will enable end users to find, understand, and exploit these assets. There are several types of metadata: technical information, business classification, semantics, security, sensitivity, quality, norms, uses, popularity, contacts, etc. Here again, for each asset, there are dozens of properties.

Back to the arithmetic: overall, we are dealing with millions of attributes that need to be maintained.
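To make the orders of magnitude concrete, here is a back-of-the-envelope sketch. The three input figures are illustrative assumptions picked from the ranges mentioned above ("tens of thousands of datasets", "dozens of fields", "dozens of properties"), not measurements from any real catalog.

```python
# Illustrative orders of magnitude only; every figure below is an assumption.
datasets = 10_000             # "sometimes tens of thousands of datasets"
fields_per_dataset = 30       # "dozens of fields"
properties_per_object = 20    # "dozens of properties" per asset

# Objects to inventory: the datasets themselves plus all of their fields.
objects = datasets + datasets * fields_per_dataset

# Attributes to maintain: every object carries its own set of properties.
attributes = objects * properties_per_object

print(f"objects to inventory:   {objects:,}")
print(f"attributes to maintain: {attributes:,}")
```

With these inputs the inventory holds 310,000 objects and 6,200,000 attributes, before counting dashboards, reports, or ML models.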

Such volumes alone should disqualify any temptation to choose the manual approach. But there is more. The stock of informational assets isn’t static. It is constantly growing. In a data-centric organization, datasets are created daily, others are moved or changed.

The Data Catalog needs to reflect these changes.

Otherwise, its content will be permanently obsolete and the end users will reject it. Who is going to trust a Data Catalog that is incomplete and wrong? If you feel that your organization can absorb the load and keep your catalog up to date, that’s wonderful. Otherwise, we would suggest you assess, as early as possible, the level of automation provided by the different solutions you are looking at.

What can we automate in a Data Catalog?

In terms of automation, the most important capacity is the inventory.

A Data Catalog should be able to regularly scan all your data sources and automatically update the asset inventory (datasets, structures and technical metadata at a minimum) to reflect the day-to-day reality of the hosting systems.

Believe us: a Data Catalog that cannot connect to your data sources will quickly become useless, because its content will always be in doubt.
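As a rough illustration of what an automated inventory update involves, the sketch below compares what the catalog currently holds with the result of a fresh scan and reports what was added, removed, or changed. The `Dataset` type, the `diff_inventory` function, and the sample data are hypothetical; a real scanner would obtain `scanned` by querying the source system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical minimal technical metadata for one dataset."""
    name: str
    fields: tuple[str, ...]


def diff_inventory(catalog: dict[str, Dataset], scanned: dict[str, Dataset]):
    """Compare the catalog's inventory with a fresh scan of a data source."""
    added = sorted(scanned.keys() - catalog.keys())
    removed = sorted(catalog.keys() - scanned.keys())
    changed = sorted(
        name for name in catalog.keys() & scanned.keys()
        if catalog[name] != scanned[name]
    )
    return added, removed, changed


# What the catalog knew yesterday (assumed sample data).
catalog = {
    "orders": Dataset("orders", ("id", "amount")),
    "legacy": Dataset("legacy", ("id",)),
}
# What today's scan of the source finds (assumed sample data).
scanned = {
    "orders": Dataset("orders", ("id", "amount", "currency")),
    "customers": Dataset("customers", ("id", "email")),
}

added, removed, changed = diff_inventory(catalog, scanned)
print("added:", added, "removed:", removed, "changed:", changed)
```

Run on a schedule, a comparison of this kind is what lets the catalog follow new, dropped, and altered datasets without anyone re-importing files by hand.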

Once the inventory is completed, the next challenge is to automate the metamodel feed.

Here, beyond the technical metadata, complete automation seems a little hard to imagine. It is still possible to significantly reduce the necessary workload for the maintenance of the metamodel. The value of certain properties can be determined by simply applying rules at the time of the integration of the objects in the catalog.
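A minimal sketch of such rules, assuming assets are plain dictionaries and each rule is a (condition, property, value) triple. The property names, paths, and rules are invented for illustration.

```python
def apply_rules(asset: dict, rules) -> dict:
    """Return a copy of the asset with rule-derived properties filled in."""
    enriched = dict(asset)
    for condition, prop, value in rules:
        if condition(asset):
            # setdefault keeps any value already present on the asset,
            # e.g. one entered by a data steward.
            enriched.setdefault(prop, value)
    return enriched


# Hypothetical rules, evaluated when an object enters the catalog.
RULES = [
    (lambda a: a["path"].startswith("/lake/raw/"), "layer", "raw"),
    (lambda a: a["source"] == "crm", "domain", "customer"),
]

asset = {"path": "/lake/raw/contacts", "source": "crm"}
print(apply_rules(asset, RULES))
```

Leaving existing values untouched is a deliberate choice in this sketch: rules provide defaults, and people keep the last word.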

It is also possible to suggest property values using more or less sophisticated algorithms (semantic analysis, pattern matching, etc.).
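At the simpler end of that spectrum, pattern matching over sampled values can propose a tag for a field. The sketch below is an assumption-laden illustration: the two patterns are deliberately loose, the 0.8 threshold is arbitrary, and the result is a suggestion for a human to confirm rather than a definitive classification.

```python
import re

# Deliberately loose, illustrative patterns (not production-grade validators).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}


def suggest_tag(samples, threshold=0.8):
    """Suggest a tag when enough non-empty sample values match one pattern."""
    values = [s for s in samples if s]
    if not values:
        return None
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return tag
    return None


print(suggest_tag(["a@example.com", "b@example.org", "c@example.net"]))
print(suggest_tag(["hello", "world"]))
```

The first call suggests `email`; the second returns `None` because no pattern covers enough of the samples.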

Lastly, it’s often possible to feed part of the catalog by integrating the systems that produce or contain metadata. This applies, for instance, to quality measurements, lineage information, business ontologies, etc.

For this approach to work, the Data Catalog must be open and offer a complete set of APIs that allow the metadata to be updated from other systems.
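The sketch below shows the kind of update such an API would need to accept, using an in-memory dictionary as a stand-in for the catalog. The function name, the payload shape, and the "quality-tool" source are all hypothetical; no real product's API is depicted.

```python
def merge_external_metadata(catalog: dict, updates: dict, source: str) -> list:
    """Apply property updates pushed by another system.

    Returns the ids of assets the catalog does not know about, so the
    caller can report them instead of silently creating entries.
    """
    unknown = []
    for asset_id, properties in updates.items():
        entry = catalog.get(asset_id)
        if entry is None:
            unknown.append(asset_id)
            continue
        for key, value in properties.items():
            # Record where each value came from, so users can judge it.
            entry[key] = {"value": value, "source": source}
    return unknown


# Stand-in catalog and an update from a hypothetical data quality tool.
catalog = {"orders": {}}
updates = {"orders": {"completeness": 0.97}, "ghost": {"completeness": 0.5}}

unknown = merge_external_metadata(catalog, updates, source="quality-tool")
print(catalog["orders"], "unknown assets:", unknown)
```

Keeping the originating system alongside each value is one possible design; it helps end users decide how much to trust a property they did not enter themselves.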

Take Away

A Data Catalog handles millions of pieces of information in a constantly shifting landscape.

Maintaining this information manually is virtually impossible, or extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.

Download our eBook: The 7 lies of Data Catalog Providers for more!


The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.
