
The 7 lies of Data Catalog Providers – #2 A Data Catalog is NOT a Data Quality Management Solution

June 21, 2021

The Data Catalog market has developed rapidly, and a data catalog is now deemed essential to any data-driven strategy. A victim of its own success, this market has attracted a number of players from adjacent markets.

These players have reworked their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Data Quality Management (DQM) Solution

At Zeenea, we do not underestimate the importance of data quality in successfully delivering a data project; quite the contrary. It simply seems absurd to us to entrust it to a solution that, by its very nature, cannot perform the controls at the right time.

Let us explain.

There is a very elementary rule of quality control, one that applies in virtually any domain where quality is an issue, be it an industrial production chain, software development, or the kitchen of a 5-star restaurant: the sooner a problem is detected, the less it costs to correct.

To illustrate the point: a car manufacturer does not wait until a new vehicle is fully built, with all production costs already incurred, to test its battery, because that is when fixing a defect would cost the most. Instead, each part is closely inspected, each step of production is tested, defective parts are removed before they ever enter the production line, and the entire line can be halted if quality issues are detected at any stage. Quality issues are corrected at the earliest possible stage of the production process, where the fix is least costly and most durable.

“In a modern data organization, data production rests on the same principles. We are dealing with an assembly chain whose aim is to provide usage with high added value. Quality control and correction must happen at each step. The nature and level of controls will depend on what the data is used for.”


If you handle data, you have pipelines that feed your use cases. These pipelines can involve dozens of steps: data acquisition, data cleaning, various transformations, blending of several data sources, and so on.

To build these pipelines, you probably have a number of technologies at play, anything from in-house scripts to costly ETLs and exotic middleware tools. It is within those pipelines that you need to insert and steer your quality controls, as early as possible, adapting them to what is at stake with the end product. Measuring data quality only at the end of the chain isn't just absurd, it's totally inefficient.
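As a rough illustration of controls placed inside the pipeline, here is a minimal Python sketch in which every step is followed by its own check and the run stops at the first failure. All names (`run_pipeline`, `QualityGateError`, the three steps and their checks, the field names) are hypothetical and chosen only for this example; a real pipeline would rely on whatever ETL or orchestration tooling is already in place.

```python
class QualityGateError(Exception):
    """Raised when the output of a pipeline step fails its quality check."""


def run_pipeline(rows, stages):
    """Run each (name, step, check) in order and stop at the first failed check."""
    for name, step, check in stages:
        rows = step(rows)
        problems = check(rows)
        if problems:
            # Fail here, before later steps spend effort on bad data.
            raise QualityGateError(f"{name}: {'; '.join(problems)}")
    return rows


# Hypothetical steps and checks, for illustration only.
def acquire(rows):
    return list(rows)


def check_not_empty(rows):
    return [] if rows else ["no rows acquired"]


def clean(rows):
    # Strip stray whitespace from customer ids.
    return [{**r, "customer_id": r["customer_id"].strip()} for r in rows]


def check_ids_present(rows):
    missing = sum(1 for r in rows if not r["customer_id"])
    return [f"{missing} row(s) without customer_id"] if missing else []


def to_euros(rows):
    return [{**r, "amount_eur": r["amount_cents"] / 100} for r in rows]


def check_amounts(rows):
    negative = sum(1 for r in rows if r["amount_eur"] < 0)
    return [f"{negative} negative amount(s)"] if negative else []


STAGES = [
    ("acquire", acquire, check_not_empty),
    ("clean", clean, check_ids_present),
    ("transform", to_euros, check_amounts),
]
```

With this layout, a row whose customer id is blank is rejected right after the cleaning step, and the transformation step never runs on it. How strict each check should be depends on what the data is used for.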

It is therefore difficult to see how a Data Catalog (whose purpose is to inventory and document all potentially usable datasets in order to facilitate data discovery and usage) can be a useful tool to measure and manage quality.

A Data Catalog operates on available datasets, on any system that contains data, and should be as non-invasive as possible in order to be deployed quickly throughout the organization.

A DQM solution works on the data feeds (the pipelines), focuses on production data and is, by design, intrusive and time-consuming to deploy. I cannot think of any software architecture that can tackle both issues without compromising the quality of either one.

Data Catalog vendors promising to solve your data quality issues are, in our opinion, in a bind and it seems unlikely they can go beyond a “salesy” demo.


As for DQM vendors (who also often sell ETLs), their solutions are often too complex and costly to deploy as credible Data Catalogs.

The good news is that the orthogonal nature of data quality and data cataloging makes it easy for specialized solutions in each domain to coexist without encroaching on each other’s lane.

Indeed, while a data catalog is not designed for quality control, it can make use of information about the quality of the datasets it references, which brings several benefits.

The Data Catalog can use this metadata, for example, to share quality information (and any alerts that have been raised) with data consumers. It can also use this information to tune its search and recommendation engine and thus steer users towards higher quality datasets.
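To make the idea concrete, here is a small Python sketch of a search function that ranks matching datasets using quality metadata. It is not a description of any actual catalog product: the `CatalogEntry` fields, the 0-to-1 score, and the ranking rule (datasets with open alerts last, then highest score first, with unknown scores treated as a neutral 0.5) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    name: str
    description: str
    # Assumption: a 0.0-1.0 score reported by an external DQM tool, if any.
    quality_score: Optional[float] = None
    open_alerts: int = 0


def search(entries, keyword):
    """Return entries matching the keyword, best quality first."""
    keyword = keyword.lower()
    matches = [
        e for e in entries
        if keyword in e.name.lower() or keyword in e.description.lower()
    ]

    def rank_key(entry):
        # Assumption: a dataset with no known score is treated as neutral (0.5).
        score = entry.quality_score if entry.quality_score is not None else 0.5
        # Datasets with open alerts sort last, then by descending score.
        return (entry.open_alerts > 0, -score, entry.name)

    return sorted(matches, key=rank_key)
```

The catalog computes nothing about quality here; it only reads scores and alerts produced elsewhere and lets them influence the order of the results.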

And both solutions can be integrated at little cost with a couple of APIs here and there.
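As a sketch of what such an integration might look like, the Python example below has the DQM side export a small JSON report and the catalog side attach it to an existing entry. The payload layout (`dataset`, `score`, `failed_checks`) and both function names are hypothetical; real products expose their own APIs and formats.

```python
import json


def export_quality_report(dataset, checks):
    """DQM side: serialize check results, given a dict of check name -> passed."""
    passed = sum(checks.values())
    return json.dumps({
        "dataset": dataset,
        "score": round(passed / len(checks), 2) if checks else None,
        "failed_checks": sorted(name for name, ok in checks.items() if not ok),
    })


def ingest_quality_report(catalog, payload):
    """Catalog side: attach the report to a dataset the catalog already knows."""
    report = json.loads(payload)
    entry = catalog.get(report["dataset"])
    if entry is None:
        # Assumption: quality reports never create catalog entries.
        return False
    entry["quality"] = {
        "score": report["score"],
        "failed_checks": report["failed_checks"],
    }
    return True
```

The controls stay in the DQM tool and the pipeline; the catalog only stores and displays their results.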


Take Away

Data quality needs to be assessed as early as possible in the pipeline feeds.

The role of the Data Catalog is not to perform quality control but to share the results of these controls as widely as possible. By their nature, Data Catalogs are bad DQM solutions, and DQM solutions are mediocre and overly complex Data Catalogs.

An integration between a DQM solution and a Data Catalog is very straightforward and is the most pragmatic approach.

