
The 7 lies of Data Catalog Providers – #4 A Data Catalog is not a Query Solution!

July 2, 2021

The Data Catalog market has developed rapidly, and a Data Catalog is now deemed essential when deploying a data-driven strategy. A victim of its own success, this market has attracted a number of players from adjacent markets.

 These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

Here are, in our opinion, the 7 lies of the Data Catalog vendors:

  1. A Data Catalog is a Data Governance platform,
  2. A Data Catalog can measure and manage data quality,
  3. A Data Catalog can manage regulatory compliance,
  4. A Data Catalog can query data directly,
  5. A Data Catalog can model logical architecture and business processes around data,
  6. A Data Catalog is a collaborative cartography and metadata management tool that cannot be automated,
  7. A Data Catalog is a long, complex, and expensive project.

A Data Catalog is NOT a Query Solution

 

Here is another oddity of the Data Catalog market. Several vendors, whose initial aim was to allow users to query several data sources simultaneously, have “pivoted” towards a Data Catalog positioning.

There is a reason for them to pivot.

The emergence of Data Lakes and Big Data has cornered them in a technological cul-de-sac that has weakened the market segment they were initially in.

A Data Lake is typically segmented into several layers. The “raw” layer integrates data without transformation, in formats that are more or less structured and in great quantities. A second layer, which we’ll call “clean”, contains roughly the same data but in normalized formats, after a clean-up. After that, there can be one or several “business” layers ready for use: a data warehouse and visualization tool for analytics, a Spark cluster for data science, a storage system for commercial distribution, etc. Within these layers, data is transformed, aggregated and optimized for use, along with the tools supporting this use (data visualization tools, notebooks, massive processing, etc.).

 

In this landscape, a universal self-service query tool isn’t suitable.

 

It is of course possible to set up an SQL interpretation layer (such as Hive) on top of the “clean” layer, but query execution remains a domain for specialists: the volumes of data are huge and rarely indexed.

Allowing users to define their own queries is very risky. On on-premises systems, a single very expensive query can bring down the cluster. In the cloud, the bill can run very high indeed. Not to mention security and data sensitivity issues.
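To give a feel for the cost risk, here is a rough back-of-the-envelope sketch in Python. It assumes a pay-per-data-scanned pricing model; the function names, the table size and the price per terabyte are all hypothetical and chosen for illustration, not taken from any provider's price list. The point it illustrates is simple: when data is not indexed or partitioned, even a highly selective filter forces a full scan.

```python
"""Back-of-the-envelope cost of a self-service query on a data lake layer.

All names and figures are illustrative assumptions, not real pricing.
"""

TB = 10**12  # bytes per (decimal) terabyte, kept simple on purpose


def bytes_scanned(table_bytes: int, pruned: bool, selectivity: float) -> int:
    """Return how many bytes the engine has to read.

    Without an index or partition pruning, a filter does not reduce the
    amount of data read: the whole table is scanned, however selective
    the WHERE clause is.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    if not pruned:
        return table_bytes
    return round(table_bytes * selectivity)


def scan_cost(table_bytes: int, price_per_tb: float,
              pruned: bool, selectivity: float) -> float:
    """Estimate the cost of one query under per-terabyte-scanned pricing."""
    return bytes_scanned(table_bytes, pruned, selectivity) / TB * price_per_tb


def within_budget(cost: float, budget: float) -> bool:
    """Simple guard a platform team might apply before running a query."""
    return cost <= budget


if __name__ == "__main__":
    table = 50 * TB   # hypothetical table in the "clean" layer
    price = 5.0       # hypothetical price per terabyte scanned
    for pruned in (False, True):
        cost = scan_cost(table, price, pruned=pruned, selectivity=0.01)
        print(f"pruned={pruned}: estimated cost {cost:.2f}, "
              f"within budget of 100: {within_budget(cost, 100.0)}")
```

With these assumed figures, the same query that keeps only 1% of the rows costs a hundred times more on an unindexed table, which is why such queries are usually left to specialists who know how the data is laid out.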

 

As for the “business” layers, they are generally coupled with more specialized solutions (such as a combination of Snowflake and Tableau for analytics) that offer complete, secure tooling and great performance for self-service queries. With their market space melting like snow in the sun, some multi-source query vendors have pivoted towards Data Catalogs.

Their pitch is now to convince customers that the ability to execute queries makes their solution the Rolls-Royce of Data Catalogs (in order to justify their six-figure pricing). We would invite you to think twice about it…

 

Take Away

In a modern data architecture, the capacity to execute queries from a Data Catalog isn’t just unnecessary, it’s also very risky (performance, cost, security, etc.).

Data teams already have their own tools to execute queries on data, and if they don’t, it may be a good idea to equip them. Folding data access issues into the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.

Download our eBook: The 7 lies of Data Catalog Providers for more!

At Zeenea, we work hard to create a data fluent world by providing our customers with the tools and services that allow enterprises to be data driven.
