The 7 lies of Data Catalog Providers – #4 A Data Catalog is not a Query Solution!

July 2, 2021

The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

 These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Query Solution

 

Here is another oddity of the Data Catalog market. Several vendors, whose initial aim was to allow users to query several data sources simultaneously, have “pivoted” towards a Data Catalog positioning.

There is a reason for them to pivot.

The emergence of Data Lakes and Big Data has cornered them in a technological cul-de-sac and weakened the market segment they were initially in.

A Data Lake is typically segmented into several layers. The “raw” layer integrates data without transformation, in formats that are more or less structured and in great quantities. A second layer, which we’ll call “clean”, contains roughly the same data but in normalized formats, after a dust-down. After that, there can be one or several “business” layers ready for use: a data warehouse and visualization tool for analytics, a Spark cluster for data science, a storage system for commercial distribution, etc. Within these layers, data is transformed, aggregated and optimized for use, along with the tools supporting this use (data visualization tools, notebooks, massive processing, etc.).

 

In this landscape, a universal self-service query tool isn’t suitable.

 

It is of course possible to set up an SQL interpretation layer on top of the “clean” layer (like Hive) but query execution remains a domain for specialists. The volumes of data are huge and rarely indexed. 

Allowing users to define their own queries is very risky. On on-prem systems, a single very expensive query can bring down the cluster. And on the Cloud, where usage is billed, the bill could run very high indeed. Not to mention security and data sensitivity issues.
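To give a rough sense of the cloud cost risk, here is a minimal back-of-the-envelope sketch. It assumes a pricing model where queries are billed by the volume of data scanned; the price per TiB, the table size and the function name `estimate_scan_cost` are all hypothetical figures and names chosen for illustration, not any vendor's actual rates or API.

```python
def estimate_scan_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Return the cost of a query billed by the volume of data it scans."""
    tib_scanned = bytes_scanned / 2**40
    return tib_scanned * price_per_tib


# Hypothetical figures, for illustration only.
PRICE_PER_TIB = 5.0               # assumed price, in dollars per TiB scanned
TABLE_SIZE_BYTES = 200 * 2**40    # assumed 200 TiB unindexed table in the "clean" layer

# Without indexes or partition filters, a self-service query may scan the whole table.
full_scan_cost = estimate_scan_cost(TABLE_SIZE_BYTES, PRICE_PER_TIB)

# The same query scheduled once a day for 30 days.
monthly_cost = full_scan_cost * 30

print(f"one full scan: ${full_scan_cost:,.2f}")
print(f"same query daily for 30 days: ${monthly_cost:,.2f}")
```

Under these assumed numbers, the cost grows linearly with the data scanned, which is why an unfiltered query written by a non-specialist on a huge, unindexed layer can become expensive very quickly.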

 

As for the “business” layers, they are generally coupled with more specialized solutions (such as a combination of Snowflake and Tableau for analytics) that offer very complete and secured tooling, offering great performance for self-service queries. With their market space shrinking like snow in the sun, some multi-source query vendors have pivoted towards Data Catalogs.

Their pitch is now to convince customers that the ability to execute queries makes their solution the Rolls-Royce of Data Catalogs (in order to justify their six-figure pricing). We would invite you to think twice about it…

 

Take Away

On a modern data architecture, the capacity to execute queries from a Data Catalog isn’t just unnecessary, it’s also very risky (performance, cost, security, etc.).

Data teams already have their own tools to execute queries on data, and if they don’t, it may be a good idea to equip them. Bundling data access issues into the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.

Download our eBook: The 7 lies of Data Catalog Providers for more!

At Zeenea, we work hard to create a data fluent world by providing our customers with the tools and services that allow enterprises to be data driven.

Soc 2 Type 2
Iso 27001
© 2024 Zeenea - All Rights Reserved