The Data Catalog market has developed rapidly, and a Data Catalog is now deemed essential when deploying a data-driven strategy. A victim of its own success, this market has attracted a number of players from adjacent markets.
These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.
The reality is that these companies are relatively weak on data catalog functionality itself. They nevertheless attempt to convince buyers, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams but an integrated solution able to address a host of other topics.
The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.
Here are, in our opinion, the 7 lies of the Data Catalog vendors:
- A Data Catalog is a Data Governance platform,
- A Data Catalog can measure and manage data quality,
- A Data Catalog can manage regulatory compliance,
- A Data Catalog can query data directly,
- A Data Catalog can model logical architecture and business processes around data,
- A Data Catalog is a collaborative cartography and metadata management tool that cannot be automated,
- A Data Catalog is a long, complex, and expensive project.
A Data Catalog is NOT a Query Solution
Here is another oddity of the Data Catalog market. Several vendors, whose initial aim was to let users query several data sources simultaneously, have “pivoted” towards a Data Catalog positioning.
There is a reason for them to pivot.
The emergence of Data Lakes and Big Data has backed them into a technological cul-de-sac and weakened the market segment they were initially in.
A Data Lake is typically segmented into several layers. The “raw” layer integrates data without transformation, in formats that are more or less structured and in great quantities. A second layer, which we’ll call “clean”, contains roughly the same data but in normalized formats, after a dust down. After that, there can be one or several “business” layers ready for use: a data warehouse and visualization tool for analytics, a Spark cluster for data science, a storage system for commercial distribution, etc. Within these layers, data is transformed, aggregated and optimized for use, along with the tools supporting this use (data visualization tools, notebooks, massive processing, etc.).
In this landscape, a universal self-service query tool isn’t suitable.
It is of course possible to set up a SQL interpretation layer on top of the “clean” layer (like Hive), but query execution remains a domain for specialists: the volumes of data are huge and rarely indexed.
Allowing users to define their own queries is very risky. On on-prem systems, a single very expensive query can bring down the cluster. In the Cloud, the bill could run very high indeed. Not to mention security and data sensitivity issues.
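To give a rough sense of how the Cloud bill can climb, here is a minimal back-of-the-envelope sketch. It assumes a query engine billed by the volume of data scanned; the table size, the price per TiB, and the number of queries are all hypothetical figures chosen for illustration, not any vendor's actual rates.

```python
def scan_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Cost of one query when billing is proportional to bytes scanned."""
    tib_scanned = bytes_scanned / 2**40
    return tib_scanned * price_per_tib


# Hypothetical figures, for illustration only.
TABLE_SIZE_BYTES = 50 * 2**40   # a 50 TiB unindexed table in the "clean" layer
PRICE_PER_TIB = 5.0             # assumed price per TiB scanned
QUERIES_PER_DAY = 20            # assumed number of ad hoc full-scan queries

# Without an index or partition filter, each query reads the whole table.
cost_per_query = scan_cost(TABLE_SIZE_BYTES, PRICE_PER_TIB)
cost_per_day = cost_per_query * QUERIES_PER_DAY

print(f"Cost per full-scan query: {cost_per_query:.2f}")
print(f"Cost per day for {QUERIES_PER_DAY} such queries: {cost_per_day:.2f}")
```

Under these assumptions, a handful of casual users running unfiltered queries is enough to generate a significant daily cost, which is why query execution on this layer is usually left to specialists.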
As for the “business” layers, they are generally coupled with more specialized solutions (such as a combination of Snowflake and Tableau for analytics) that provide complete, secure tooling and great performance for self-service queries. With their market space melting like snow in the sun, some multi-source query vendors have pivoted towards Data Catalogs.
Their pitch is now to convince customers that the ability to execute queries makes their solution the Rolls-Royce of Data Catalogs (in order to justify their six-figure pricing). We would invite you to think twice about it…
In a modern data architecture, the capacity to execute queries from a Data Catalog isn’t just unnecessary, it’s also very risky (performance, cost, security, etc.).
Data teams already have their own tools to execute queries on data, and if they don’t, it may be a good idea to equip them. Folding data access issues into the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.