The Data Catalog market has developed rapidly, and a Data Catalog is now deemed essential when deploying a data-driven strategy. A victim of its own success, this market has attracted a number of players from adjacent markets.
These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.
The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.
The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.
Here are, in our opinion, the 7 lies of the Data Catalog vendors:
- A Data Catalog is a Data Governance platform,
- A Data Catalog can measure and manage data quality,
- A Data Catalog can manage regulatory compliance,
- A Data Catalog can query data directly,
- A Data Catalog can model logical architecture and business processes around data,
- A Data Catalog is a collaborative cartography and metadata management tool that cannot be automated,
- A Data Catalog is a long, complex, and expensive project.
A Data Catalog must rely on automation!
Some Data Catalog vendors, who hail from the world of cartography, have developed the rhetoric that automation is a secondary topic, which can be addressed at a later stage.
They will tell you that a few manual file imports suffice, along with a generous user community collaborating on their tool to feed and use the catalog. A little arithmetic is enough to understand why this approach is doomed to failure in a data-centric organization.
An active Data Lake, even a modest one, quickly hoovers up, in its different layers, hundreds and even thousands of datasets. To these can be added the datasets from other systems (application databases, various APIs, CRMs, ERPs, NoSQL stores, etc.), which we usually want to integrate into the catalog.
The orders of magnitude quickly go beyond thousands, sometimes tens of thousands, of datasets. Each dataset contains dozens of fields. Datasets and fields alone represent several hundred thousand objects (we could also include other assets: ML models, dashboards, reports, etc.). For the catalog to be useful, inventorying those objects isn’t enough.
You also need to attach to them all the properties (metadata) that will enable end users to find, understand, and exploit these assets. There are several types of metadata: technical information, business classification, semantics, security, sensitivity, quality, norms, uses, popularity, contacts, etc. Here again, for each asset, there are dozens of properties.
Back to the arithmetic: overall, we are dealing with millions of attributes that need to be maintained.
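To make the arithmetic concrete, here is a minimal sketch. The three input figures are illustrative assumptions picked to match the orders of magnitude described above (tens of thousands of datasets, dozens of fields, dozens of properties); they are not measurements from any particular organization.

```python
# Illustrative figures only: assumptions chosen to match the orders of
# magnitude discussed in the text, not real measurements.
datasets = 10_000            # low end of "tens of thousands of datasets"
fields_per_dataset = 30      # "dozens of fields"
properties_per_object = 20   # "dozens of properties"


def attribute_count(datasets, fields_per_dataset, properties_per_object):
    """Number of metadata values to maintain for datasets and their fields."""
    # Every dataset and every field is an object in the catalog.
    objects = datasets + datasets * fields_per_dataset
    # Every object carries its own set of properties.
    return objects * properties_per_object


total = attribute_count(datasets, fields_per_dataset, properties_per_object)
print(f"{total:,} attributes")  # prints "6,200,000 attributes"
```

Even with these fairly conservative inputs, and without counting ML models, dashboards, or reports, the total lands in the millions.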
Such volumes alone should disqualify any temptation to choose the manual approach. But there is more. The stock of informational assets isn’t static. It is constantly growing. In a data-centric organization, datasets are created daily, others are moved or changed.
The Data Catalog needs to reflect these changes.
Otherwise, its content will be permanently obsolete and the end users will reject it. Who is going to trust a Data Catalog that is incomplete and wrong? If you feel that your organization can absorb the load and keep your catalog up to date, that’s wonderful. Otherwise, we would suggest you assess as early as possible the level of automation provided by the different solutions you are looking at.
What can we automate in a Data Catalog?
In terms of automation, the most important capacity is the inventory.
A Data Catalog should be able to regularly scan all your data sources and automatically update the asset inventory (datasets, structures and technical metadata at a minimum) to reflect the day-to-day reality of the hosting systems.
Believe us: a Data Catalog that cannot connect to your data sources will quickly become useless, because its content will always be in doubt.
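As a rough illustration of what such a scan boils down to, the sketch below compares what the catalog knows with what a source currently exposes, then brings the catalog back in line. The function names and the dictionary layout (dataset name mapped to a set of field names) are assumptions made for this example; a real catalog would read the structures through connectors to each source.

```python
def diff_inventory(catalog, source):
    """Compare two inventories mapping dataset name -> set of field names."""
    added = sorted(source.keys() - catalog.keys())
    removed = sorted(catalog.keys() - source.keys())
    changed = sorted(
        name
        for name in source.keys() & catalog.keys()
        if set(source[name]) != set(catalog[name])
    )
    return {"added": added, "removed": removed, "changed": changed}


def apply_scan(catalog, source):
    """Update the catalog in place so it mirrors the source; return the changes."""
    changes = diff_inventory(catalog, source)
    for name in changes["removed"]:
        del catalog[name]
    for name in changes["added"] + changes["changed"]:
        catalog[name] = set(source[name])
    return changes


# Hypothetical state: the catalog is one scan behind the source.
catalog = {"sales.orders": {"id", "amount"}, "sales.legacy": {"id"}}
source = {
    "sales.orders": {"id", "amount", "currency"},
    "crm.contacts": {"id", "email"},
}

changes = apply_scan(catalog, source)
print(changes)
```

Run on a schedule against every connected source, this kind of comparison is what keeps new, removed, and altered datasets from silently drifting out of the catalog.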
Once the inventory is completed, the next challenge is to automate the metamodel feed.
Here, beyond the technical metadata, complete automation seems a little hard to imagine. It is still possible to significantly reduce the necessary workload for the maintenance of the metamodel. The value of certain properties can be determined by simply applying rules at the time of the integration of the objects in the catalog.
It is also possible to suggest property values using more or less sophisticated algorithms (semantic analysis, pattern matching, etc.).
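The two mechanisms can be sketched in a few lines. Everything below is hypothetical: the rules, the regular expressions, and the dataset layout are invented for illustration. Rules set property values outright when an object enters the catalog, while pattern matching on field names only produces suggestions that a data steward would still confirm.

```python
import re

# Hypothetical rules: (condition on the dataset, property name, property value).
RULES = [
    (lambda ds: ds["schema"].startswith("raw_"), "layer", "raw"),
    (lambda ds: ds["source"] == "crm", "domain", "customer"),
]

# Hypothetical patterns used to suggest a sensitivity tag from a field name.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}


def apply_rules(dataset):
    """Return the properties that the rules assign to this dataset."""
    props = {}
    for condition, key, value in RULES:
        if condition(dataset):
            props[key] = value
    return props


def suggest_tags(field_names):
    """Return {field name: suggested tag} for fields matching a pattern."""
    suggestions = {}
    for name in field_names:
        for tag, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(name):
                suggestions[name] = tag
                break  # keep the first matching tag only
    return suggestions


dataset = {
    "schema": "raw_crm",
    "source": "crm",
    "fields": ["id", "customer_email", "phone_number", "created_at"],
}

properties = apply_rules(dataset)
suggestions = suggest_tags(dataset["fields"])
print(properties)
print(suggestions)
```

Keeping the suggestions separate from the rule-based properties reflects the point made above: beyond technical metadata, automation reduces the workload rather than removing the human from the loop.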
Lastly, it’s often possible to feed part of the catalog by integrating with the systems that produce or contain metadata. This can apply, for instance, to quality measurements, lineage information, business ontologies, etc.
For this approach to work, the Data Catalog must be open and offer a complete set of APIs that allow the metadata to be updated from other systems.
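To give an idea of what "open" means in practice, here is a toy, in-memory stand-in for such an API. The `Catalog` class, its `upsert_metadata` method, and the property names are all hypothetical; an actual product would expose something comparable over HTTP with authentication, but the principle is the same: other systems push the metadata they own, and the catalog records where each value came from.

```python
class Catalog:
    """Toy in-memory catalog standing in for a metadata update API."""

    def __init__(self):
        self._assets = {}

    def upsert_metadata(self, asset_id, source_system, properties):
        """Create or update properties on an asset, recording their origin."""
        entry = self._assets.setdefault(asset_id, {})
        for key, value in properties.items():
            entry[key] = {"value": value, "updated_by": source_system}
        return entry

    def get(self, asset_id):
        """Return the stored properties for an asset (empty if unknown)."""
        return self._assets.get(asset_id, {})


catalog_api = Catalog()

# A (hypothetical) data quality tool pushes its latest measurement.
catalog_api.upsert_metadata("sales.orders", "quality-tool", {"completeness": 0.98})

# A (hypothetical) lineage tool pushes upstream dependencies for the same asset.
catalog_api.upsert_metadata("sales.orders", "lineage-tool", {"upstream": ["raw.orders"]})

print(catalog_api.get("sales.orders"))
```

Recording the originating system alongside each value is a design choice worth looking for: it lets users see which properties are machine-fed and which were entered by hand.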
A Data Catalog handles millions of pieces of information in a constantly shifting landscape.
Maintaining this information manually is virtually impossible, or extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.