A Data Catalog must reply on automation!
Some Data Catalog vendors, who hail from the world of cartography, have developed the rhetoric that automation is a secondary topic, which can be addressed at a later stage.
They will tell you that a few manual file imports suffice, along with a generous user community collaborating on their tool to feed and use the catalog. A little arithmetic is enough to understand why this approach is doomed to failure in a data-centric organization.
An active Data Lake, even a modest one, quickly hoovers up, in its different layers, hundreds and even thousands of datasets. Along with these datasets, can be added those from other systems (database applications, various APIs, CRMs, ERPs, noSQL, etc) which we usually want to integrate in the catalog.
The orders of magnitude quickly go beyond thousands, sometimes tens of thousands of datasets. Each dataset contains dozens of fields. Datasets and fields alone represent several hundreds of thousands of objects (we could also include other assets: ML models, dashboards, reports, etc). In order for the catalog to be useful, inventorying those objects isn’t enough.
You also need to combine with them all the properties (metadata) which will enable end users to find, understand, and exploit these assets. There are several types of metadata: technical information, business classification, semantics, security, sensitivity, quality, norms, uses, popularity, contacts, etc. Here again, for each asset, there are dozens of properties.
Back to the arithmetics: Overall, we are dealing with millions of attributes needing to be maintained.
Such volumes alone should disqualify any temptation to choose the manual approach. But there is more. The stock of informational assets isn’t static. It is constantly growing. In a data-centric organization, datasets are created daily, others are moved or changed.
The Data Catalog needs to reflect these changes.
Otherwise, its content will be permanently obsolete and the end users will reject it. Who is going to trust a Data Catalog that is incomplete and wrong? If you feel that your organization can absorb the load and keep your catalog up to date, that’s wonderful. Otherwise, we would suggest you monitor as quickly as possible the level of automation provided by the different solutions you are looking at.
What can we automate in a Data catalog?
In terms of automation, the most important capacity is the inventory.
A Data Catalog should be able to regularly scan all your data sources and automatically update the asset inventory (datasets, structures and technical metadata at a minimum) to reflect the day-to- day reality of the hosting systems.
Believe us: a Data Catalog that cannot connect to your da ta sources will quickly become useless, because its content will always be in doubt.
Once the inventory is completed, the next challenge is to automate the metamodel feed.
Here, beyond the technical metadata, complete automation seems a little hard to imagine. It is still possible to significantly reduce the necessary workload for the maintenance of the metamodel. The value of certain properties can be determined by simply applying rules at the time of the integration of the objects in the catalog.
It is also possible to suggest property values using more or less sophisticated algorithms (semantic analysis, pattern matching, etc.).
Lastly, it’s often possible to feed part of the catalog by integrating the systems that produce or contain metadata. This can apply for instance for quality measurement, for lineage information, for business ontologies, etc.
For this approach to work, the Data Catalog must be open and offer a complete set of APIs that allow the metadata to be updated from other systems.
Take Away
A Data Catalog handles millions of information in a constantly shifting landscape.
Maintaining this information manually is virtually impossible, or extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.