The Data Catalog market has developed rapidly, and a Data Catalog is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.
These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.
The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince buyers, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams but an integrated solution able to address a host of other topics.
The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.
Here are, in our opinion, the 7 lies of the Data Catalog vendors:
- A Data Catalog is a Data Governance platform,
- A Data Catalog can measure and manage data quality,
- A Data Catalog can manage regulatory compliance,
- A Data Catalog can query data directly,
- A Data Catalog can model logical architecture and business processes around data,
- A Data Catalog is a collaborative cartography and metadata management tool that cannot be automated,
- A Data Catalog is a long, complex, and expensive project.
A Data Catalog is NOT a Data Governance Solution
This is probably our most controversial stance on the role of a Data Catalog. The controversy originates in the powerful marketing messages pumped out by the world leader in metadata management, whose solution is in reality a data governance platform being sold as a Data Catalog.
To be clear, having sound data governance is one of the pillars of an effective data strategy. Governance, however, has little to do with tooling.
Its main purpose is the definition of roles, responsibilities, company policies, procedures, controls, committees, and so on. In a nutshell, its function is to deploy and orchestrate, in its entirety, the internal control of data in all its dimensions.
Let’s just acknowledge that data governance has many different aspects (processing and storage architecture, classification, retention, quality, risk, conformity, innovation, etc.) and that there is no universal “one-size-fits-all” model suited to every organization. As in other governance domains, each organization must design and steer its own landscape based on its capacities and ambitions, as well as on a thorough risk analysis.
Putting effective data governance in place is not a project but a transformation program.
No commercial “solution” can replace that transformation effort.
So where does the Data Catalog fit into all this?
The quest for a Data Catalog is usually the result of a very operational requirement: once the Data Lake and a number of self-service tools are set up, the next challenge quickly becomes finding out what the Data Lake actually contains (from both a technical and a semantic perspective), where the data comes from, what transformations the data may have undergone, who is in charge of the data, what internal policies apply to the data, who is currently using the data and why, and so on.
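To make those questions concrete, here is a minimal sketch of the kind of record a catalog might keep for each dataset, along with a naive keyword search. Everything in it is an assumption made for illustration: the `CatalogEntry` class, its field names, the `search` helper, and the sample datasets are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one dataset (illustrative field names)."""
    name: str
    technical_description: str   # what the dataset contains, technically
    business_description: str    # what the dataset means, semantically
    source: str                  # where the data comes from
    transformations: List[str] = field(default_factory=list)  # steps applied so far
    owner: str = ""              # who is in charge of the data
    policies: List[str] = field(default_factory=list)         # internal policies that apply
    usages: Dict[str, str] = field(default_factory=dict)      # who uses it -> why


def search(entries: List[CatalogEntry], keyword: str) -> List[CatalogEntry]:
    """Return entries whose name or descriptions mention the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        entry for entry in entries
        if needle in " ".join(
            (entry.name, entry.technical_description, entry.business_description)
        ).lower()
    ]


# Made-up sample data, purely for illustration.
entries = [
    CatalogEntry(
        name="orders_raw",
        technical_description="Parquet files, one row per order line",
        business_description="Customer orders captured by the web shop",
        source="web shop database export",
        transformations=["deduplicated on order id"],
        owner="sales data team",
        policies=["retain for 5 years"],
        usages={"finance analysts": "monthly revenue reporting"},
    ),
    CatalogEntry(
        name="sensor_readings",
        technical_description="CSV files, one row per measurement",
        business_description="Temperature readings from warehouse sensors",
        source="IoT gateway",
        owner="logistics team",
    ),
]

matches = search(entries, "customer")
print([entry.name for entry in matches])
```

The point of the sketch is that each field answers one of the questions listed above, and that the catalog's core job is retrieval over this metadata rather than enforcement of the policies it records.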
An inability to provide this type of information to end users can have serious consequences for an organization, and a Data Catalog is the best means to mitigate that risk. Because choosing such a cross-functional solution involves people from many different departments, the selection is often handed to those in charge of data governance, as they appear to be in the best position to coordinate the expectations of the largest number of stakeholders.
This is where the alchemy begins. The Data Catalog, whose initial purpose was to provide data teams with a quick solution to discover, explore, understand, and exploit the data, becomes a gargantuan project in which all aspects of governance have to be solved.
The project will be expected to:
- Manage data quality,
- Manage personal data and compliance (GDPR first and foremost),
- Manage confidentiality, security, and data access,
- Propose a new Master Data Management (MDM),
- Ensure field-by-field automated lineage for all datasets,
- Support all the roles as defined in the system of governance and enable the relevant workflow configuration,
- Integrate all the business models produced in the last 10 years for the urbanization program,
- Authorize cross-source querying while complying with user access rights on those same sources, as well as anonymizing the results.
Certain vendors manage to convince their clients that their solution can be this unique one-stop shop for data governance. If you believe this is possible, by all means call them; they will gladly oblige. But to be frank, we at Zeenea simply do not believe such a platform is possible, or even desirable. Too complex, too rigid, too expensive, and too bureaucratic, this kind of solution can never be adapted to a data-centric organization.
For us, the Data Catalog plays a key role in a data governance program. This role should not involve supporting all aspects of governance but should rather be utilized to facilitate communication and awareness of governance rules within the company and to help each stakeholder become an active part of this governance.
In our opinion, a Data Catalog is one of the components that deliver the biggest return on investment in data-centric organizations that rely on Data Lakes with modern data pipelines, provided it can be deployed quickly and is reasonably priced.
A Data Catalog is not a data governance management platform.
Data governance is essentially a transformation program with multiple layers that cannot be addressed by one single solution. In a data-centric organization, the best way to start, learn, educate, and remain agile is to blend clear governance guidelines with a modern Data Catalog that can share those guidelines with the end users.