Introduction: what is data mesh?
As companies become more aware of the importance of their data, they are rethinking their business strategies to unleash the full potential of their information assets. The challenge of storing data has gradually led to the emergence of various solutions (data marts, data warehouses, data lakes) to absorb increasingly large volumes of data. The goal? To centralize data assets and make them available to as many people as possible, breaking down company silos.
But companies are still struggling to meet business needs. The speed of data production and transformation, and the growing complexity of the data (nature, origin, etc.), are straining the scalability of such a centralized organization. This centralized data evolves into an ocean of information in which data management teams cannot respond effectively to the demands of the business; only a few expert teams can.
This is even more true in a context where companies are the result of mergers, takeovers, or are organized into subsidiaries. Building a common vision and organization between all the entities can be complex and time-consuming.
With this in mind, Zhamak Dehghani developed the concept of "Data Mesh", proposing a paradigm shift in the management of analytical data, with a decentralized approach.
Data Mesh is indeed not a technological solution but rather a business goal, a “North Star” as Mick Lévy calls it, that must be followed to meet the challenges facing companies in the current context:
- Respond to the complexity, volatility, and uncertainty of the business
- Maintain agility in the face of growth
- Accelerate the production of value, in proportion to the investment
How the Data Catalog facilitates the implementation of a Data Mesh approach
The purpose of a data catalog is to map all of the company’s data and make it available to technical & business teams in order to facilitate their exploitation, collaboration around their uses and thus, maximize and accelerate the creation of business value.
In an organization like Data Mesh, where data is stored in different places and managed by different teams, the challenge of a data catalog is to ensure a central access point to all company data resources.
But to do this, the data catalog must support the four fundamental principles of the Data Mesh which are:
- Domain-driven ownership of data
- Data as a product
- Self-serve data platform
- Federated computational governance
Domain-driven ownership of data
The first principle of Data Mesh is to decentralize responsibilities around data. The company must first define business domains, with more or less granularity depending on its context and use cases (e.g. Production, Distribution, Logistics, etc.).
Each domain then becomes responsible for the data it produces, and gains the autonomy to manage and derive value from growing volumes of data more easily. Data quality notably improves, as business expertise is applied as close to the source as possible.
This approach calls into question the relevance of a centralized Master Data Management system offering a single model of the data, which is exhaustive but consequently complex to understand by data consumers and difficult to maintain over time.
Business teams can rely on the data catalog to inventory their data and describe their business perimeter through a model oriented by the specific uses of each domain.
This modeling must be accessible through a business glossary associated with the data catalog. While remaining a single source of truth, this business glossary must allow the different facets of the data to be reflected according to the uses and needs of each domain.
For example, if the concept of “product” is familiar to the entire company, its attributes will not be of the same interest if it is used for logistics, design or sales.
A graph-based business glossary is therefore more appropriate, because of the flexibility and the modeling and exploration capabilities it offers compared to a predefined hierarchical approach. While ensuring the overall consistency of this semantic layer across the enterprise, it allows data managers to better take into account the specificities of their respective domains.
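To make the "product" example concrete, here is a minimal sketch of how a graph-based glossary can keep one shared term while exposing domain-specific facets. All class, term, and attribute names are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of a graph-based business glossary: one shared term
# ("Product") carries facets as labeled edges, some tagged with a domain,
# so each domain sees its own view while the node stays a single source
# of truth. All names here are illustrative.
from collections import defaultdict

class Glossary:
    def __init__(self):
        # edges[term] -> list of (relation, target, domain); domain=None means shared
        self.edges = defaultdict(list)

    def relate(self, term, relation, target, domain=None):
        self.edges[term].append((relation, target, domain))

    def view(self, term, domain):
        """Facets of `term` visible to `domain` (shared facets included)."""
        return [(rel, tgt) for rel, tgt, dom in self.edges[term]
                if dom is None or dom == domain]

glossary = Glossary()
glossary.relate("Product", "has_attribute", "SKU")  # shared across domains
glossary.relate("Product", "has_attribute", "PackagingWeight", "Logistics")
glossary.relate("Product", "has_attribute", "UnitPrice", "Sales")

print(glossary.view("Product", "Logistics"))
# [('has_attribute', 'SKU'), ('has_attribute', 'PackagingWeight')]
```

A hierarchical glossary would force one canonical attribute list per term; the labeled edges above let Logistics and Sales project their own view without duplicating the "Product" node.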
The data catalog must therefore enable the various domains to collaborate in defining and maintaining the metamodel and the documentation of their assets, in order to ensure their quality.
To do this, the data catalog must also offer a suitable permission management system, so that responsibilities can be divided up unambiguously and each domain manager can take charge of the documentation of their scope.
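Such domain-scoped permissions can be sketched in a few lines: each asset belongs to a domain, and only curators registered for that domain may edit its documentation. The asset, domain, and user names here are hypothetical.

```python
# Sketch of domain-scoped catalog permissions: documentation rights follow
# domain ownership, not a central admin team. All names are illustrative.
ASSET_DOMAIN = {"orders_2024": "Sales", "shipments": "Logistics"}
DOMAIN_CURATORS = {"Sales": {"alice"}, "Logistics": {"bob"}}

def can_edit_documentation(user: str, asset: str) -> bool:
    """A user may edit documentation only within their own domain's scope."""
    domain = ASSET_DOMAIN.get(asset)
    return domain is not None and user in DOMAIN_CURATORS.get(domain, set())

print(can_edit_documentation("alice", "orders_2024"))  # True
print(can_edit_documentation("alice", "shipments"))    # False
```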
Data as a product
The second principle of the Data Mesh is to think of data not as an asset but as a product with its own user experience and lifecycle. The purpose is to avoid recreating silos in the company due to the decentralization of responsibilities.
Each domain is responsible for making one or more data products available to the other domains. But beyond this company-wide objective, thinking of data as a product leads to an approach centered on the expectations and needs of end users: Who consumes the data? In what format(s)? With what tools? How can user satisfaction be measured?
Indeed, with a centralized approach, companies respond to business users' needs, and scale up, more slowly. Data Mesh therefore contributes to spreading a data culture by reducing the steps needed to exploit the data.
According to Zhamak Dehghani, a data product should meet several criteria, and the data catalog helps satisfy some of them:
Discoverable: The first step for a data analyst, data scientist, or any other data consumer is to know what data exists and what types of insights they can exploit. The data catalog addresses this with an intelligent search engine that supports keyword searches, tolerates typing and syntax errors, and offers smart suggestions and advanced filtering capabilities. The data catalog must also offer personalized exploration paths to better promote the various data products. Finally, the search and navigation experience in the catalog must be simple and based on market standards such as Google or Amazon, to ease the onboarding of non-technical users.
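The typo-tolerance requirement can be illustrated with the standard library alone: exact substring hits first, then fuzzy matches to absorb typing errors. A real catalog search engine is far richer (ranking, facets, synonyms); this sketch, with invented asset names, only shows the principle.

```python
# Sketch of typo-tolerant catalog search: exact substring matches are
# returned first, then difflib's fuzzy matching absorbs typing errors.
import difflib

CATALOG = ["customer_orders", "product_inventory", "shipment_tracking"]

def search(query: str, threshold: float = 0.6):
    exact = [name for name in CATALOG if query.lower() in name]
    fuzzy = difflib.get_close_matches(query.lower(), CATALOG, cutoff=threshold)
    seen, results = set(), []
    for name in exact + fuzzy:  # keep order, drop duplicates
        if name not in seen:
            seen.add(name)
            results.append(name)
    return results

print(search("invetory"))  # the typo still finds 'product_inventory'
```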
Understandable: Data must be easily understood and consumed. It is also one of the missions of the data catalog: to provide all the context necessary to understand the data. This includes a description, associated business concepts, classification, relationships with other data products, etc. Business areas can use the data catalog to make consumers as autonomous as possible in understanding their data products. A plus would be integration with data tools or sandboxes to better understand the behavior of the data.
Trustworthy: Consumers need to trust the data they use. Here again, the data catalog plays an important role. A data catalog is not a data quality tool, but quality indicators (completeness, update frequency, etc.) must be retrieved and updated automatically in the catalog so they can be exposed to users. The data catalog should also be able to provide statistical information on the data and reconstruct its lineage, to understand its origin and its various transformations over time.
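Lineage reconstruction amounts to walking "derived from" edges that the catalog harvests from pipelines. A minimal sketch, with hypothetical dataset names:

```python
# Sketch of lineage reconstruction: the catalog stores "derived from"
# edges, and an upstream traversal answers "where does this data come
# from?". Dataset names are illustrative.
UPSTREAM = {
    "sales_dashboard": ["monthly_revenue"],
    "monthly_revenue": ["orders_raw", "fx_rates"],
}

def origins(dataset):
    """All transitive upstream sources of `dataset`."""
    found, stack = set(), [dataset]
    while stack:
        node = stack.pop()
        for parent in UPSTREAM.get(node, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return sorted(found)

print(origins("sales_dashboard"))
# ['fx_rates', 'monthly_revenue', 'orders_raw']
```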
Accessible natively: A data product should be delivered in the format expected by the different personas (data analysts, data scientists, etc.). The same data product can therefore be delivered in several formats, depending on the uses and skills of the targeted users. It should also be easy to interface with the tools they use. On this point, however, the catalog has no particular role to play.
Valuable: One of the keys to the success of a data product is that it can be consumed independently, that it is meaningful in itself. It must be designed to limit the need to make joins with other data products, in order to deliver measurable value to its consumers.
Addressable: Once the consumer has found the data product they need in the catalog, they must be able to access it or request access to it in a simple, easy and efficient way. To do so, the data catalog must be able to connect with policy enforcement systems that facilitate and accelerate access to the data by automating part of the work.
Secure: This point is related to the previous one. Users must be able to access data easily but securely, according to the policies set up for access rights. Here again, the integration of the data catalog with a policy enforcement solution facilitates this aspect.
Interoperable: In order to facilitate exchanges between domains and to, once again, avoid silos, data products must meet the standards defined at the enterprise level to easily consume any type of data product and integrate them with each other. The data catalog must be able to share the data product’s metadata to interconnect domains through APIs.
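The kind of standardized, machine-readable metadata a catalog API could expose per data product might look like the JSON below. The field names are assumptions for illustration, not a published standard; what matters is that every domain publishes the same fields so products can be discovered and integrated uniformly.

```python
# Sketch of a standardized data-product metadata record that a catalog
# API could serve to interconnect domains. Field names are assumptions.
import json

data_product = {
    "name": "monthly_revenue",
    "domain": "Sales",
    "version": "1.2.0",
    "output_ports": [  # the same product, delivered in several formats
        {"format": "parquet", "location": "s3://sales/monthly_revenue/"},
        {"format": "sql_view", "location": "warehouse.sales.monthly_revenue"},
    ],
    "quality": {"completeness": 0.98, "refresh": "daily"},
    "owner": "sales-data-team@example.com",
}

payload = json.dumps(data_product, indent=2)
print(payload)
```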
Self-serve data infrastructure
In a Data Mesh organization, the business domains are responsible for making data products available to the entire company. But to achieve this objective, the domains must have services that facilitate this implementation and automate management tasks as much as possible. These services must make the domains as independent as possible from the infrastructure teams.
In a decentralized organization, this service layer also helps reduce costs, especially those related to the workload of data engineers, a resource that is difficult to find.
The data catalog is part of this abstraction layer, allowing business domains to easily inventory the data sources for which they are responsible. To do this, the catalog must itself offer a wide range of connectors supporting the various technologies (storage, transformation, etc.) used by the domains, and automate curation tasks as much as possible.
Via easy-to-use APIs, the data catalog also enables domains to easily synchronize their business or technical repositories, connect their quality management tools, and so on.
Federated computational governance
Data Mesh offers a decentralized approach to data management where domains gain some sovereignty. However, the implementation of a federated governance ensures the global consistency of governance rules, the interoperability of data products and monitoring at the scale of the Data Mesh.
The Data Office acts more as a facilitator, transmitting governance principles and policies, than as a controller. The CDO is no longer responsible for quality or security itself, but for defining what constitutes quality, security, and so on. Domain managers take over locally to apply these principles.
This paradigm shift is made possible by automating the application of governance policies. Their application is thus faster than in a centralized approach, because it happens as close to the source as possible.
The data catalog can be used to share governance principles and policies that can be documented or listed in the catalog, and linked to the data products to which they apply. It will also provide metadata to the systems responsible for automating the setting up of the rules and policies.
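"Computational" governance can be sketched as policies declared once, linked to data products through catalog metadata, and evaluated automatically. The policy names, product fields, and rules below are illustrative assumptions:

```python
# Sketch of federated computational governance: centrally defined policy
# rules are checked automatically against each domain's product metadata.
# All names and rules are illustrative.
POLICIES = {
    "pii_masking": lambda p: not p["contains_pii"] or p["masked"],
    "has_owner": lambda p: bool(p.get("owner")),
}

def audit(products):
    """Return {product_name: [violated_policy, ...]} for non-compliant products."""
    report = {}
    for p in products:
        violated = [name for name, rule in POLICIES.items() if not rule(p)]
        if violated:
            report[p["name"]] = violated
    return report

products = [
    {"name": "customers", "contains_pii": True, "masked": False, "owner": "crm-team"},
    {"name": "orders", "contains_pii": False, "masked": False, "owner": "sales-team"},
]
print(audit(products))  # {'customers': ['pii_masking']}
```

The Data Office owns the `POLICIES` definitions; each domain only has to keep its product metadata accurate, and compliance is computed rather than inspected manually.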
In an increasingly complex and changing data environment, Data Mesh provides an alternative socio-architectural response to centralized approaches that struggle to scale and meet business needs for data quality and responsiveness.
The data catalog plays a central role in this organization: it provides a central access portal for discovering and sharing data products across the enterprise, enables business domains to easily manage their data products, and delivers the metadata needed to automate the policies required for federated governance.