In recent years, the data management and analytics landscape has witnessed a paradigm shift with the emergence of the Data Mesh framework. Coined by Zhamak Dehghani in 2019, Data Mesh is a framework that emphasizes a decentralized and domain-oriented approach to managing data. One notable discipline in the Data Mesh architecture is to treat data as a product, introducing the concept of “data products”. However, the term “data product” is often tossed around without a clear understanding of its essence. In this article, we will shed light on everything you need to know about data products and data product thinking.
Shifting to Product Thinking
For organizations to treat data as a product and transform their datasets into data products, teams must first adopt a product-thinking mindset. According to J. Majchrzak et al. in Data Mesh in Action:
Product thinking serves as a problem-solving methodology, prioritizing the comprehensive understanding of user needs and the core problem at hand before delving into the product creation process. The primary objective is to narrow the gap between user requirements and the proposed solution.
In their book, they highlight two main principles:
- Love the problem, not the solution: Before embarking on the design phase of a product, it is imperative to gain an understanding of the users and the specific problem being addressed.
- Think in products, not features: While there is a natural inclination to concentrate on adding new features and customizing assets, it is crucial to view data as a product that directly satisfies user needs.
Therefore, before unveiling a dataset, adhering to product thinking involves posing essential questions:
- What is the problem that you want to solve?
- Who will use your data product?
- Why are you doing this? What is the vision behind it?
- What is your strategy? How will you do it?
Here are some examples of answers to these questions from an excerpt of Data Mesh in Action:
What is the problem that you want to solve? Currently, the production cost statement data is used for direct billing between the production team and finance team. The data file also has costs assigned to categories. This information could be used for more complex analysis and cost comparisons across categories of different productions. Therefore, making this data more widely available for complex analysis makes sense.
Who will use your product? The data analyst will use it to manually analyze and compile production costs and forecast budgets for new productions. The data engineer will use it to import data into the analytical solution.
Why are you doing this? What is the vision behind it? We will create a dedicated and customized solution to analyze the data for production costs and planning activities. Data engineers can use the original files to import historical data.
Read the full excerpt here: https://livebook.manning.com/book/data-mesh-in-action/chapter-5/37
Data Product Definition
The philosophy of product thinking therefore urges us to view a data product through a long-term lens, one that entails ongoing development, adaptation based on user feedback, and a commitment to continuous improvement and quality. A product, in turn, can be defined as an object, system, or service made available for consumer use in response to consumer demand. So what makes a data product a data product?
At Zeenea, we define a Data Product as a set of value-driven data assets specifically designed and managed to be consumed quickly and securely while ensuring the highest level of quality, availability, and compliance with regulations and internal policies.
According to Data Mesh in Action, the use of the term “product” in the context of a data mesh is deliberate and stands in contrast to the term “project” commonly used in organizational initiatives. Creating a data product is not the same as running a project. As Sriram Narayan explains in Products Over Projects, projects are temporary endeavors aimed at achieving specific goals, with a defined endpoint that does not necessarily lead to continuity. A product, by contrast, is owned and improved for as long as it has users.
Fundamental Characteristics of a Data Product
In How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, Zhamak Dehghani says a data product must exhibit the following essential characteristics:
Discoverable:
Ensuring the easy discoverability of a data product is imperative. A widely adopted approach involves implementing a registry or data catalog containing comprehensive meta-information such as owners, source of origin, lineage, and sample datasets for all available data products.
This centralized discoverability enables data consumers, engineers, and scientists within an organization to locate datasets of interest effortlessly.
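As a rough illustration of the registry idea, the following minimal sketch keeps catalog entries in memory and supports keyword search. The class names, field names, and sample values are assumptions made for this example; they do not come from Dehghani's article or from any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata describing one data product (field names are illustrative)."""
    name: str
    domain: str
    owner: str
    source: str
    lineage: tuple = ()
    tags: frozenset = frozenset()


class DataCatalog:
    """Toy in-memory registry; a real catalog would persist and index entries."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Use domain/name as the unique key so two domains can reuse a name.
        key = f"{entry.domain}/{entry.name}"
        if key in self._entries:
            raise ValueError(f"{key} is already registered")
        self._entries[key] = entry

    def search(self, keyword):
        """Return entries whose name, domain, or tags contain the keyword."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.domain.lower()
            or any(kw in t.lower() for t in e.tags)
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="production-cost-statements",
    domain="production",
    owner="production-team@example.com",  # hypothetical owner
    source="billing export files",
    lineage=("raw cost files", "categorized costs"),
    tags=frozenset({"finance", "costs"}),
))
matches = catalog.search("costs")
```

Here `matches` contains the single registered entry, so a consumer can read its owner, source, and lineage before requesting access.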
Addressable:
Once discovered, a data product should possess a unique address following a global convention for programmable access. Organizations, influenced by the storage and format of their data, may adopt diverse naming conventions. In pursuit of user-friendly accessibility, common conventions become imperative in a decentralized architecture.
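To make the idea of a global addressing convention concrete, here is a small sketch that builds and validates addresses. The `dp://<domain>/<product>/v<version>` scheme is purely hypothetical; each organization would define its own convention based on how its data is stored and served.

```python
import re
from typing import NamedTuple

# Hypothetical convention: dp://<domain>/<product>/v<major version>
ADDRESS_PATTERN = re.compile(
    r"dp://(?P<domain>[a-z][a-z0-9-]*)/(?P<product>[a-z][a-z0-9-]*)/v(?P<version>\d+)"
)


class ProductAddress(NamedTuple):
    domain: str
    product: str
    version: int


def parse_address(address: str) -> ProductAddress:
    """Split an address into its parts, rejecting anything off-convention."""
    match = ADDRESS_PATTERN.fullmatch(address)
    if match is None:
        raise ValueError(f"not a valid data product address: {address!r}")
    return ProductAddress(match["domain"], match["product"], int(match["version"]))


def build_address(domain: str, product: str, version: int) -> str:
    """Compose an address and validate it against the shared convention."""
    address = f"dp://{domain}/{product}/v{version}"
    parse_address(address)  # raises ValueError if the parts break the convention
    return address


address = build_address("production", "cost-statements", 1)
parsed = parse_address(address)
```

Because every domain builds and parses addresses through the same rules, consumers can access any data product programmatically without learning a per-team naming scheme.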
Trustworthy and Truthful:
Data product owners must commit to Service Level Objectives regarding the truthfulness of data, requiring a shift from traditional error-prone extractions. Employing techniques such as data cleansing and automated integrity testing during the data product’s creation is crucial to ensure an acceptable level of quality.
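An automated integrity test can be as simple as measuring how many records are complete and comparing that ratio with the published objective. The sketch below assumes a completeness SLO of 99% and made-up field names; real objectives and checks would be chosen by the data product owner.

```python
def check_integrity(rows, required_fields, completeness_slo=0.99):
    """Measure the share of rows with all required fields filled in.

    The 0.99 default is an assumed objective, not a recommended value.
    """
    total = len(rows)
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    ratio = complete / total if total else 0.0
    return {
        "total": total,
        "complete": complete,
        "completeness": ratio,
        "meets_slo": ratio >= completeness_slo,
    }


rows = [
    {"production_id": "P-1", "category": "catering", "cost": 120.5},
    {"production_id": "P-1", "category": "transport", "cost": 0},
    {"production_id": "P-2", "category": "", "cost": 80.0},   # missing category
    {"production_id": "P-2", "category": "lighting", "cost": 300.0},
]
report = check_integrity(rows, ["production_id", "category", "cost"])
```

Running such a check each time the data product is built lets the owner block or flag a release that falls below the committed objective.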
Self-Describing Semantics and Syntax:
High-quality data products demand a user experience without the need for handholding—they should be independently discoverable, understandable, and consumable. To construct datasets as products with minimal friction for data engineers and scientists, it is essential to articulate the semantics and syntax of the data thoroughly.
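One way to make a dataset self-describing is to publish a schema that carries both syntax (field names and types) and semantics (descriptions and units) alongside the data. The following sketch is an assumption about how that might look; the field names and the EUR unit are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str          # syntax: column name
    dtype: type        # syntax: expected Python type
    description: str   # semantics: what the value means
    unit: str = ""     # semantics: unit of measure, if any


# Hypothetical schema for a production cost data product
SCHEMA = (
    FieldSpec("production_id", str, "Identifier of the production"),
    FieldSpec("category", str, "Cost category assigned to the line item"),
    FieldSpec("cost", float, "Cost of the line item", unit="EUR"),
)


def describe(schema):
    """Render human-readable documentation straight from the schema."""
    lines = []
    for spec in schema:
        unit = f" [{spec.unit}]" if spec.unit else ""
        lines.append(f"{spec.name} ({spec.dtype.__name__}){unit}: {spec.description}")
    return "\n".join(lines)


def validate(record, schema):
    """Return a list of syntax problems found in one record (empty if none)."""
    errors = []
    for spec in schema:
        if spec.name not in record:
            errors.append(f"missing field: {spec.name}")
        elif not isinstance(record[spec.name], spec.dtype):
            errors.append(f"{spec.name}: expected {spec.dtype.__name__}")
    return errors


documentation = describe(SCHEMA)
problems = validate({"production_id": "P-1", "cost": "120"}, SCHEMA)
```

Since the documentation and the validation both derive from the same schema, what consumers read and what the data actually contains are less likely to drift apart.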
Inter-Operable and Governed by Global Standards:
Correlating data across domains in a distributed architecture relies on adherence to global standards and harmonization rules. Governance on standardizations, including field formatting, polysemes identification, address conventions, metadata fields, and event formats, ensures interoperability and meaningful correlation.
Secure and Governed by Global Access Control:
Securing access to product datasets is imperative, whether the architecture is centralized or decentralized. With decentralized, domain-oriented data products, access control operates at a finer granularity: it is applied to each domain data product individually. As with operational domains, access control policies can be defined centrally but applied dynamically at the moment an individual dataset product is accessed. An Enterprise Identity Management system, often exposed through Single Sign-On (SSO), combined with Role-Based Access Control (RBAC) policies, offers a convenient and effective way to implement access control for product datasets.
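The pattern of central policy definition with per-product enforcement can be sketched in a few lines. The roles, users, and product names below are hypothetical, and the user-to-role mapping stands in for what an identity provider would supply after an SSO login.

```python
# Centrally defined policy: role -> data products the role may read.
# All role and product names here are hypothetical.
POLICIES = {
    "data-analyst": {"production/cost-statements"},
    "data-engineer": {"production/cost-statements", "finance/invoices"},
}

# Stand-in for role claims returned by the identity provider after SSO.
USER_ROLES = {
    "alice": {"data-analyst"},
    "bob": {"data-engineer"},
}


def can_read(user, product):
    """Evaluate the central policy at access time for one data product."""
    roles = USER_ROLES.get(user, set())
    return any(product in POLICIES.get(role, set()) for role in roles)


def read_product(user, product):
    """Gate every read behind the policy check; unknown users are denied."""
    if not can_read(user, product):
        raise PermissionError(f"{user} may not read {product}")
    return f"contents of {product}"  # placeholder for the real data access


alice_result = read_product("alice", "production/cost-statements")
```

Each data product calls the same check when it is accessed, so a policy change made centrally takes effect across all domains without touching individual products.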
Examples of Data Products
A data product can take various forms, with different data representations that offer value to users. Here are several examples of data products and the systems built around them:
- Recommendation Engines: Platforms like Netflix, Amazon, and Spotify use recommendation engines as data products to suggest content or products based on user behavior and preferences.
- Predictive Analytics Models: Models predicting customer churn, sales forecasts, or equipment failures are examples of data products that provide valuable insights for decision-making.
- Fraud Detection Systems: Financial institutions deploy data products to detect and prevent fraudulent activities by analyzing transaction patterns and identifying anomalies.
- Personalized Marketing Campaigns: Targeted advertising and personalized marketing campaigns utilize data products to tailor content based on user demographics, behavior, and historical interactions.
- Healthcare Diagnostics Tools: Diagnostic tools analyze medical data, such as patient records and test results, to assist healthcare professionals in making accurate diagnoses.