The Role of Data Catalogs in Accelerating AI Initiatives

In today’s data-driven landscape, organizations increasingly rely on AI to gain insights, drive innovation, and maintain a competitive edge. Indeed, AI technologies, including machine learning, natural language processing, and predictive analytics, are transforming how businesses operate, enabling them to make smarter decisions, automate processes, and uncover new opportunities. However, the success of AI initiatives depends significantly on the quality, accessibility, and efficient management of data.

This is where the implementation of a data catalog plays a crucial role.

By facilitating data governance, discoverability, and accessibility, data catalogs enable organizations to harness the full potential of their AI projects, ensuring that AI models are built on a solid foundation of accurate and well-curated data.

First: What is a data catalog?

 

A data catalog is a centralized repository that stores metadata—data about data—allowing organizations to manage their data assets more effectively. Metadata is automatically collected and scanned from various data sources, enabling catalog users to search for data and see information such as the availability, freshness, and quality of each data asset.

The data catalog has therefore become a standard tool for efficient metadata management and data discovery. At Zeenea, we broadly define a data catalog as being:

A detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose.

How does implementing a data catalog boost AI initiatives in organizations?

 

Now that we’ve briefly defined what a data catalog is, let’s discover how data catalogs can significantly boost AI initiatives in organizations:

Enhanced Data Discovery

 

The success of AI models is determined by the ability to access and utilize large, diverse datasets that accurately represent the problem domain. A data catalog enables this success by offering robust search and filtering capabilities, allowing users to quickly find relevant datasets based on criteria such as keywords, tags, data sources, and any other semantic information provided. These Google-esque search features enable data users to efficiently navigate the organization’s data landscape and find the assets they need for their specific use cases.

For example, a data scientist working on a predictive maintenance model for manufacturing equipment can use a data catalog to locate historical maintenance records, sensor data, and operational logs. This enhanced data discovery is crucial for AI projects, as it enables data scientists to identify and retrieve the most appropriate datasets for training and validating their models.
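
To make this concrete, here is a minimal, illustrative sketch in Python of the kind of keyword- and tag-based filtering a catalog search performs. It is not Zeenea’s implementation; the entry names, tags, and fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set[str] = field(default_factory=set)
    source: str = ""

# Hypothetical entries a data scientist might find for predictive maintenance.
catalog = [
    CatalogEntry("maintenance_records", "Historical maintenance interventions per machine",
                 {"maintenance", "manufacturing"}, "erp"),
    CatalogEntry("sensor_telemetry", "Vibration and temperature sensor readings",
                 {"iot", "manufacturing"}, "historian"),
    CatalogEntry("sales_orders", "Customer sales orders", {"sales"}, "crm"),
]

def search(entries, keyword=None, tags=None, source=None):
    """Return entries matching a keyword, a set of tags, and/or a source."""
    results = entries
    if keyword:
        kw = keyword.lower()
        results = [e for e in results if kw in e.name.lower() or kw in e.description.lower()]
    if tags:
        results = [e for e in results if set(tags) <= e.tags]
    if source:
        results = [e for e in results if e.source == source]
    return results

print([e.name for e in search(catalog, keyword="sensor")])        # ['sensor_telemetry']
print([e.name for e in search(catalog, tags={"manufacturing"})])  # both manufacturing datasets
```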

 

💡The Zeenea difference: Get highly personalized discovery experiences with Zeenea! Our platform offers data consumers a unique discovery experience via personalized exploratory paths, taking the user’s profile into account when ranking results in the catalog. Our algorithms also deliver smart recommendations and suggestions on your assets, day after day.

 

View our data discovery features.

Improved Data Quality and Trustworthiness

 

The underlying data must be of high quality for AI models to deliver accurate and reliable results. High-quality data is crucial because it directly impacts the model’s ability to learn and make predictions that reflect real-world scenarios. Poor-quality data can lead to incorrect conclusions and unreliable outputs, negatively affecting business decisions and outcomes.

A data catalog typically includes features for data profiling and data quality assessment. These features help identify data quality issues such as missing values, inconsistencies, and outliers, which can skew AI model results. By ensuring that only clean and trustworthy data is used in AI initiatives, organizations can enhance the reliability and performance of their AI models.
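
As a rough illustration of what such profiling checks look for, the following Python sketch (using pandas, with hypothetical column names and values) flags missing values, duplicate rows, and outliers. It is a conceptual example, not Zeenea’s profiling engine.

```python
import pandas as pd

# Hypothetical sensor readings with typical quality issues.
df = pd.DataFrame({
    "machine_id": ["M1", "M2", "M2", None, "M4"],
    "temperature_c": [72.5, 68.0, 68.0, 71.2, 950.0],  # 950.0 is a likely outlier
})

# 1. Missing values per column
missing = df.isna().sum()

# 2. Duplicate rows (a common source of inconsistency)
duplicates = int(df.duplicated().sum())

# 3. Outliers via the interquartile range (IQR) rule
q1, q3 = df["temperature_c"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temperature_c"] < q1 - 1.5 * iqr) | (df["temperature_c"] > q3 + 1.5 * iqr)]

print(missing, duplicates, outliers, sep="\n")
```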

 

💡The Zeenea difference: Zeenea uses GraphQL and knowledge graph technologies to provide a flexible approach to integrating best-of-breed data quality solutions into our catalog. Sync the datasets of your third-party DQM tools via simple API operations. Our powerful Catalog API capabilities will automatically update any modifications made in your tool directly within our platform.

 

View our data quality features.

Improved data governance and compliance

 

Data governance is critical for maintaining data integrity, security, and compliance with regulatory requirements. It involves the processes, policies, and standards that ensure data is managed and used correctly throughout its lifecycle. Stringent regulations such as the GDPR in Europe and the CCPA in California are examples of the requirements organizations must adhere to.

In addition, data governance promotes transparency, accountability, and traceability of data, making it easier for stakeholders to spot errors and mitigate risks associated with flawed or misrepresented AI insights before they negatively impact business operations or damage the organization’s reputation. Data catalogs support these governance initiatives by providing detailed metadata, including data lineage, ownership, and usage policies.

For AI initiatives, robust data governance means data can be used responsibly and ethically, minimizing data breaches and non-compliance risks. This protects the organization legally and ethically and builds trust with customers and stakeholders, ensuring that AI initiatives are sustainable and credible.
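
A minimal sketch of the kind of metadata record a catalog might hold for governance purposes is shown below. The field names and values are hypothetical and do not represent Zeenea’s data model.

```python
from dataclasses import dataclass, field

@dataclass
class AssetMetadata:
    name: str
    owner: str
    # Upstream assets this one is derived from (a simple view of lineage).
    upstream: list[str] = field(default_factory=list)
    # Usage policies attached to the asset (e.g. GDPR-related restrictions).
    policies: list[str] = field(default_factory=list)

churn_features = AssetMetadata(
    name="customer_churn_features",
    owner="marketing-analytics-team",
    upstream=["crm.customers", "billing.invoices"],
    policies=["contains_personal_data", "gdpr_restricted"],
)

# A stakeholder can trace where the data comes from and how it may be used.
print(churn_features.upstream, churn_features.policies)
```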

 

💡The Zeenea difference: Zeenea guarantees regulatory compliance by automatically identifying, classifying, and managing personal data assets at scale. Through smart recommendations, our solution detects personal information and suggests which assets to tag – ensuring that information about data policies and regulations is clearly communicated to all data consumers within the organization in their daily activities.

 

View our data governance features.

Collaboration and knowledge sharing

 

AI projects often involve cross-functional teams, including data scientists, engineers, analysts, and business stakeholders. Data catalogs are pivotal in promoting collaboration by serving as a shared platform where team members can document, share, and discuss data assets. Features such as annotations, comments, and data ratings enable users to contribute their insights and knowledge directly within the data catalog. This functionality fosters a collaborative environment where stakeholders can exchange ideas, provide feedback, and iterate on data-related tasks.

For example, data scientists can annotate datasets with information about data quality or specific characteristics relevant to machine learning models. Engineers can leave comments regarding data integration requirements or technical considerations. Analysts can rate the relevance or usefulness of different datasets based on their analytical needs.

 

💡The Zeenea difference: Zeenea provides discussion tabs for each catalog object, facilitating effective communication between Data Stewards and data consumers regarding their data assets. Soon, data users will also be able to provide suggestions regarding the content of their assets, ensuring continuous improvement and maintaining the highest quality of data documentation within the catalog.

Common understanding of enterprise-wide AI terms

 

Data catalogs often incorporate a business glossary, a centralized repository for defining and standardizing business terms and data & AI definitions across an organization. A business glossary enhances alignment between business stakeholders and data practitioners by establishing clear definitions and ensuring consistency in terminology.

This clarity is essential in AI initiatives, where precise understanding and interpretation of data are critical for developing accurate models. For example, a well-defined business glossary allows data scientists to quickly identify and utilize the right data sets for training AI models, reducing the time spent on data preparation and increasing productivity. By facilitating a common understanding of data across departments, a business glossary accelerates AI development cycles and empowers organizations to derive meaningful insights from their data landscape.

 

💡The Zeenea difference: Zeenea provides data management teams with a unique place to create their categories of semantic concepts, organize them in hierarchies, and configure the way glossary items are mapped with technical assets.

 

View our Business Glossary features.

In conclusion

 

In the rapidly evolving landscape of AI-driven decision-making, data catalogs have emerged as indispensable tools for organizations striving to leverage their data assets effectively. They ensure that AI initiatives are built on a foundation of high-quality, well-governed, well-documented data, which is essential for achieving accurate insights and sustainable business outcomes.

As organizations continue to invest in AI capabilities, adopting robust data catalogs will play a pivotal role in maximizing the value of data assets, driving innovation, and maintaining competitive advantage in an increasingly data-centric world.

[SERIES] Data Shopping Part 1 – How to Shop for Data Products

Just as shopping for goods online involves selecting items, adding them to a cart, and choosing delivery and payment options, the process of acquiring data within organizations has evolved in a similar manner. In the age of data products and data mesh, internal data marketplaces enable business users to search for, discover, and access data for their use cases.

In this series of articles, get an excerpt from our Practical Guide to Data Mesh and discover all there is to know about data shopping as well as Zeenea’s Data Shopping experience in its Enterprise Data Marketplace:

  1. How to shop for data products
  2. The Zeenea Data Shopping experience

 

—–

 

As mentioned above, all classic marketplaces offer a very similar “checkout” experience, which is familiar to many people. The selected products are placed in a cart, and then, when validating the cart, the buyer is presented with various delivery and payment options.

The actual delivery usually takes place outside the marketplace, which provides tracking functionalities. Delivery can be immediate (for digital products) or deferred (for physical products). Some marketplaces have their own logistics system, but most of the time, delivery is the responsibility of the seller. Delivery time is an important element of customer satisfaction – the shorter it is, the more satisfied users are.

How does this shopping experience translate into an Enterprise Data Marketplace? To answer this question, we need to consider what data delivery means in a business context and, for that, focus on the data consumer.

The delivery of data products

 

A data product offers one or more consumption protocols – these are its outbound ports. These protocols may vary from one data product to another, depending on the nature of the data – real-time data, for example, may offer a streaming protocol, while more static data may offer an SQL interface (and instructions for using this interface from various programming languages or in-house visualization tools).

For interactive consumption needs, such as in an application, the data product may also offer consumption APIs, which in turn may adhere to a standard (REST, GraphQL, OData, etc.), or it may simply allow the data to be downloaded in a file format.

Some consumers may integrate the data product into their own pipelines to build other data products or higher-level uses. Others may simply consume the data once, for example, to train an ML model. It is up to them to choose the protocol best suited to their use case.
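
As an illustration, a data product descriptor might declare its outbound ports roughly as follows. This is a hypothetical, simplified structure – the field names and values are illustrative and do not follow any particular specification.

```python
# A hypothetical data product descriptor with several output ports.
data_product = {
    "name": "production-costs",
    "version": "2.1.0",
    "owner": "finance-domain",
    "output_ports": [
        {"type": "sql", "dialect": "postgres", "table": "finance.production_costs"},
        {"type": "rest", "url": "https://api.example.com/data-products/production-costs"},
        {"type": "file", "format": "parquet", "path": "s3://dp/finance/production-costs/"},
        {"type": "stream", "protocol": "kafka", "topic": "finance.production-costs.v2"},
    ],
}

def ports_of_type(product: dict, port_type: str) -> list[dict]:
    """Let a consumer pick the output port that matches their use case."""
    return [p for p in product["output_ports"] if p["type"] == port_type]

print(ports_of_type(data_product, "sql"))
```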

Whatever protocols are chosen, they all have one essential characteristic: they are secure. This is one of the universal rules of governance – access to data must be controlled, and access rights supervised.

With few exceptions, the act of purchase therefore simply involves gaining access to the data via one of the consumption protocols.

Access Rights Management for Data Products

 

However, in the world of data, access management is not a simple matter, and for one elementary reason: consuming data is a risky act.

Some data products can be desensitized – that is, stripped of the personal or sensitive data that poses the greatest risk. But this desensitization cannot be applied to the entire product portfolio: otherwise, the organization forfeits the opportunity to leverage data that is nonetheless highly valuable (such as sensitive financial or HR data, commercial data, market data, customer personal data, etc.). In one way or another, access control is therefore a critical activity for the development and widespread adoption of the data mesh.

In the logic of decentralization of the data mesh, risk assessment and granting access tokens should be carried out by the owner of the data product, who ensures its governance and compliance. This involves not only approving the access request but also determining any data transformations needed to conform to a particular use. This activity is known as policy enforcement.

Evaluating an access request involves analyzing three dimensions:

  • The data themselves (some carry more risk than others) – the what.
  • The requester, their role, and their location (geographical aspects can have a strong impact, especially at the regulatory level) – the who.
  • The purpose – the why.

Based on this analysis, the data may be consumed as is, or they may require transformation before delivery (data filtering, especially for data not covered by consent, anonymization of certain columns, obfuscation of others, etc.). Sometimes, additional formalities may need to be completed – for example, joining a redistribution contract for data acquired from a third party, or complying with retention and right-to-forget policies, etc.
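
A highly simplified sketch of this evaluation logic, covering the what/who/why dimensions and the resulting transformations, could look like the following Python example. The sensitivity register, roles, regions, and transformation labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    product: str          # the "what"
    requester_role: str   # the "who"
    requester_region: str
    purpose: str          # the "why"

# Hypothetical sensitivity register maintained by the data product owner.
SENSITIVITY = {"customer-orders": "personal", "plant-telemetry": "internal"}

def evaluate(request: AccessRequest) -> dict:
    """Return a decision plus the transformations required before delivery."""
    sensitivity = SENSITIVITY.get(request.product, "internal")
    transformations = []

    if sensitivity == "personal":
        # In this sketch, requesters outside the EU only get anonymized identity columns.
        if request.requester_region != "EU":
            transformations.append("anonymize:customer_name,email")
        # Marketing uses only see rows covered by consent.
        if request.purpose == "marketing":
            transformations.append("filter:consent = true")

    return {"decision": "granted", "transformations": transformations}

print(evaluate(AccessRequest("customer-orders", "analyst", "US", "marketing")))
```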

Technically, data delivery can take various forms depending on the technologies and protocols used to expose them.

For less sensitive data, simply granting read-only access may suffice – this involves simply declaring an additional user. For sensitive data, fine-grained permission control is necessary, at the column and row levels. Most modern data platforms support native mechanisms to apply complex access rules through simple configuration – usually using data tags and a policy enforcement engine. Setting up access rights involves creating the appropriate policy or integrating a new consumer into an existing policy. For older technologies that do not support sufficiently granular access control, it may be necessary to create a specific pipeline to transform the data to ensure compliance, store them in a dedicated space, and grant the consumer access to that space.

This is, of course, a lengthy and potentially costly approach, which can be optimized by migrating to a data platform supporting a more granular security model or by investing in a third-party policy enforcement solution that supports the existing platform.
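
To illustrate the tag-based approach mentioned above, here is a minimal Python sketch in which column tags drive masking for consumers who lack the corresponding clearance. The tags, policy, and data are hypothetical; real platforms apply such rules natively through their policy enforcement engines rather than in application code.

```python
import pandas as pd

# Hypothetical column tags and the masking rule a policy engine would apply.
COLUMN_TAGS = {"email": "pii", "iban": "pii", "amount": "public"}
POLICY = {"pii": "mask"}  # consumers without the right clearance see masked PII

df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "iban": ["FR761234", "DE895678"],
    "amount": [120.0, 75.5],
})

def enforce(frame: pd.DataFrame, clearance: set[str]) -> pd.DataFrame:
    """Apply column-level masking based on tags and the consumer's clearance."""
    out = frame.copy()
    for column, tag in COLUMN_TAGS.items():
        if POLICY.get(tag) == "mask" and tag not in clearance and column in out:
            out[column] = "***"
    return out

# A consumer without the "pii" clearance sees masked columns.
print(enforce(df, clearance=set()))
```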

Data Shopping in an Internal Data Marketplace

 

In the end, in a data marketplace, data delivery, which is at the heart of the consumer experience, translates into a more or less complex workflow, but its main stages are as follows:

  • The consumer submits an access request – describing precisely their intended use of the data.
  • The data owner evaluates this request – in some cases, they may rely on risk or regulatory experts or require additional validations – and determines the required access rules.
  • An engineer in the domain or in the “Infra & Tooling” team sets up the access – this operation can be more or less complex depending on the technologies used.

Shopping for the consumer involves triggering this workflow from the marketplace.
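
A bare-bones sketch of this workflow as a small state machine might look like the following; the states and transitions are a simplification of the three stages described above, plus a rejection path.

```python
from enum import Enum, auto

class RequestState(Enum):
    SUBMITTED = auto()      # the consumer describes the intended use
    UNDER_REVIEW = auto()   # the data owner (and possibly risk experts) evaluate it
    PROVISIONING = auto()   # an engineer sets up the access
    DELIVERED = auto()      # the consumer can use the data product
    REJECTED = auto()       # the request was denied during review

# Allowed transitions in this simplified workflow.
TRANSITIONS = {
    RequestState.SUBMITTED: {RequestState.UNDER_REVIEW},
    RequestState.UNDER_REVIEW: {RequestState.PROVISIONING, RequestState.REJECTED},
    RequestState.PROVISIONING: {RequestState.DELIVERED},
    RequestState.DELIVERED: set(),
    RequestState.REJECTED: set(),
}

def advance(current: RequestState, nxt: RequestState) -> RequestState:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {nxt.name}")
    return nxt

state = RequestState.SUBMITTED
state = advance(state, RequestState.UNDER_REVIEW)
state = advance(state, RequestState.PROVISIONING)
print(state.name)  # PROVISIONING
```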

For Zeenea’s marketplace, we have chosen not to integrate this workflow directly into the solution but rather to interface with external solutions.

In our next article, discover the Zeenea Data Shopping experience and the technological choices that set us apart.

The Practical Guide to Data Mesh: Setting up and Supervising an enterprise-wide Data Mesh

 

Written by Guillaume Bodet, co-founder & CPTO at Zeenea, our guide was designed to arm you with practical strategies for implementing data mesh in your organization, helping you:

✅ Start your data mesh journey with a focused pilot project
✅ Discover efficient methods for scaling up your data mesh
✅ Acknowledge the pivotal role an internal marketplace plays in facilitating the effective consumption of data products
✅ Learn how Zeenea emerges as a robust supervision system, orchestrating an enterprise-wide data mesh

What is Data Sharing: benefits, challenges, and best practices

In the ever-evolving data and digital landscape, data sharing has become essential to drive business value. Indeed, across industries and domains, organizations and individuals are harnessing the power of data sharing to unlock insights, drive collaboration, and fuel growth. By exchanging diverse enterprise data products, stakeholders can gain valuable perspectives, uncover hidden trends, and make informed decisions that drive tangible impact.

However, the landscape of data sharing has its complexities and challenges. From ensuring data security and privacy to navigating regulatory compliance, stakeholders must navigate many considerations to foster a culture of responsible data sharing.

In this article, learn everything you need to know about data sharing and how an Enterprise Data Marketplace can enhance your internal data sharing initiatives.

The definition of data sharing

 

Data sharing, as its name implies, refers to the sharing of data among diverse stakeholders. Beyond the act of sharing itself, data sharing entails a commitment to maintaining the integrity and reliability of the shared data throughout its lifecycle. This means not only making data accessible to all stakeholders but also ensuring that it retains its quality, coherence, and usefulness for the processing and analysis by data consumers. A crucial part of this process involves data producers carefully documenting and labeling sets of data, including providing detailed descriptions and clear definitions so that others can easily find, discover, and understand the shared data.

In addition, data sharing implies making data accessible to the relevant individuals, domains, or organizations using robust access controls and permissions. This ensures that only authorized personnel can access specific data sets, thus adhering to regulatory compliance demands and mitigating risks associated with breaches and data misuse.

Internal vs. External data sharing

 

In the landscape of modern business operations, we must distinguish between internal and external data sharing, which represent two different approaches organizations use to disseminate information.

Internal data sharing is all about the exchange of information within the confines of an organization. The focus is on breaking down silos and ensuring that all parts of the organization can access the data they need, when they need it, within a secure environment. Internal sharing can be facilitated with an enterprise data marketplace, but we’ll come to this later.

External data sharing, in contrast, extends beyond the organization’s boundaries to include partners, clients, suppliers, and regulatory bodies. Given its nature, external data sharing is subject to stricter regulatory compliance and security measures, necessitating robust protocols to protect sensitive information and maintain trust between the organization and its external stakeholders.

The benefits of data sharing

 

Data sharing entails many benefits for organizations. Some of them include:

Increase collaboration

 

By facilitating data sharing within your enterprise, you foster improved collaboration among internal teams, partners, and different branches of your organization. When companies share pertinent information, all stakeholders benefit from a deeper understanding of critical aspects such as market trends, customer preferences, successful strategies, and insightful analyses. This shared data empowers teams to collaborate more effectively on joint projects, research endeavors, and development initiatives.

In addition, through the exchange of data both internally and externally, organizations can collectively explore innovative ideas and alternative approaches, drawing insights and expertise from diverse sources. This collaborative environment nurtures a culture of experimentation and creativity, ultimately driving the generation of solutions and advancements across a spectrum of industries and domains.

Finally, one real-life example of the benefits of external data sharing can be seen in the healthcare industry through initiatives like Health Information Exchanges (HIEs). HIEs are networks that facilitate the sharing of electronic health records among healthcare providers, hospitals, clinics, and other medical facilities. By sharing patient information securely and efficiently, HIEs enable healthcare providers to access comprehensive medical histories, diagnostic test results, medication lists, and other vital information about patients, regardless of where they received care.

Boost productivity

 

Data sharing significantly boosts productivity by facilitating access to critical information. When organizations share data internally among teams or externally with partners and stakeholders, it eliminates silos and enables employees to access relevant information quickly and efficiently. This eradicates the laborious endeavor of digging through disparate systems or awaiting data retrieval from others.

Moreover, data sharing works against duplicate and redundant information by fostering awareness of existing data assets, dashboards, and other enterprise data products through shared knowledge. By minimizing redundant tasks, data sharing not only diminishes errors but also optimizes resource allocation, empowering teams to concentrate on value-added initiatives.

Enhance data trust & quality

 

Data sharing plays a critical role in improving data trust and quality in various ways. When data is shared among different stakeholders, it undergoes thorough validation and verification processes. This scrutiny by multiple parties allows for the identification of inconsistencies, errors, or inaccuracies, ultimately leading to enhancements in data accuracy and reliability.

Furthermore, shared data encourages peer review and feedback, facilitating collaborative efforts to refine and improve the quality of the information. This ongoing iterative process instills confidence in the precision and dependability of the shared data.

Additionally, data sharing often involves adhering to standardized protocols and quality standards. Through the standardization of formats, definitions, and metadata, organizations ensure coherence and consistency across datasets, thereby maintaining data quality and enabling interoperability.

Finally, within established data governance frameworks, data sharing initiatives establish clear policies, procedures, and best practices for responsible data management. Robust auditing and monitoring mechanisms are employed to track data access and usage, empowering organizations to enforce access controls and uphold data integrity with confidence.

The challenges of data sharing

Massive volumes of data

 

Sharing large datasets over networks can pose significant challenges due to the time-consuming nature of the process and the demand for substantial bandwidth. This often leads to slow transfer speeds and potential congestion on the network. Additionally, storing massive volumes of shared data requires extensive storage capacity and infrastructure resources. Organizations must allocate sufficient storage space to accommodate large datasets, which can result in increased storage costs and infrastructure investments.

Moreover, processing and analyzing massive volumes of shared data can strain computational resources and processing capabilities. To effectively manage the complexity and scale of large datasets, organizations must deploy robust data processing frameworks and scalable computing resources. These measures are essential for ensuring efficient data analysis and interpretation while navigating the intricacies of vast datasets.

Robust security measures

 

Ensuring data security poses a significant challenge in the realm of data sharing, demanding careful attention and robust protective measures to safeguard sensitive information effectively. During data sharing processes, information traversing networks and platforms becomes vulnerable to various security threats, including unauthorized access attempts, data breaches, and malicious cyber-attacks. To uphold the confidentiality, integrity, and availability of shared data, stringent security protocols, encryption mechanisms, and access controls must be implemented across all aspects of data sharing initiatives.

Compliance requirements

 

Another notable challenge of data sharing is maintaining data privacy and compliance with regulatory requirements. As organizations share data with external partners, stakeholders, or third-party vendors, they must navigate complex privacy laws and regulations governing the collection, storage, and sharing of personal or sensitive information. Compliance with regulations such as GDPR in the European Union, HIPAA (Health Insurance Portability and Accountability Act) in the healthcare industry, and CCPA (California Consumer Privacy Act) in California is crucial to avoid legal liabilities and penalties.

Data sharing best practices

 

To counter these challenges, here are some best practices:

Implement clear governance policies

 

Establishing clear data governance policies is crucial for enabling effective data sharing within organizations. These policies involve defining roles, responsibilities, and procedures for managing, accessing, and sharing data assets. By designating data stewards, administrators, and users with specific responsibilities, organizations ensure accountability and oversight throughout the data lifecycle.

Moreover, standardized procedures for data collection, storage, processing, and archival play a pivotal role in promoting consistency and efficiency in data governance practices. By standardizing these procedures, organizations can ensure that data is handled consistently and systematically across departments and teams.

Define data sharing protocols

 

Defining clear protocols and guidelines for data sharing within and outside the organization is vital for promoting transparency, accountability, and compliance.

Organizations must establish precise criteria and conditions for data sharing, including defining the purposes, scope, and intended recipients of shared data. Any limitations or restrictions on data usage, redistribution, or modification should be clearly outlined to ensure alignment with organizational objectives and legal mandates. The implementation of encryption, access controls, and data anonymization techniques ensures the secure transmission and storage of shared data, enhancing overall data security measures.

Furthermore, the development of formal data sharing agreements and protocols is essential for governing data exchange activities with external partners or stakeholders. These agreements delineate the rights, responsibilities, and obligations of each party involved in the data sharing process, covering aspects such as data ownership, confidentiality, intellectual property rights, and liability.

Implement a data marketplace

 

A data marketplace serves as a centralized hub where organizations can easily share and access data resources. By consolidating diverse datasets from various sources, it streamlines the process of discovering and acquiring relevant data.

Moreover, a data marketplace fosters collaboration and innovation by connecting data providers with consumers across different industries. Organizations can effortlessly share their data assets on the marketplace, while data consumers gain access to a vast array of data to enrich their insights and strategies.

In addition, a data marketplace prioritizes data governance and compliance by upholding standards and regulations related to data privacy, security, and usage. It provides robust tools and features for managing data access, permissions, and consent, ensuring that data sharing activities align with legal and regulatory requirements.

Start your data sharing journey with Zeenea

 

Zeenea provides internal data sharing capabilities through its Enterprise Data Marketplace (EDM), where each domain within the organization manages its own dedicated federated data catalog, providing the flexibility to share key objects such as Data Products, AI models, dashboards, Glossaries, and more with the rest of the organization. Our platform empowers data producers to seamlessly administer their catalog, users, and permissions, and to identify the objects they wish to share with other data domains.

Why is a Data Catalog essential for Data Product Management?

Data Mesh is one of the hottest topics in the data space. In fact, according to a recent BARC survey, 54% of companies are planning to implement or are already implementing Data Mesh. Implementing a Data Mesh architecture in your enterprise means incorporating a domain-centric approach to data and treating data as a product. Data Product Management is therefore crucial in the Data Mesh transformation process. A 2024 Eckerson Group survey found that 70% of organizations have implemented or are in the process of implementing Data Products.

However, many companies are struggling to manage, maintain, and get value out of their data products. Indeed, successful Data Product Management requires establishing the right people, processes, and technologies. One of those essential technologies is a data catalog.

In this article, discover how a data catalog empowers data product management in data-driven companies.

Quick definition of a Data Product

 

In a previous article on Data Products, we detailed the definition and characteristics of Data Products. At Zeenea, we define a Data Product as being:

“A set of value-driven data assets specifically designed and managed to be consumed quickly and securely while ensuring the highest level of quality, availability, and compliance with regulations and internal policies.”

Let’s get a refresher on the characteristics of a Data Product. According to Zhamak Dehghani, the Data Mesh guru, to deliver the best user experience for data consumers, data products need to have the following basic qualities:

  • Discoverable
  • Addressable
  • Trustworthy and truthful
  • Self-describing semantics and syntax
  • Inter-operable and governed by global standards
  • Secure and governed by a global access control

How can you ensure your sets of data meet the criteria for becoming a functional and value-driven Data Product? This is where a data catalog comes in.

What exactly is a data catalog?

 

Many definitions exist of what a data catalog is. At Zeenea, we define it as “A detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose.” Basically, a data catalog’s goal is to create a comprehensive library of all company data assets, including their origins, definitions, and relations to other data. And like a catalog for books in a library, data catalogs make it easy to search, find, and discover data.

Therefore, in an ecosystem where volumes of data are multiplying and changing by the second, it is crucial to implement a data cataloging solution – a data catalog answers the who, what, when, where, and why of your data.

But, how does this relate to data products? As mentioned in our previous paragraph, data products have fundamental characteristics that they must meet to be considered data products. Most importantly, they must be understandable, accessible, and made available for consumer use. Therefore, a data catalog is the perfect solution for creating and maintaining data products.

View our Data Catalog capabilities

A data catalog makes data products discoverable

 

A data catalog collects, indexes, and updates data and metadata from all data sources into a unique repository. Via an intuitive search bar, data catalogs make it simple to find data products by typing simple keywords.

In Zeenea, our data catalog enables data users to not only find their data products but to fully discover their context, including their origin and transformations over time, their owners, and most importantly, which other assets they are linked to for a 360° data discovery. Zeenea was designed so users can always discover their data products, even if they don’t know what they are searching for. Indeed, our platform offers unique and personalized exploratory paths so users can search and find the information they need in just a few clicks.

View our Data Discovery capabilities

A data catalog makes data products addressable

 

Once a data consumer has found the data product, they must be able to access it or request access to it in a simple, easy, and efficient way. Although a data catalog doesn’t play a direct role in addressability, it certainly can facilitate and automate part of the work. An automated Data Catalog solution plugs into policy enforcement solutions, accelerating data access (if the user has the appropriate permissions).

A data catalog makes data products trustworthy

 

At Zeenea, we strongly believe that a data catalog is not a data quality tool. However, our catalog solution automatically retrieves and updates quality indicators from third-party data quality management systems. With Zeenea, users can view their quality metrics via a user-friendly graph and instantly identify the quality checks that were performed, their quantity, and whether they passed, failed, or issued warnings. In addition, our Lineage capabilities provide statistical information on the data and reconstruct the lineage of the data product, making it easy to understand the origin and the various transformations over time. These features combined increase trust in data and ensure data users are always working with accurate data products.

View our Data Compliance capabilities

A data catalog makes data products understandable

 

One of the most significant roles of a data catalog is to provide all the context necessary to understand the data. By efficiently documenting data, with both technical and business documentation, data consumers can easily comprehend the nature of their data and draw conclusions from their analyses. In Zeenea, Data Stewards can easily create documentation templates for their Data Products and thoroughly document them, including detailed descriptions, associating Glossary Items, relationships with other Data Products, and more. By delivering a structured and transparent view of your data, Zeenea’s data catalog promotes the autonomous use of Data Products by data consumers in the organization.

View our Data Stewardship Capabilities

A data catalog enables data product interoperability

 

With comprehensive documentation, a data catalog facilitates data product integration across various systems and platforms. It provides a clear view of data product dependencies and relationships between different technologies, ensuring the sharing of standards across the organization. In addition, a data catalog maintains a unified metadata repository, containing standardized definitions, formats, and semantics for various data assets. In Zeenea, our platform is built on powerful knowledge graph technology that automatically identifies, classifies, and tracks data products based on contextual factors, mapping data assets to meet the standards defined at the enterprise level.
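
As a conceptual illustration of the knowledge-graph idea, the sketch below represents catalog assets as nodes linked by typed relationships and walks those links to answer a question. The node names and relation types are hypothetical and far simpler than a production knowledge graph.

```python
# A tiny adjacency-list graph; real knowledge graphs use dedicated stores and richer models.
graph = {
    "dp:customer-orders": [
        ("defined_by", "glossary:Customer"),
        ("feeds", "dashboard:monthly-revenue"),
        ("stored_in", "platform:snowflake"),
    ],
    "glossary:Customer": [("governed_by", "standard:customer-id-format")],
}

def neighbors(node, relation=None):
    """Follow the graph to find related assets, optionally filtered by relation type."""
    return [target for rel, target in graph.get(node, []) if relation is None or rel == relation]

# Which standard applies to the glossary term linked to this data product?
term = neighbors("dp:customer-orders", "defined_by")[0]
print(neighbors(term, "governed_by"))  # ['standard:customer-id-format']
```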

View our Knowledge Graph capabilities

A data catalog enables data product security

 

A data catalog typically includes robust access control mechanisms that allow organizations to define and manage user permissions. This ensures that only authorized personnel have access to sensitive metadata, reducing the risk of unauthorized access or breaches. In Zeenea, you create a secure data catalog, where only the right people can act on a data product’s documentation.

View our Permission management model

Start managing Data Products in Zeenea

 

Interested in learning more about how Data Product Management works in Zeenea? Get a 30-minute personalized demo with one of our experts now!

In the meantime, check out our Data Product Management feature note

 

 

5 Reasons to Enhance Your Data Catalog with an Enterprise Data Marketplace (EDM)

Over the past decade, Data Catalogs have emerged as important pillars in the landscape of data-driven initiatives. However, many vendors on the market fall short of expectations with lengthy timelines, complex and costly projects, bureaucratic Data Governance models, poor user adoption rates, and low-value creation. This discrepancy extends beyond metadata management projects, reflecting a broader failure at the data management level.

The present situation reveals a disconnection between technical proficiency and business knowledge, a lack of collaboration between data producers and consumers, persistent data latency and quality issues, and unmet scalability of data sources and use cases. Despite substantial investments in both personnel and technology, companies find themselves grappling with a stark reality – the failure to adequately address business needs.

The good news, however, is that this predicament can be reversed by embracing an Enterprise Data Marketplace (EDM) and leveraging existing investments.

Introducing the Enterprise Data Marketplace

 

An EDM is not a cure-all, but rather a transformative solution. It requires companies to reframe their approach to data, introducing a new entity – Data Products. A robust Data Mesh, as advocated by Zhamak Dehghani in her insightful blog post, becomes imperative, with the EDM serving as the experiential layer of the Data Mesh.

However, the landscape has evolved with a new breed of EDM – a Data Sharing Platform integrated with a robust federated Data Catalog:

 

EDM = Data Sharing Platform + Strong Data Catalog

 

This is precisely what Zeenea accomplishes, and plans to enhance further, with our definition of an EDM:

An Enterprise Data Marketplace is an e-commerce-like solution, where Data Producers publish their Data Products, and Data Consumers explore, understand, and acquire these published Data Products.

The Marketplace operates atop a Data Catalog, facilitating the sharing and exchange of the most valuable Domain Data packaged as Data Products.

Why complement your Data Catalog with an Enterprise Data Marketplace?

 

We’ve compiled 5 compelling reasons to enhance your Data Catalog with an Enterprise Data Marketplace.

Reason #1: Streamline the Value Creation Process

 

By entrusting domains with the responsibility of creating Data Products, you unlock the wealth of knowledge possessed by business professionals and foster a more seamless collaboration with Data Engineers, Data Scientists, and Infrastructure teams. Aligned with shared business objectives, the design, creation, and maintenance of valuable, ready-to-use Data Products will collectively adopt a Product Design Thinking mindset.

Within this framework, teams autonomously organize themselves, streamlining ceremonies for the incremental delivery of Data Products, bringing fluidity to the creation process. As Data Products incorporate fresh metadata to guide Data Consumers on their usage, an EDM assumes a pivotal role in shaping and exploring metadata related to Data Products – essentially serving as the Experience Plane within the Data Mesh framework.

Adhering to domain-specific nuances notably reduces both the volume and variety of metadata and makes the curation process more efficient. In such instances, a robust EDM, anchored by a potent Data Catalog like Zeenea, emerges as the core engine. This EDM not only facilitates the design of domain-specific ontologies but also boasts automated harvesting capabilities from both on-premises and cloud-based data sources. Moreover, it empowers the federation of Data Catalogs to implement diverse Data Mesh topologies and grants end-users an effortlessly intuitive eCommerce-like Data Shopping experience.

Reason #2: Rationalize Existing Investments

 

By utilizing an EDM (alongside a powerful Data Catalog), existing investments in modern data platforms and people can be significantly enhanced. Eliminating intricate data pipelines, where data often doesn’t need to be moved, results in substantial cost savings. Similarly, cutting down on complex, numerous, and unnecessary synchronization meetings with cross-functional teams leads to considerable time savings.

Therefore, a focused approach is maintained by the federated governance body, concentrating solely on Data Mesh-related activities. This targeted strategy optimizes resource allocation and accelerates the creation of incremental, delegated Data Products, reducing the Time to Value.

To ensure measurable outcomes, closely monitoring the performance of Data Products with accurate KPIs becomes paramount – This proactive measure enhances decision-making and contributes to the delivery of tangible results.

Reason #3: Achieve Better Adoption Than with a Data Catalog Only

 

An EDM, coupled with a powerful Data Catalog, plays a pivotal role in facilitating adoption. At the domain level, it aids in designing and curating domain-specific metadata easily understood by Domain Business Users. This avoids the need for a “common layer”, a typical pitfall in Data Catalog adoption. At the Mesh Level, it offers means to consume Data Products effectively, providing information on location, version, quality, state, provenance, platform, schema, etc. A dynamic domain-specific metamodel, coupled with strong search and discovery capabilities, makes the EDM a game-changer.

The EDM’s added value lies in provisioning and access rights, integrating with ticketing systems, dedicated Data Policy Enforcement platforms, and features from Modern Data platform vendors – a concept termed Computational Data Governance.

Reason #4: Clarify Accountability and Monitor Value Creation Performance

 

Applying Product Management principles to Data Products and assigning ownership to domains brings clarity to responsibilities. Each domain becomes accountable for the design, production, and life cycle management of its Data Products. This focused approach ensures that roles and expectations are well-defined.

The EDM then opens up Data Products to the entire organization, setting standards that domains must adhere to. This exposure helps maintain consistency and ensures that Data Products align with organizational goals and quality benchmarks.

In the EDM framework, companies establish tangible KPIs to monitor the business performance of Data Products. This proactive approach enables organizations to assess the effectiveness of their data strategies. Additionally, it empowers Data Consumers to contribute to the evaluation process through crowd-sourced ratings, fostering a collaborative and inclusive environment for feedback and improvement.

Reason #5: Apply Proven Lean Software Development Principles to Data Strategy

 

The creation of Data Products follows a similar paradigm to the Lean Software Development principles that revolutionized digital transformation. Embracing principles like eliminating waste, amplifying learning, deciding late, delivering fast, and building quality is integral to the approach that a Data Mesh can enable.

In this context, the EDM acts as a collaborative platform for teams engaged in the creation of Data Products. It facilitates:

 

  • Discovery Features: Offering automatic technical curation of data types, lineage information, and schemas, enabling the swift creation of ad hoc products.
  • Data Mesh-Specific Metadata Curation: The EDM incorporates automatic metadata curation capabilities specifically tailored for Data Mesh, under the condition that the Data Catalog has federation capabilities.
  • 360 Coverage of Data Products Information: Ensuring comprehensive coverage of information related to Data Products, encompassing their design and delivery aspects.

In essence, the collaboration between an Enterprise Data Marketplace and a powerful Data Catalog not only enhances the overall data ecosystem but also brings about tangible benefits by optimizing investments, reducing unnecessary complexities, and improving the efficiency of the data value creation process.

Everything you need to know about Data Products

In recent years, the data management and analytics landscape has witnessed a paradigm shift with the emergence of the Data Mesh framework. Coined by Zhamak Dehghani in 2019, Data Mesh is a framework that emphasizes a decentralized and domain-oriented approach to managing data. One notable discipline in the Data Mesh architecture is to treat data as a product, introducing the concept of “data products”. However, the term “data product” is often tossed around without a clear understanding of its essence. In this article, we will shed light on everything you need to know about data products and data product thinking.

Shifting to Product Thinking

 

For organizations to treat data as products and transform their datasets into data products, teams must first shift to a product-thinking mindset. According to J. Majchrzak et al. in Data Mesh in Action,

Product thinking serves as a problem-solving methodology, prioritizing the comprehensive understanding of user needs and the core problem at hand before delving into the product creation process. The primary objective is to narrow the gap between user requirements and the proposed solution.

In their book, they highlight two main principles:

  • Love the problem, not the solution: Before embarking on the design phase of a product, it is imperative to gain an understanding of the users and the specific problem being addressed.
  • Think in products, not features: While there is a natural inclination to concentrate on adding new features and customizing assets, it is crucial to view data as a product that directly satisfies user needs.

Therefore, before unveiling a dataset, adhering to product thinking involves posing essential questions:

 

  • What is the problem that you want to solve?
  • Who will use your data product?
  • Why are you doing this? What is the vision behind it?
  • What is your strategy? How will you do it?

Here are some examples of answers to these questions from an excerpt of Data Mesh in Action:

What is the problem that you want to solve? Currently, the production cost statement data is used for direct billing between the production team and finance team. The data file also has costs assigned to categories. This information could be used for more complex analysis and cost comparisons across categories of different productions. Therefore, making this data more widely available for complex analysis makes sense.

Who will use your product? The data analyst will use it to manually analyze and compile production costs and forecast budgets for new productions. The data engineer will use it to import data into the analytical solution.

Why are you doing this? What is the vision behind it? We will create a dedicated and customized solution to analyze the data for production costs and planning activities. Data engineers can use the original files to import historical data.

Read the full excerpt here: https://livebook.manning.com/book/data-mesh-in-action/chapter-5/37

Data Product Definition

 

The philosophy of product thinking therefore urges us to view a data product through a long-term lens, entailing ongoing development, adaptation based on user feedback, and a commitment to continuous improvement and quality. A product, by definition, is an object, system, or service made available for consumer use in response to consumer demand. So what makes a data product a data product?

At Zeenea, we define a Data Product as a set of value-driven data assets specifically designed and managed to be consumed quickly and securely while ensuring the highest level of quality, availability, and compliance with regulations and internal policies.

According to Data Mesh in Action, the deliberate use of the term “product” in the context of a data mesh is intentional and stands in contrast to the commonly used term “project” in organizational initiatives. It is important to underscore that the creation of a data product is not synonymous with a project. As mentioned in Products Over Projects by Sriram Narayan, projects are temporal endeavors aimed at achieving specific goals, with a defined endpoint that may not necessarily lead to continuity.

Fundamental Characteristics of a Data Product

 

In How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, Zhamak Dehghani says a data product must exhibit the following essential characteristics:

Discoverable:

 

Ensuring the easy discoverability of a data product is imperative. A widely adopted approach involves implementing a registry or data catalog containing comprehensive meta-information such as owners, source of origin, lineage, and sample datasets for all available data products.
This centralized discoverability enables data consumers, engineers, and scientists within an organization to locate datasets of interest effortlessly.

Addressable:

 

Once discovered, a data product should possess a unique address following a global convention for programmable access. Organizations, influenced by the storage and format of their data, may adopt diverse naming conventions. In pursuit of user-friendly accessibility, common conventions become imperative in a decentralized architecture.
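
As a small illustration of such a convention, an address might be built deterministically from the domain, product name, version, and output port. The URN scheme below is purely hypothetical and shown only to make the idea concrete.

```python
def data_product_address(domain: str, name: str, version: str, port: str) -> str:
    """Build a stable, convention-based address for a data product output port."""
    return f"urn:dataproduct:{domain}:{name}:{version}:{port}"

# Example: the SQL port of the finance domain's production-costs product.
print(data_product_address("finance", "production-costs", "2.1.0", "sql"))
# -> urn:dataproduct:finance:production-costs:2.1.0:sql
```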

Trustworthy and Truthful:

 

Data product owners must commit to Service Level Objectives regarding the truthfulness of data, requiring a shift from traditional error-prone extractions. Employing techniques such as data cleansing and automated integrity testing during the data product’s creation is crucial to ensure an acceptable level of quality.
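
For instance, an automated integrity test might check a dataset against the owner’s Service Level Objectives before each publication. The thresholds and columns below are hypothetical; this is a sketch of the idea, not a specific tool.

```python
import pandas as pd

# Hypothetical SLOs the product owner commits to.
SLO = {"max_null_ratio": 0.01, "min_rows": 1000}

def check_integrity(df: pd.DataFrame) -> dict:
    """Return pass/fail results for each SLO-backed integrity check."""
    null_ratio = df.isna().mean().max()   # worst column-level null ratio
    return {
        "null_ratio_ok": bool(null_ratio <= SLO["max_null_ratio"]),
        "row_count_ok": len(df) >= SLO["min_rows"],
    }

df = pd.DataFrame({"order_id": range(1200), "amount": [10.0] * 1200})
print(check_integrity(df))  # {'null_ratio_ok': True, 'row_count_ok': True}
```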

Self-Describing Semantics and Syntax:

 

High-quality data products demand a user experience without the need for handholding—they should be independently discoverable, understandable, and consumable. To construct datasets as products with minimal friction for data engineers and scientists, it is essential to articulate the semantics and syntax of the data thoroughly.

Inter-Operable and Governed by Global Standards:

 

Correlating data across domains in a distributed architecture relies on adherence to global standards and harmonization rules. Governance of standardization, including field formatting, polyseme identification, address conventions, metadata fields, and event formats, ensures interoperability and meaningful correlation.

Secure and governed by a global access control

 

Securing access to product datasets is imperative, whether the architecture is centralized or decentralized. In the realm of decentralized, domain-oriented data products, access control operates at a more nuanced level—specifically tailored for each domain data product. Just as operational domains centrally define access control policies, these policies are applied dynamically when accessing individual dataset products. Leveraging an Enterprise Identity Management system, often facilitated through Single Sign-On (SSO), and employing Role-Based Access Control (RBAC) policies, provides a convenient and effective approach to implement access control for product datasets.
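
A minimal sketch of Role-Based Access Control for product datasets is shown below. The roles and grants are hypothetical; in practice, roles would come from the enterprise identity provider via SSO.

```python
# Role-based access control: roles map to the data products they may consume.
ROLE_GRANTS = {
    "finance-analyst": {"production-costs", "budget-forecasts"},
    "data-scientist": {"plant-telemetry"},
}

def can_access(roles: set, product: str) -> bool:
    """A user may access a product if any of their roles grants it."""
    return any(product in ROLE_GRANTS.get(role, set()) for role in roles)

# Roles would typically be resolved from the enterprise identity provider (SSO).
print(can_access({"finance-analyst"}, "production-costs"))  # True
print(can_access({"data-scientist"}, "production-costs"))   # False
```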

Examples of Data Products

 

A potential data product can take various forms, with different data representations that offer value to users. Here are several examples of technologies containing data products:

 

  • Recommendation Engines: Platforms like Netflix, Amazon, and Spotify use recommendation engines as data products to suggest content or products based on user behavior and preferences.
  • Predictive Analytics Models: Models predicting customer churn, sales forecasts, or equipment failures are examples of data products that provide valuable insights for decision-making.
  • Fraud Detection Systems: Financial institutions deploy data products to detect and prevent fraudulent activities by analyzing transaction patterns and identifying anomalies.
  • Personalized Marketing Campaigns: Targeted advertising and personalized marketing campaigns utilize data products to tailor content based on user demographics, behavior, and historical interactions.
  • Healthcare Diagnostics Tools: Diagnostic tools that analyze medical data, such as patient records and test results, to assist healthcare professionals in making accurate diagnoses.
Zeenea Product Recap: A look back at 2023

2023 was another big year for Zeenea. With more than 50 releases and updates to our platform, these past 12 months were filled with lots of new and improved ways to unlock the value of your enterprise data assets. Indeed, our teams consistently work on features that simplify and enhance the daily lives of your data and business teams.

In this article, we’re thrilled to share with you some of our favorite features from 2023 that enabled our customers to:

  • Decrease data search and discovery time
  • Increase Data Steward productivity & efficiency
  • Deliver trusted, secure, and compliant information across the organization
  • Enable end-to-end connectivity with all their data sources

Decrease data search and discovery time

 

One of Zeenea’s core values is simplicity. We strongly believe that data discovery should be quick and easy to accelerate data-driven initiatives across the entire organization.

In fact, many data teams still struggle to find the information they need for a report or use case – either because the data is scattered across various sources, files, or spreadsheets, or because they are confronted with such an overwhelming amount of information that they don’t even know where to begin their search.

In 2023, we designed our platform with simplicity in mind. By providing quick and easy ways to explore data, Zeenea enabled our customers to find, discover, and understand their assets in seconds.

A fresh new look for the Zeenea Explorer

 

One of the first ways our teams wanted to enhance the discovery experience of our customers was by providing a more user-friendly design to our data exploration application, Zeenea Explorer. This redesign included:

New Homepage

 

Our homepage needed a brand-new look and feel for a smoother discovery experience. Indeed, for users who don’t know what they are looking for, we added brand-new exploration paths directly accessible via the Zeenea Explorer homepage.

 

  • Browsing by Item Type: If users are sure of the type of data asset they are looking for, such as a dataset, visualization, data process, or custom asset, they can access the catalog directly, pre-filtered by that asset type.
  • Browsing through the Business Glossary: Users can quickly navigate through the enterprise’s Business Glossary by directly accessing the Glossary assets that were defined or imported by stewards in Zeenea Studio.
  • Browsing by Topic: The app enables users to browse through a list of Items that represent a specific theme, use case, or anything else that is relevant to business (more information below).

New Item Detail Pages

 

To understand a catalog Item at a glance, one of the first notable changes was the position of the Item’s tabs. The tabs were originally positioned on the left-hand side of the page, which took up a lot of space. Now, the tabs are at the top of the page, more closely reflecting the layout of the Studio app. This new layout allows data consumers to find the most significant information about an Item such as:

  • The highlighted properties, defined by the Data Steward in the Catalog Design,
  • Associated Glossary terms, to understand the context of the Item,
  • Key people, to quickly reach the contacts that are linked to the Item.

In addition, our new layout allows users to find all fields, metadata, and all other related items instantly. Divided into three separate tabs in the old version, data consumers now find the Item’s description and all related Items in a single “Details” tab. Indeed, depending on the Item Type you are browsing through, all fields, inputs & outputs, parent/children Glossary Items, implementations, and other metadata are in the same section, saving you precious data discovery time.

Lastly, the spaces for our graphical components were made larger – users now have more room to see their Item’s lineage, data model, etc.

New Item Detail Page Zeenea Explorer

New Filtering system

 

Zeenea Explorer offers a smart filtering system to contextualize search results. Users can rely on Zeenea’s preconfigured filters, such as item type, connection, or contact, or on the organization’s own custom filters. For even more efficient searches, we redesigned our search results page and filtering system:

 

  • Available filters are always visible, making it easier to narrow down the search,
  • By clicking on a search result, an overview panel with more information is always available without losing the context of the search,
  • The filters most relevant to the search are placed at the top of the page, allowing users to quickly get the results needed for specific use cases.
New Filtering System Explorer

Easily browsing the catalog by Topic

 

One major 2023 release was our Topics feature. Indeed, to enable business users to (even more!) quickly find their data assets for their use cases, Data Stewards can easily define Topics in Zeenea Studio. To do so, they simply select the filters in the Catalog that represent a specific theme, use case, or anything else that is relevant to business.

Data teams using Zeenea Explorer can therefore easily and quickly search through the catalog by Topic to reduce their time searching for the information they need. Topics can be directly accessed via the Explorer homepage and the search bar when browsing the catalog.

Browse By Topic Explorer New

Alternative names for Glossary Items for better discovery

 

In order for users to easily find the data and business terms they need for their use cases, Data Stewards can add synonyms, acronyms, and abbreviations for Glossary Items!

Ex: Customer Relationship Management > CRM

Alternative Names Zeenea Studio

Improved search performance

 

Throughout the year, we implemented a significant number of improvements to enhance the efficiency of the search process. The addition of stop words, encompassing pronouns, articles, and prepositions, ensures more refined and pertinent results for queries. Moreover, we added an “INFIELD:” operator, giving users the ability to search for Datasets that contain a specific field.
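For example, a search such as INFIELD:customer_email would return only the Datasets containing a field named “customer_email” (the field name here is purely illustrative; refer to the search documentation for the exact syntax).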

Search In Fields Explorer

Microsoft Teams integration

 

In 2023, Zeenea also strengthened its communication and collaboration capabilities. Specifically, when a contact is linked to a Microsoft email address, Zeenea now facilitates the initiation of direct conversations via Teams. This integration allows Teams users to promptly engage with relevant individuals for additional information on specific Items. Other integrations with various tools are in the works. ⭐️

Microsoft Teams Zeenea Explorer

Increase Data Steward productivity & efficiency

 

Our goal at Zeenea is to simplify the lives of data producers so they can efficiently manage, maintain, and enrich the documentation of their enterprise data assets in just a few clicks. Here are some features and enhancements that help them stay organized, focused, and productive.

Automated Datasets Import

 

When importing new Datasets in the Catalog, administrators can turn on our Automatic Import feature which automatically imports new Items after each scheduled inventory. This time-saving enhancement increases operational efficiency, allowing Data Stewards to focus on more strategic tasks rather than the routine import process.

Auto Import Zeenea Studio 2

Orphan Fields Deletion

 

We’ve also added the ability to manage Orphan Fields more effectively. This includes the option to perform bulk deletions of Orphan Fields, accelerating the process of decluttering and organizing the catalog. Alternatively, Stewards can delete a single Orphan Field directly from its detail page, providing a more granular and precise approach to catalog maintenance.

Orphan Field Details

Building reports based on the content of the catalog

 

We added a new section in Zeenea Studio – The Analytics Dashboard – to easily create and build reports based on the content and usage of the organization’s catalog.

Directly on the Analytics Dashboard page, Stewards can view the completion level of their Item Types, including Custom Items. Each Item Type element is clickable to quickly view the Catalog section filtered by the selected Item Type.

For more detailed information on the completion level of a particular Item Type, Stewards can create their own analyses! They select an Item Type and a Property, and can then consult, for each value of this Property, the completion level of the Item’s template, including its description and linked Glossary Items.

New Analytics Dashboard Gif Without Adoption

New look for the Steward Dashboard

 

Zeenea Explorer isn’t the only application that got a makeover! Indeed, to help Data Stewards stay organized, focused, and productive, we redesigned the Dashboard layout to be more intuitive to get work done faster. This includes:

 

  • New Perimeter design: A brand new level of personalization when logging in to the Dashboard. The perimeter now extends beyond Dataset completion – it includes all the Items that one is a Curator for, including Fields, Data Processes, Glossary Items, and Custom Items.
  • Watchlists Widget: Just as Data Stewards create Topics for enhanced organization for Explorer users, they can now create Watchlists to facilitate access to Items requiring specific actions. By filtering the catalog with the criteria of their choice, Data Stewards save these preferences as new Watchlists via the “Save filters as” button, and directly access them via the Watchlist widget when logging on to their Dashboard.
  • The Latest Searches widget: Caters specifically to the Data Steward, focusing on their recent searches to enable them to pick up where they left off.
  • The Most Popular Items widget: The Items within the Data Steward’s Perimeter that are most consulted and used by other users. Each Item is clickable, giving instant access to its contents.

 

View the Feature Note

 

New Steward Dashboard Studio

Deliver trusted, secure, and compliant information across the organization

Data Sampling on Datasets

 

For select connections, it is possible to get Data Sampling for Datasets. Our Data Sampling capabilities allow users to obtain representative subsets of existing datasets, offering a more efficient approach to working with large volumes of data. With Data Sampling activated, administrators can configure fields to be obfuscated, mitigating the risk of displaying sensitive personal information.

This feature carries significant importance to our customers, as it enables users to save valuable time and resources by working with smaller, yet representative, portions of extensive datasets. This also allows early identification of data issues, thereby enhancing overall data quality and subsequent analyses. Most notably, the capacity to obfuscate fields addresses critical privacy and security concerns, allowing users to engage with anonymized or pseudonymized subsets of sensitive data, ensuring compliance with privacy regulations, and safeguarding against unauthorized access.

Data Sampling Zeenea Studio

Powerful Lineage capabilities

 

In 2022, we made a lot of improvements to our Lineage graph. Not only did we simplify its design and layout, but we also made it possible for users to display only the first level of lineage, expand and close the lineage on demand, and get a highlighted view of the direct lineage of a selected Item.

This year, we made other significant UX changes, including the possibility to expand or reduce all lineage levels in one click, hide data processes that don’t have at least one input and one output, and easily view connections with long names via a tooltip.

The most notable release, however, is Field-level lineage! Indeed, it is now possible to retrieve the input and output Fields of tables and reports and, for more context, add the operation’s description. Users can then directly view their Field-level transformations over time in the Data Lineage graph, in both Zeenea Explorer and Zeenea Studio.

Field Level Lineage Zeenea Studio 2

Data Quality Information on Datasets

 

By leveraging GraphQL and knowledge graph technologies, the Zeenea Data Discovery Platform provides a flexible approach to integrating best-of-breed data quality solutions. It synchronizes datasets via simple query and mutation operations from third-party DQM tools through our Catalog API capabilities. The DQM tool delivers real-time data quality scan results to the corresponding dataset within Zeenea, enabling users to conveniently review data quality insights directly within the catalog.

This new feature includes:

  • A Data Quality tab in your Dataset’s detail pages, where users can view its Quality checks as well as the type, status, description, last execution date, etc.
  • The possibility to view more information on the Dataset’s quality directly in the DQM tool via the “Open dashboard in [Tool Name]” link.
  • A data quality indicator of Datasets directly displayed in the search results and lineage.
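To make this integration more concrete, here is a minimal Python sketch of how a DQM tool could push a quality check result to a dataset through a GraphQL mutation. The endpoint URL, authentication header, mutation name, and field names below are hypothetical placeholders used for illustration only, not Zeenea’s actual schema – refer to the Catalog API documentation for the real operations.

import requests

# Hypothetical values – replace with your actual tenant URL, API key, and schema.
GRAPHQL_ENDPOINT = "https://your-tenant.example.com/api/catalog/graphql"  # placeholder URL
HEADERS = {"X-API-KEY": "YOUR_API_KEY"}  # header name is an assumption

# Hypothetical mutation: attach a data quality check result to a dataset.
MUTATION = """
mutation AddQualityCheck($datasetRef: String!, $check: QualityCheckInput!) {
  addQualityCheck(datasetRef: $datasetRef, check: $check) {
    status
  }
}
"""

variables = {
    "datasetRef": "sales_db/public/customers",  # illustrative dataset reference
    "check": {
        "name": "null_rate_customer_id",
        "type": "completeness",
        "status": "PASSED",
        "description": "customer_id must never be null",
        "lastExecution": "2023-11-02T08:30:00Z",
    },
}

response = requests.post(
    GRAPHQL_ENDPOINT,
    json={"query": MUTATION, "variables": variables},
    headers=HEADERS,
    timeout=30,
)
response.raise_for_status()
print(response.json())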

 

View the Feature Note

Zeenea Explorer Data Quality Graph

Enable end-to-end connectivity with all their data sources

 

With Zeenea, connect to all your data sources in seconds. Our platform’s built-in scanners and APIs enable organizations to automatically collect, consolidate, and link metadata from their data ecosystem. This year, we made significant enhancements to our connectivity to enable our customers to build a platform that truly represents their data ecosystem.

Catalog Management APIs

 

Recognizing the importance of API integration, Zeenea has developed powerful API capabilities that enable organizations to seamlessly connect and leverage their data catalog within their existing ecosystem.

In 2023, Zeenea developed Catalog APIs, which help Data Stewards with their documentation tasks. These Catalog APIs include:

Query operations to retrieve specific catalog assets: Our API query operations include retrieving a specific asset, either by its unique reference or by its name & type, as well as retrieving a list of assets by connection or by a given Item type. Indeed, Zeenea’s Catalog APIs offer flexible querying, letting users narrow down results so they are not overwhelmed with a plethora of information.

Mutation operations to create and update catalog assets: To save even more time when documenting and updating company data, Zeenea’s Catalog APIs enable data producers to easily create, modify, and delete catalog assets. They support the creation, update, and deletion of Custom Items and Data Processes as well as their associated metadata, and the update of Datasets and Data Visualizations. This is also possible for Contacts, which is particularly important when users leave the company or change roles – data producers can easily transfer the information that was linked to a particular person to another.
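As an illustration of what such query and mutation operations can look like from a script, below is a minimal Python sketch: one GraphQL query that retrieves assets by name and type, and one mutation that transfers a departing contact’s Items to a colleague. The endpoint, authentication header, operation names, and fields are hypothetical placeholders, not Zeenea’s published schema.

import requests

GRAPHQL_ENDPOINT = "https://your-tenant.example.com/api/catalog/graphql"  # placeholder URL
HEADERS = {"X-API-KEY": "YOUR_API_KEY"}  # header name is an assumption


def run(operation: str, variables: dict) -> dict:
    """Send a GraphQL operation to the (hypothetical) Catalog API endpoint."""
    resp = requests.post(
        GRAPHQL_ENDPOINT,
        json={"query": operation, "variables": variables},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Hypothetical query: retrieve assets by name and type, limited to a few fields.
FIND_ASSETS = """
query FindAssets($name: String!, $type: String!) {
  items(name: $name, type: $type) {
    key
    name
    description
  }
}
"""

# Hypothetical mutation: transfer the Items curated by a departing contact to a colleague.
REASSIGN_CONTACT = """
mutation ReassignContact($fromEmail: String!, $toEmail: String!) {
  reassignContact(fromEmail: $fromEmail, toEmail: $toEmail) {
    updatedItemCount
  }
}
"""

print(run(FIND_ASSETS, {"name": "customers", "type": "Dataset"}))
print(run(REASSIGN_CONTACT, {"fromEmail": "jane.doe@acme.com", "toEmail": "john.smith@acme.com"}))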

 

Read the Feature Note

Property & Responsibility Codes management

 

Another feature implemented this year was the ability to add codes to properties & responsibilities, making them easy to use in API scripts for more reliable queries & retrievals.

For all properties and responsibilities that were built in Zeenea (e.g., Personally Identifiable Information) or harvested from connectors, it is possible to modify their names and descriptions to better suit the organization’s context.
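As a brief, hypothetical sketch of why stable codes matter: a script that filters on a code keeps working even if stewards later rename a property’s display name to fit the organization’s vocabulary. The structure, property names, and codes below are invented for illustration.

# Hypothetical property metadata, as a catalog API response might expose it.
properties = [
    {"code": "pii", "name": "Personally Identifiable Information", "value": True},
    {"code": "refresh_frequency", "name": "Refresh frequency", "value": "daily"},
]

# Filtering on the stable code "pii" still works after the display name is renamed.
pii_flag = next(p["value"] for p in properties if p["code"] == "pii")
print(f"Contains personal data: {pii_flag}")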

Property Responsibility Codes Studio

More than a dozen more connectors to the list

 

At Zeenea, we develop advanced connectors to automatically synchronize metadata between our data discovery platform and all your sources. This native connectivity saves you the tedious and challenging task of manually finding the data you need for a specific business use case – a task that often requires access to scarce technical resources.

In 2023 alone, we developed over a dozen new connectors! This achievement underscores our agility and proficiency in swiftly integrating with diverse data sources utilized by our customers. By expanding our connectivity options, we aim to empower our customers with greater flexibility and accessibility.

 

View our connectors

[Press Release] Consulting firm Amura IT announces a new strategic alliance with Zeenea

[Press Release] Consulting firm Amura IT announces a new strategic alliance with Zeenea

This agreement marks a significant milestone in the strategy of both companies by combining Amura IT’s expertise in developing solutions in its three lines of business: Digital, Intelligence, and AI, with Zeenea’s innovative data catalog and data discovery platform.

Madrid, July 19th, 2023 – Zeenea, a leading company in active metadata management and data discovery, has signed a strategic alliance agreement with consulting firm Amura IT.

Amura IT, recognized for its over 20 years of experience as a technology group, has demonstrated its ability to help companies in their digital transformation. It has three distinct service lines: Amura IA, a recently created line; Amura Digital, which provides a satisfying customer and business experience adapted to the new digital era; and Amura Intelligence, whose objective is to assist companies in their transformation with Analytics and Big Data solutions. It is within this line of business that the alliance with Zeenea primarily falls.

The integration of Zeenea’s intuitive technology will enable Amura IT to offer data cataloging and data discovery solutions to a larger number of users, regardless of their technical knowledge. At the same time, this partnership will allow Zeenea to expand its reach to a broader user base locally in Spain.

Richard Mathis, VP of Sales at Zeenea and director of business development in Spain and Portugal, expressed:

“We are delighted to join forces with Amura IT, a leading specialist in technology consulting and digital transformation in the Spanish and Portuguese markets. The majority of Zeenea’s operations are already taking place overseas, and this alliance will support our efforts and commitment to bring a strong local presence and expertise to these key southern European markets. We look forward to building strong and valuable relationships with Amura IT and organizations eager to unlock the full potential of their data.”

José Antonio Fernández, director of business development and alliances at Amura IT, highlighted the strategic importance of this partnership:

“The alliance with Zeenea is a decisive step in our strategy, as it will allow us to enhance our capabilities and offer innovative solutions focused on metadata cataloging, search, and lineage. Our clients will be able to access the past, present, and future of their data assets in an agile and effective manner.”

With the signing of this agreement, Zeenea and Amura IT demonstrate their shared commitment to providing cutting-edge solutions and helping companies maximize the value of their enterprise data. This strategic partnership promises to drive digital transformation and improve data-driven decision-making for their clients in Spain and beyond.

 

About Amura IT

Amura IT is a technology consulting firm that was established in 2019 to offer solutions that support digital transformation in organizations. It currently offers three lines of business: Amura Intelligence, to assist companies in their transformation with Analytics and Big Data solutions primarily; Amura Digital, to provide a satisfying customer and business experience adapted to the new digital era; and Amura IA, focused on solutions based on artificial intelligence.

Key takeaways from the Zeenea Exchange 2023: Unlocking the power of the enterprise Data Catalog

Key takeaways from the Zeenea Exchange 2023: Unlocking the power of the enterprise Data Catalog

Each year, Zeenea organizes exclusive events that bring together our clients and partners from various organizations, fostering an environment for collaborative discussions and the sharing of experiences and best practices. The third edition of the Zeenea Exchange France was held in the heart of Paris’ 8th arrondissement with our French-speaking customers and partners, whereas the first edition of the Zeenea Exchange International was an online event gathering clients from all around the world.

In this article, we will take a look back and give key insights into what was discussed during these client round tables – both taking place in June 2023 – with the following topic: “What are your current & future uses and objectives for your data catalog initiatives?”.

What motivated our clients for implementing a data catalog solution?

Exploding volumes of information

 

Most of our clients faced the challenge of having to collect and inventory large amounts of information from different sources. Indeed, a significant number of our participants embarked on their data-driven journey by adopting a Data Lake or another platform to store their information. However, they soon realized that it was difficult to manage this large ocean of data, and questions such as “What data do I have? Where does it come from? Who is responsible for this data? Do I have the right to see this data? What does this data mean?” began to arise.

Consequently, finding a solution that could automate the centralization of enterprise information and provide accurate insights about their data became a crucial objective, leading to the search for a data catalog solution.

Limited data access

 

Another common data challenge that arose was data access. Prior to centralizing their data assets into a common repository, many companies faced the issue of disparate information systems dedicated to different business lines or departments within the organization. Data was therefore kept in silos, making efficient reporting or communication around information difficult, if not impossible. The need to make data available to all was another key reason why our clients searched for a solution that could democratize data access for those who need it.

Unclear roles & responsibilities

 

Another major reason for searching for a data catalog was to give clear roles and responsibilities to their different data consumers and producers. The purpose of a data catalog is to centralize and maintain up-to-date contact information for each data asset, providing clear visibility on the appropriate person or entity to approach when questions arise regarding a specific set of data.

What are the current uses & challenges regarding their data catalog initiatives?

A lack of a common language

 

Creating a shared language for data definitions and business concepts is a significant challenge faced by many of our clients. This issue is particularly prevalent when different business lines or departments lack alignment in defining specific concepts or KPIs. For example, some KPIs may lack clear definitions or multiple versions of the same KPI may exist with different definitions. Given some of our clients’ complex data landscapes, achieving alignment among stakeholders regarding the meaning and definitions of concepts poses significant challenges and remains a crucial endeavor.

More autonomy to the business users

 

The implementation of a data catalog has brought a significant increase in autonomy for business users across the majority of our clients. By utilizing Zeenea, which offers intuitive search and data discovery capabilities across the organization’s data landscape, non-technical users now have a user-friendly and efficient means to locate and utilize data for their reports and use cases. One client in the banking industry expressed how the data catalog accelerated the search, discovery, and acquisition of data. Overall it improved data understanding, facilitated access to existing data, and enhanced the overall quality analysis process, thereby instilling greater trust in the data for users.

Catalog adoption remains difficult

 

Another significant challenge faced by some of our clients is the difficulty in promoting data catalog adoption and fostering a data-driven culture. This resistance can be attributed to many users being unfamiliar with the benefits that the data catalog can provide. Establishing a data-driven culture requires dedicated efforts to explain the advantages of using a data catalog. This can be accomplished by promoting it to different departments through effective communication channels, organizing training sessions, and highlighting small successes that demonstrate the value of the tool throughout the organization.

The benefits of automation

 

A valuable feature of the data catalog is the automation of time-consuming tasks related to data collection, which proves to be a significant strength for many of our clients. Indeed, Zeenea’s APIs enable the retrieval of external metadata from different sources, facilitating the inventorying of glossary terms, ownership information, technical and business quality indicators from data quality tools, and more. Furthermore, it was expressed that the data catalog helps expedite IT transformation programs and the integration of new systems, enabling better planning for new integrations.

The next steps in their data cataloging journey

Towards a Data Mesh approach

 

Some of our customers, particularly those who attended the International Edition, have shown interest in adopting a Data Mesh approach. According to a poll conducted during the event, 66% of the respondents are either considering or currently deploying a Data Mesh approach within their organizations. One client shared that they have data warehouses and Data Lakes, but that the lack of transparency regarding data ownership and usage within different domains prompted the need for more autonomy and a shift from a centralized Data Lake to a domain-specific architecture.

Zeenea as a central repository

 

Many of our clients, regardless of their industry or size, leverage the data catalog as a centralized repository for their enterprise data. This approach helps them consolidate information from multiple branches or subsidiaries into a single platform, avoiding duplicates and ensuring data accuracy. Indeed, the data catalog’s objective is to enable our clients to find data across departments, facilitating the use of shared solutions and enhancing data discovery and understanding processes.

Using the data catalog for compliance initiatives

 

Compliance initiatives are indeed gaining importance for organizations, particularly in industries such as banking and insurance. In a poll conducted during the International Edition, we found that 50% of the respondents currently use the data catalog for compliance purposes, while the other 50% may consider using it in the future.

A client who expressed that compliance is a priority shared that they are considering building an engine to query and retrieve information about the data they have on an individual if requested. Others have future plans to leverage the data catalog for compliance and data protection use cases. They aim to classify data and establish a clear understanding of its lineage, enabling them to track where data originates, how it flows, and where it is utilized.

If any of this feedback and testimonials echo your day-to-day experience within your company, please don’t hesitate to contact us. We’d be delighted to welcome you to the Zeenea user community and invite you to our next Zeenea Exchange events.

Zeenea Revolutionizes Data Discovery with NLP Search – OpenAI Integration

Zeenea Revolutionizes Data Discovery with NLP Search – OpenAI Integration

Zeenea is happy to announce the integration of Natural Language Processing (NLP) search capabilities in our Data Discovery Platform! This groundbreaking feature allows users to interact with Zeenea’s search engine using everyday language, making data exploration more intuitive and efficient.

Let’s explore how this innovation empowers users to obtain accurate and relevant results from their data searches.

How was Zeenea’s NLP Search Integration achieved?

 

In order to accomplish this functionality, Zeenea leveraged the potential of OpenAI’s APIs and the advanced language processing capabilities of GPT-3.5. Zeenea’s engineers designed a prompt that effectively converts natural language questions into search queries and filters.
And voilà! Users enjoy a smooth and effortless experience, as the search engine comprehends and responds to queries like a human expert would.
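The prompt itself is not public, but the following minimal Python sketch shows the general pattern described above: instructing a GPT-3.5 model to turn a natural-language question into a structured search query and a set of filters. The system prompt, filter names, and output format are assumptions made for illustration and are not Zeenea’s actual implementation.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed instruction: the model answers with a JSON search query plus filters.
SYSTEM_PROMPT = (
    "You translate natural-language questions about a data catalog into JSON of the form "
    '{"query": "<keywords>", "filters": {"item_type": null, "connection": null}}. '
    "Fill in only the filters that the question mentions. Answer with JSON only."
)

question = "Please find all datasets holding customer data in the central Data Lake."

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
)

parsed = json.loads(completion.choices[0].message.content)
print(parsed)  # e.g. {"query": "customer", "filters": {"item_type": "Dataset", "connection": "Central Data Lake"}}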

Some examples of NLP searches in Zeenea

 

Zeenea’s NLP search functionality opens up a world of possibilities for users to interact with their data catalog more effortlessly. Here are a few examples of the questions you can now ask in Zeenea’s search engine:

→ “Please find all datasets holding customer data in the central Data Lake.”
→ “Please list all duplicated datasets in the catalog.”
→ “Where can I find an analysis of our historical customer retention performance?”

These queries showcase the flexibility and convenience of communicating with Zeenea using natural language. Whether you prefer a casual tone or a more professional approach, Zeenea’s search engine understands your intent and delivers accurate results.

Nlp Zeenea Explorer

 

A feature still in development

 

Although the NLP search feature is currently in an experimental phase, Zeenea is actively collaborating with select customers to ensure its accuracy and relevance in various contexts. Indeed, Zeenea’s dynamic knowledge graph structure necessitates extensive real-world testing to fine-tune the system to provide the best possible experience to our users.

On the road to AI-Driven Data Discovery

 

Zeenea’s dedication to innovation goes beyond NLP search. We are exploring several AI-powered features that promise to revolutionize the data discovery landscape. Some of the exciting developments include:

  • An Interactive chatbot: The development of an interactive chatbot that could offer an alternative conversational search experience so users can engage in natural conversations to obtain relevant information and insights.
  • Automated generation & correction of business definitions: Zeenea aims to expedite catalog sourcing and enhance the quality of the glossary by automatically generating and correcting domain-specific business definitions.
  • Automatic summarization of descriptions: An automatic summarization that would enable users to grasp essential information quickly by condensing lengthy descriptions into concise summaries, ultimately saving time and improving their data comprehension.
  • Improved Auto-Classification and Data Tagging Suggestions: Zeenea’s AI algorithms are being enhanced to provide more accurate auto-classification and data tagging suggestions.

…and more!

 

Stay tuned for more exciting developments from Zeenea as we continue to revolutionize the data discovery landscape.

The top 5 benefits of data lineage

The top 5 benefits of data lineage

Do you have the ambition to turn your organization into a data-driven enterprise? You cannot escape the need to accurately map all your data assets, monitor their quality and guarantee their reliability. Data lineage can help you accomplish this mission. Here are some explanations.

To know what data you use, what it means, where it comes from, and how reliable it is throughout its life cycle, you need a holistic view of everything that is likely to transform, modify, or alter it. This is exactly the mission of data lineage, a data analysis technique that allows you to follow the path of data from its source to its final use. A technique that has many benefits!

Benefit #1: Improved data governance

 

Data governance is a key issue for your business and for ensuring that your data strategy can deliver its full potential. By following the path of data – from its collection to its exploitation – data lineage allows you to understand where it comes from and the transformations it has undergone over time to create a rich and contextualized data ecosystem. This 360° view of your data assets guarantees reliable and quality data governance.

Benefit #2: More reliable, accurate, and quality data

 

As mentioned above, one of the key strengths of data lineage is its ability to trace the origin of data. However, another great benefit is its ability to identify the errors that occur during its transformation and manipulation. Hence, you are able to take measures to not only correct these errors but also ensure that they do not reoccur, ultimately improving the quality of your data assets. A logic of continuous improvement that is particularly effective for the success of your data strategy.

Benefit #3: Quick impact analysis

 

Data lineage accurately identifies data flows, making sure an error never goes undetected for too long. The first phase is based on detailed knowledge of your business processes and your available data sources. Once critical data flows are identified and mapped, it is possible to quickly analyze the potential impacts of a given transformation on data or on a business process. With the impacts of each data transformation assessed in real time, you have all the information you need to identify the ways and means to mitigate the consequences. Visibility, traceability, reactivity – data lineage saves you precious time!

Benefit #4: More context to the data

 

As you probably understood by now, data lineage continuously monitors the course of your data assets. Therefore, beyond the original source of the data, you have full visibility of the transformations that have been applied to the data throughout its journey. This visibility also extends to the use that is made of the data within your various processes or through the applications deployed in your organization. This ultra-precise tracking of the history of interactions with data allows you to give more context to data in order to improve data quality, facilitate analysis and audits, and make more informed decisions based on accurate and complete information.

Benefit #5: Build (even more!) reliable compliance reports

 

The main expectations of successful regulatory compliance are transparency and traceability. This is the core value promise of data lineage. By using data lineage, you have all the cards in your hand to reduce compliance risks, improve data quality, facilitate audits and verifications, and reinforce stakeholders’ confidence in the compliance reports produced.

Enabling Data Literacy: 5 Ways a Data Catalog is Key

Enabling Data Literacy: 5 Ways a Data Catalog is Key

In today’s data-driven world, organizations from all industries are collecting vast amounts of data from various sources, including IoT, applications, and social media. This data explosion has created new opportunities for businesses to gain insights into their operations, customers, and markets. However, these opportunities can only be realized if organizations have a data-literate workforce that can understand and use data effectively.

Indeed, data literacy refers to the ability to read, understand, analyze, and interpret data. It is a crucial skill for individuals and organizations to stay competitive and make data-driven decisions. In fact, according to a recent study by Accenture, organizations that prioritize data literacy are more likely to be successful in their digital transformation initiatives.

To enable data literacy, organizations need to provide their employees with easy access to high-quality data that is well-organized, well-documented, and easy to use. This is where a data catalog comes in.

In this article, discover the 5 ways a data catalog enables successful data literacy in organizations.

A quick definition of data catalog

 

At Zeenea, we define a data catalog as being an organized inventory of an organization’s data ecosystem that provides a searchable interface to find, understand, and trust their data.

Indeed, created to unify all enterprise data, a data catalog enables data managers and users to improve productivity and efficiency when working with their data. In 2017, Gartner declared data catalogs as “the new black in data management and analytics”. And in “Augmented Data Catalogs: Now an Enterprise Must-Have for Data and Analytics Leaders” they state: “The demand for data catalogs is soaring as organizations continue to struggle with finding, inventorying and analyzing vastly distributed and diverse data assets.”

A data catalog is therefore a crucial component in an organization’s data literacy journey. Let’s see how.

#1 A Data Catalog centralizes all data into a single source of truth

 

A data catalog automatically collects and updates metadata about all enterprise data from across your various sources into a single repository, helping create a comprehensive view of an organization’s data landscape. By indexing your organization’s metadata, data catalogs increase data visibility and enable data and business users to easily find their information across multiple systems.

This boosts data literacy for organizations as data catalogs help break down silos between different departments and teams by providing a single, searchable repository of all available data assets. Indeed, with a data catalog, no technical expertise is required to access and understand a company’s data ecosystem: organizations can easily collaborate and share their information assets in a single platform.

#2 A Data Catalog increases data knowledge via documentation capabilities

 

Data catalogs increase enterprise-wide data knowledge through automated documentation capabilities. Documentation features let data producers provide descriptive information about their data assets, such as their purpose, usage, and relevance to business processes. With comprehensive documentation capabilities in a data catalog, data users can easily understand and use data assets, ultimately promoting data literacy across the organization.

By ensuring that documentation is accurate, consistent, and up-to-date, organizations with data catalogs can reduce the risk of data errors and inconsistencies. This leads to more reliable data, which is essential for informed decision-making and better business outcomes.

#3 A Data Catalog provides powerful data discoverability

 

Data discovery is the process of exploring and analyzing data in order to gain insights and uncover hidden patterns or relationships. This must-have data catalog feature promotes data literacy by providing users with a better understanding of the data they are working with and encouraging them to ask questions and explore the data in more depth.

With data discoverability features, a data catalog helps users identify patterns and trends in the data. By visualizing data in different ways, users can identify correlations, outliers, and other patterns that may not be immediately apparent in raw data. This can help users to gain new insights and develop a deeper understanding of the data they are working with.

#4 A Data Catalog provides a common data vocabulary via a Business Glossary

 

A business glossary is a key component of a data catalog that provides a common language and understanding of business terms and definitions across the organization. A business glossary defines the meaning of key business terms and concepts, which enables data users to understand the context and relevance of the data they are working with.

This, in turn, promotes data literacy across the organization. Data catalogs, therefore, help data teams avoid data misunderstandings and maximize trust in enterprise data.

#5 A Data Catalog provides powerful lineage features

 

Data lineage provides a clear understanding of the origin and transformation of data, which is essential for understanding how data is used and how it relates to other data assets. This information is essential for data management initiatives, as it helps to ensure data accuracy, reliability, and compliance.

By tracing data from its source to its destination, data lineage boosts data literacy by providing users with information about the purpose of the data, the business processes that use the data, and the dependencies that exist between different data assets. This information can help users to understand the relevance and importance of the data they are working with, and how it fits into the broader context of the organization. Data lineage can also help identify any anomalies, inconsistencies, or data quality issues that may affect the accuracy or reliability of the data.

Conclusion

 

In conclusion, data catalogs are a powerful tool for promoting data literacy within organizations. By centralizing data and metadata, providing access to data lineage information, and offering data discovery capabilities, data catalogs can make it easier for users to find and understand the data they work with, and are key for a data literate organization!

Don’t let these 4 Data Nightmares scare you – Zeenea is here to help

Don’t let these 4 Data Nightmares scare you – Zeenea is here to help

You wake up with your heart pounding. Your feet are trembling – just moments ago you were being chased by thousands of scary, poor-quality, inaccurate data points from your sources. As data professionals, we’ve all been there. And Data Nightmares feel all too real while experiencing them.

No worries – Zeenea is here to help! In this article, discover the most common data nightmares and how our data discovery platform acts as a dream catcher for your data terrors.

Nightmare #1 – Data is stuck in silos

 

You have reports to build, yet the information you seek is locked away, inaccessible, and gatekept by scary bodyguards. Moreover, the people who have the key are unknown or, worse, gone from the organization – making it impossible for you to access the data you need for your business use cases!

How Zeenea wakes you up: Our platform provides a single source of truth for your enterprise information – it centralizes and unifies your metadata from all your various sources, and makes it available to everyone in the organization. With Zeenea, data knowledge is no longer limited to a group of experts, boosting collaboration, increasing productivity, and maximizing data value.

Discover our Data Catalog

Nightmare #2 – Data is unreliable

 

You’re looking through your enterprise data assets and you don’t like what you see. The data is duplicated (even tripled or quadrupled), incomplete – or empty – and obsolete, and you don’t even know where it comes from or what it is linked to… The nightmare? The long hours of data documentation that are waiting for you.

How Zeenea wakes you up: For data managers to always deliver complete, trustworthy, and quality information to their teams, Zeenea provides flexible and adaptive metamodel templates for predefined and custom data assets. Automatically import or build your assets’ documentation templates by simply dragging and dropping the properties, tags, and other fields that need to be documented for your business use cases.

⭐️ Bonus: Documentation templates can be modified whenever you want – Zeenea automatically updates existing templates with your modifications, saving you time on your documentation initiatives.

Discover our data documentation app

Nightmare #3 – Data is misunderstood

 

You were asked to find trends and patterns in order to offer more personalized experiences for your customers. However, when searching for your information, you come across multiple terms… which one is it? The people in the sales department use the term ‘client’, the Customer Success teams use ‘customer’, but over in IT they employ the term ‘user’. Without a clear business vocabulary, you are kept in the dark about your data!

How Zeenea wakes you up: Our Business Glossary enables the creation and sharing of a consistent data language across all people within the organization. Easily import or create your enterprise business terms, add a description, tags, associated contacts, and any other properties that are relevant to your use cases. Our unique Business Glossary features provide a unique place for data managers to create their categories of semantic concepts, organize them in hierarchies, and configure the way glossary items are mapped with technical assets.

Discover our Business Glossary

Nightmare #4 – Data is not compliant

With the increasing amount of data regulations that are being imposed, data security and governance initiatives have become a major priority for data-driven enterprises. Indeed, the consequences of non-conformity are very severe – large fines, reputational damage… enough to keep you from sleeping well at night.

How Zeenea wakes you up: Zeenea guarantees regulatory compliance by automatically identifying, classifying, and managing personal data assets at scale. Through smart recommendations, our platform detects personal information and gives suggestions on which assets to tag – ensuring that information on data policies and regulations is well communicated to all data consumers within the organization in their daily activities.

Discover how we support Data Compliance

Start the data journey of your dreams with Zeenea!

If you’re interested in Zeenea for your data initiatives, contact us for a 30-minute personalized demo with one of our data experts.

The state of data access in data-driven enterprises – BARC Data Culture Survey 23

The state of data access in data-driven enterprises – BARC Data Culture Survey 23

Zeenea is a proud sponsor of BARC’s Data Culture Survey 23. Get your free copy here.

In last year’s BARC Data Culture Survey 22, “data access” was selected as the most relevant aspect of BARC’s ‘Data Culture Framework’. Therefore, this year, BARC examined the current status, experiences, and plans of companies with regard to their efforts to create a positive data culture with a special emphasis on ‘data access’.

The study was based on the findings of a worldwide online survey conducted in July and August 2022. The survey was promoted within the BARC panel, as well as via websites and newsletter distribution lists. A total of 384 people took part, representing a variety of different roles, industries, and company sizes.

In this article, discover the findings regarding enterprise data access of BARC’s Data Culture Survey 23.

”Right to know” versus “Need to know” principles

53% of best-in-class* companies rely on the Right to know principle. But only 24% of laggards concur.

In their study, BARC describes two principles that can be observed regarding data access: Need to know refers to a more restrictive approach, where users must ask for authorization to access data. In contrast, the Right to know model refers to the propagation of a data democracy, where data access is free for all employees, limited only by intentionally restricted data (e.g., secret, personal, or similar data).

The Need to know approach has always been the predominant model for data access, with 63 percent of participants confirming that this approach prevails in their organization. However, significantly more than half of the sample consider the Right to know to be the most beneficial model.

For many respondents, however, there is still a significant gap between their wishes and reality. Right to know is practiced mainly by small companies. This is not surprising due to their simple and flat organizational structures and straightforward communication channels. In fact, BARC found that as the size of a company increases, so does its organizational complexity and the demands on data governance. The Need to know principle tends to prevail in this case.

Companies that predominantly practice the Right to know principle believe that they generate greater benefits from data than companies adopting Need to know. For example, they report a much higher rate of achievement when it comes to gaining a competitive advantage, preserving market position, and growing revenue.

Need To Know Versus Right To Know Barc Data Culture Survey

The technologies & tools associated with data access

It is no secret: data access requires technical support. According to BARC, around two-thirds of the companies surveyed use traditional data warehousing and BI technologies, 69 percent use Excel, and 51 percent use self-service analytics tools. These figures aren’t surprising if the objective is to solve these challenges with existing enterprise tools.

It is worth mentioning that 32 percent use code to manage data access, which corresponds to BARC’s general market perception that languages such as Python are gaining a stronger foothold in the enterprise data landscape.

In turn, the need for transparency – being able to find data, features, and algorithms in an uncomplicated manner and to integrate them securely – is also increasing. This provides the breeding ground for software providers to offer new solutions that help manage and monitor code in a controlled process.

Technologies And Concepts Used In Organization Barc Data Culture Survey

The survey shows that there is a great deal of catching up to do in terms of technologies for data access! Fewer than 25 percent of the companies surveyed use data intelligence platforms or data catalogs. However, it is precisely these solutions that help to compile knowledge about data outside of the BI context, across systems, and make it analyzable, thus addressing the main challenges to data access.

The importance of data knowledge has been recognized above all by best-in-class* companies. 58 percent use data intelligence platforms, compared with only 19 percent of laggards*.

Laggards Versus Best In Class Technologies Used Barc Data Culture Survey

The lack of competence in new technologies

Of course, technology is only half the solution to data access problems. As mentioned in a previous article, many challenges have their origin in a lack of strategy or organization. The added value of technologies for increasing data access is limited: only just over half of companies succeed in improving data access through BI and data warehouse technologies, and only one in three manage it with self-service analytics tools.

Data virtualization tools, data intelligence platforms, and data catalogs play a remarkable role in the technical support of data access. These tools can clearly add value, but BARC states that there is probably a lack of knowledge and training to be able to use them extensively.

Indeed, 39 percent of respondents complain about a lack of skills as the second most common obstacle to data access!

Liberalize data access & empower your data users through a strong data culture

If you’re interested in learning more about the findings of BARC’s Data Culture survey 23 & the importance of democratizing data access, download the document for free!

By downloading the survey, get insights on:

 

  • The assessment of the data access philosophies,
  • The effects of the implementation of a data culture,
  • The challenges of implementing data access,
  • And much more.

* The sample was divided into ‘best-in-class’ and ‘laggards’ in order to identify differences in terms of the current data culture within organizations. This division was made based on the question “How would you rate your company’s data culture compared to your main competitors?”. Companies that have a much better data culture than their competitors are referred to as ‘best-in-class’, while those who have a slightly or much worse data culture than their competitors are classed as ‘laggards’.

How does a Data Catalog reinforce the 4 fundamental principles of Data Mesh?

How does a Data Catalog reinforce the 4 fundamental principles of Data Mesh?

Introduction: what is data mesh?

As companies are becoming more aware of the importance of their data, they are rethinking their business strategies in order to unleash the full potential of their information assets. The challenge of storing the data has gradually led to the emergence of various solutions: data marts, data warehouses and data lakes, to enable the absorption of increasingly large volumes of data. The goal? To centralize their data assets to make them available to the greatest number of people to break down company silos.

But companies are still struggling to meet business needs. The speed of data production and transformation and the growing complexity of data (nature, origin, etc.) are straining the scalability of such a centralized organization. This centralized data evolves into an ocean of information in which central data management teams cannot respond effectively to the demands of the business – only a few expert teams can.

This is even more true in a context where companies are the result of mergers, takeovers, or are organized into subsidiaries. Building a common vision and organization between all the entities can be complex and time-consuming.

With this in mind, Zhamak Dehghani developed the concept of “Data Mesh”, proposing a paradigm shift in the management of analytical data, with a decentralized approach.

Data Mesh is indeed not a technological solution but rather a business goal, a “North Star” as Mick Lévy calls it, that must be followed to meet the challenges facing companies in the current context:

  • Respond to the complexity, volatility, and uncertainty of the business
  • Maintain agility in the face of growth
  • Accelerate the production of value, in proportion to the investment

 

How the Data Catalog facilitates the implementation of a Data Mesh approach

The purpose of a data catalog is to map all of the company’s data and make it available to technical & business teams in order to facilitate their exploitation, collaboration around their uses and thus, maximize and accelerate the creation of business value.

In an organization like Data Mesh, where data is stored in different places and managed by different teams, the challenge of a data catalog is to ensure a central access point to all company data resources.

But to do this, the data catalog must support the four fundamental principles of the Data Mesh which are:

  • Domain-driven ownership of data,
  • Data as a product,
  • Self-serve data platform
  • Federated computational governance

Domain ownership

The first principle of Data Mesh is to decentralize responsibilities around data. The company must first define business domains, in a more or less granular way, depending on its context and use cases (e.g. Production, Distribution, Logistics, etc.).

Each domain then becomes responsible for the data it produces, and each gains autonomy to manage and derive value from growing volumes of data more easily. Data quality notably improves, as business expertise is leveraged as close to the source as possible.

This approach calls into question the relevance of a centralized Master Data Management system offering a single model of the data – exhaustive, but consequently complex for data consumers to understand and difficult to maintain over time.

Business teams can rely on the Data Catalog to create an inventory of their data and describe their business perimeter through a model oriented by the specific uses of each domain.

This modeling must be accessible through a business glossary associated with the data catalog. This business glossary, while remaining a single source of truth, must allow the different facets of the data to be reflected according to the uses and needs of each domain.

For example, if the concept of “product” is familiar to the entire company, its attributes will not be of the same interest if it is used for logistics, design or sales.

A graph-based business glossary will therefore be more appropriate, because of the flexibility and the modeling and exploration capabilities it offers compared to a predefined hierarchical approach. While ensuring the overall consistency of this semantic layer across the enterprise, a graph-based business glossary allows data managers to better take into account the specificities of their respective domains.
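As a purely illustrative sketch of this design choice, the snippet below models a tiny glossary as a graph with networkx: the same “Product” concept is linked to several domains, each with its own attributes of interest – something a single rigid hierarchy would struggle to express. All names and relationships are invented for the example.

import networkx as nx

# A tiny, invented glossary modeled as a directed graph rather than a fixed hierarchy.
glossary = nx.DiGraph()

# One shared concept...
glossary.add_node("Product", kind="glossary_term")

# ...linked to several domains, each caring about a different facet of the concept.
for domain, attribute in [
    ("Logistics", "package dimensions"),
    ("Design", "bill of materials"),
    ("Sales", "list price"),
]:
    glossary.add_node(domain, kind="domain")
    glossary.add_edge("Product", domain, relation="used_by")
    glossary.add_edge(domain, attribute, relation="cares_about")

# Each domain can explore "Product" from its own angle.
print(list(glossary.successors("Product")))  # ['Logistics', 'Design', 'Sales']
print(list(glossary.successors("Sales")))    # ['list price']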

The data catalog must therefore enable the various domains to collaborate in defining and maintaining the metamodel and the documentation of their assets, in order to ensure their quality.

To do this, the data catalog must also offer a suitable permission management system, so that responsibilities can be divided up unambiguously and each domain manager can take charge of the documentation of their scope.

Data as a product

The second principle of the Data Mesh is to think of data not as an asset but as a product with its own user experience and lifecycle. The purpose is to avoid recreating silos in the company due to the decentralization of responsibilities.

Each domain is responsible for making one or more data products available to other domains. But beyond this company objective, thinking of data as a product means taking an approach centered on the expectations and needs of end users: Who consumes the data? In what format(s) do they use it? With what tools? How can we measure user satisfaction?

Indeed, with a centralized approach, companies respond to the needs of business users more slowly and struggle to scale up. Data Mesh therefore contributes to the diffusion of a data culture by reducing the steps needed to exploit the data.

According to Zhamak Dehghani, a data product should meet different criteria, and the data catalog enables to meet some of them:

Discoverable: The first step for a data analyst, data scientist, or any other data consumer is to know what data exists and what types of insights they can exploit. The data catalog addresses this issue through an intelligent search engine that supports keyword search, tolerates typos and syntax errors, and offers smart suggestions and advanced filtering capabilities. The data catalog must also offer personalized exploration paths to better promote the various data products. Finally, the search and navigation experience in the catalog must be simple and based on market standards such as Google or Amazon, in order to facilitate the onboarding of non-technical users.

Understandable: Data must be easily understood and consumed. It is also one of the missions of the data catalog: to provide all the context necessary to understand the data. This includes a description, associated business concepts, classification, relationships with other data products, etc. Business areas can use the data catalog to make consumers as autonomous as possible in understanding their data products. A plus would be integration with data tools or sandboxes to better understand the behavior of the data.

Trustworthy: Consumers need to trust the data they use. Here again, the data catalog plays an important role. A data catalog is not a data quality tool, but quality indicators must be retrieved and updated automatically in the data catalog in order to expose them to users (completeness, update frequency, etc.). The data catalog should also be able to provide statistical information on the data and reconstruct its lineage, to understand the origin of the data and its various transformations over time.

Accessible natively: A data product should be delivered in the format expected by the different personas (data analysts, data scientists, etc.). The same data product can therefore be delivered in several formats, depending on the uses and skills of the targeted users. It should also be easy to interface with the tools they use. On this point, however, the catalog has no particular role to play.

Valuable: One of the keys to the success of a data product is that it can be consumed independently, that it is meaningful in itself. It must be designed to limit the need to make joins with other data products, in order to deliver measurable value to its consumers.

Addressable: Once the consumer has found the data product they need in the catalog, they must be able to access it or request access to it in a simple, easy and efficient way. To do so, the data catalog must be able to connect with policy enforcement systems that facilitate and accelerate access to the data by automating part of the work.

Secure: This point is related to the previous one. Users must be able to access data easily but securely, according to the policies set up for access rights. Here again, the integration of the data catalog with a policy enforcement solution facilitates this aspect.

Interoperable: In order to facilitate exchanges between domains and, once again, avoid silos, data products must meet standards defined at the enterprise level so that any type of data product can be easily consumed and integrated with others. The data catalog must be able to share each data product’s metadata through APIs in order to interconnect domains.
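
As a purely illustrative sketch of this interconnection (the endpoint, payload fields, and use of the requests library are assumptions, not Zeenea’s actual API), a domain could publish a small metadata document describing one of its data products so that other domains, or the catalog, can discover it:

```python
import requests  # assumed HTTP client; any equivalent would do

# Hypothetical metadata document published by the "sales" domain for one data product.
data_product_metadata = {
    "name": "customer-orders",
    "domain": "sales",
    "owner": "sales-data-team@example.com",
    "output_ports": [{"format": "parquet", "location": "s3://sales/orders/"}],
    "quality": {"completeness": 0.98, "freshness_hours": 24},
}

# Push it to an assumed catalog endpoint so other domains can find and reuse it.
response = requests.post(
    "https://catalog.example.com/api/v1/data-products",
    json=data_product_metadata,
    timeout=10,
)
response.raise_for_status()
```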

Self-serve data infrastructure

In a Data Mesh organization, the business domains are responsible for making data products available to the entire company. To achieve this objective, the domains must have services that facilitate this implementation and automate management tasks as much as possible. These services must make the domains as independent as possible from the infrastructure teams.

In a decentralized organization, this service layer also helps reduce costs, especially those related to the workload of data engineers, a resource that is difficult to find.

The data catalog is part of this abstraction layer, allowing business domains to easily inventory the data sources for which they are responsible. To do this, the catalog must itself offer a wide range of connectors that support the various technologies used by the domains (storage, transformation, etc.) and automate curation tasks as much as possible.

Via easy-to-use APIs, the data catalog also enables domains to easily synchronize their business or technical repositories, connect their quality management tools, etc.
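
For example, a domain could push quality indicators computed by its own tooling into the catalog through such an API. The endpoint, payload, and library call below are assumptions offered as a sketch, not Zeenea’s actual interface:

```python
import requests  # assumed HTTP client

# Hypothetical quality indicators produced by the domain's own data quality tool.
quality_indicators = {
    "dataset": "sales.orders",
    "completeness": 0.97,
    "last_refresh": "2024-04-01T06:00:00Z",
}

# Push them to an assumed catalog endpoint so they appear on the asset's page.
response = requests.put(
    "https://catalog.example.com/api/v1/datasets/sales.orders/quality",
    json=quality_indicators,
    timeout=10,
)
response.raise_for_status()
```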

Federated computational governance

Data Mesh offers a decentralized approach to data management where domains gain some sovereignty. However, the implementation of a federated governance ensures the global consistency of governance rules, the interoperability of data products and monitoring at the scale of the Data Mesh.

The Data Office acts more as a facilitator, transmitting governance principles and policies, than as a controller. Indeed, the CDO is no longer responsible for quality or security but responsible for defining what constitutes quality, security, etc. The domain managers take over locally for the application of these principles.

This paradigm shift is possible via the automation of the application of governance policies. The application of these policies is thus accelerated compared to a centralized approach because it is done as close to the source as possible.

The data catalog can be used to share governance principles and policies that can be documented or listed in the catalog, and linked to the data products to which they apply. It will also provide metadata to the systems responsible for automating the setting up of the rules and policies.

Conclusion

In an increasingly complex and changing data environment, Data Mesh provides an alternative socio-architectural response to centralized approaches that struggle to scale and meet business needs for data quality and responsiveness.

The data catalog plays a central role in this organization: it provides a central access portal for the discovery and sharing of data products across the enterprise, enables business domains to easily manage their data products, and delivers the metadata needed to automate the policies required for federated governance.

The traps to avoid for a successful data catalog project – Technical integration

Metadata management is an important component in a data management project and it requires more than just the data catalog solution, however connected it may be.

A data catalog tool will of course reduce the workload but won’t in and of itself guarantee the success of the project.

In this series of articles, discover the pitfalls and preconceived ideas that should be avoided when rolling out an enterprise-wide data catalog project. The traps described in this series are organized around 4 central themes that are crucial to the success of the initiative:

  1. Data culture within the organization
  2. Internal project sponsorship
  3. Project leadership
  4. Technical integration of the Data Catalog

Integrating the data catalog into the enterprise ecosystem will provide opportunities to create value. It is essential to consider these aspects and understand the potential rewards.

Not all metadata has to be entered manually

More and more systems produce, aggregate, and allow metadata to be entered for local use. This information has to be retrieved and consolidated in the catalog without being entered twice, for obvious reasons (cost savings, data reliability, and availability).

The data catalog therefore presents an opportunity to consolidate this information with the knowledge of the contributors in their respective fields. However, this consolidation should be achieved through technical integration rather than manual effort: entering the same information twice is clearly inefficient, and carrying out imports and exports between systems by hand is no better.

The strength of a data catalog remains its capacity to ingest metadata via technical integration chains and thus ensure a robust synchronization between systems.

The data catalog isn’t an “automagical” tool

On the flip side, thinking that a data catalog can extract all types of metadata regardless of their source or format would be misleading.

The catalog should of course facilitate metadata retrieval, but some metadata won’t be retrievable automatically. There will therefore always be a cost linked to the intervention of the contributors.

The first reason for this resides in the origin of some metadata: some information may simply not be present in the systems because it originates solely from the knowledge of experts. The data catalog is therefore, in this case, a potential candidate for becoming the master system and eligible to receive this information.

And conversely, some information can be present in a system and yet be impossible to retrieve in an automated manner, for many reasons. For example, there may be no interface that provides stable access to the information. The risk of producing noise around the information is therefore high, which can degrade the quality of the catalog content and ultimately put users off using it.

The data catalog must not be connected to a single metadata source

Metadata stems from many varied layers. As a result, there are multiple and complementary sources involved for a global understanding. It is precisely the reconciliation of this information in a central solution, a data catalog, that will provide the necessary elements to the users.

Opting for a connected data catalog is a real asset, because asset discovery and the associated metadata retrieval are made considerably easier as a result of automation.

This connectivity can also extend to other complementary systems. These systems can potentially come before or after the first one, enabling, if needed, the materialization of the lineage and thus documenting the flows and transformations between systems.

The systems can also be independent of one another and simply allow, by their addition to the catalog, an exhaustive map of the company’s data estate.

Lastly, given the variety of the types of assets that can be documented in the catalog, the different connected sources can also contribute to the enrichment of a specific universe in the data catalog: semantic layers for some, physical layers for others, etc.

With an iterative approach in mind, the multiple sources that feed the data catalog should be integrated progressively, following a value-driven strategy under the overall supervision of the Data Office.

The 10 Traps to Avoid for a Successful Data Catalog Project

To learn more about the traps to avoid when starting a data cataloging initiative, download our free eBook!


The traps to avoid for a successful data catalog project – Data Culture

Metadata management is an important component in a data management project and it requires more than just the data catalog solution, however connected it may be.

A data catalog tool will of course reduce the workload but won’t in and of itself guarantee the success of the project.

In this series of articles, discover the pitfalls and preconceived ideas that should be avoided when rolling out an enterprise-wide data catalog project. The traps described in this series are organized around 4 central themes that are crucial to the success of the initiative:

  1. Data culture within the organization
  2. Internal project sponsorship
  3. Project leadership
  4. Technical integration of the Data Catalog

Organizations whose sole product is data are very rare. While data is everywhere, it is often only a byproduct of the company’s activities. It is therefore not surprising that some employees are not fully aware of its importance. Indeed, data culture isn’t innate, and a lack of awareness of the importance of data can become a major obstacle to a successful data catalog deployment.

Let’s illustrate this with a few common preconceptions. 

Not all employees are aware of what is at stake with metadata management

The first obstacle is probably the lack of a global understanding of the initiative. Emphasizing the importance of metadata management to colleagues who still misunderstand the crucial role the actual data can play in an organization is doomed to fail.

It’s quite likely that a larger program that includes an awareness initiative emphasizing the stakes around enterprise data management will have to be set up. The most important element to inculcate is probably the fact that data is a common good, meaning that the owners of a dataset have the duty to make it visible and understandable to all stakeholders and colleagues.

Indeed, one of the most common obstacles in a metadata management initiative is the resistance to the effort needed to produce and maintain documentation. This is all the more of an issue when it is felt that the potential users targeted are limited to a small group of people who already fully understand the subject. When it is understood that the target group is in fact much larger (the entire organization and potentially all staff), it becomes obvious that this knowledge has to be recorded in a “scalable” manner.

A data catalog doesn’t do everything

A data culture-related issue can also affect those in charge of the project, although this is less common. An inaccurate understanding of the tools and their use can lead to mistakes and cause suboptimal, even detrimental, choices. The data catalog is a central software component for metadata management but it’s likely not the only tool used. It is therefore not advisable to try and do everything just with this tool. This may sound obvious but in practice, it can be difficult to identify the limits beyond which it is necessary to bring a more specialized solution into the mix.

The data catalog is the keystone to documentation and has to be the entry point for any collaborator with questions related to a concept linked to data. However, this doesn’t make it “the solution” in which everything has to be found. This nuance is important because referencing or synthesizing information doesn’t necessarily mean carrying this information wholesale.

Indeed, there are many subjects that come up during the preparation phases of a metadata management project: technical or functional modeling, management of data access rights, workflows for access requests, etc. All these topics are important, carry value, and are linked to data. However, they are not specifically destined to be managed by the solution that documents your assets.

It is therefore important to begin by identifying these requirements, defining a response strategy, and then integrating this tooling in an ecosystem larger than just the data catalog.

The 10 Traps to Avoid for a Successful Data Catalog Project

To learn more about the traps to avoid when starting a data cataloging initiative, download our free eBook!


Why Zeenea chose a Privacy by Design approach for its data catalog?

Since the beginning of the 21st century, we’ve been experiencing a true digital revolution. The world is constantly being digitized and human activity is increasingly structured around data and network services. The manufacturing, leisure, administration, service, medical, and so many other industries are now organized around complex and interconnected information systems. As a result, more and more data is continuously collected by the devices and technologies present in our daily lives (Web, Smartphone, IoT) and transferred from system to system. It has become central for any company that provides products or services to do everything possible to protect the data of their customers. The best approach to do so is through Privacy by Design.

In this article, we explain what Privacy by Design is, how we applied this approach in the design of our data catalog, as well as how a data catalog can help companies implement Privacy by Design. 

 

Data protection: a key issue for enterprises

Among all the various data mentioned above, some allow the direct or indirect identification of natural persons. These are known as personal data, as defined by the CNIL. Personal data is of paramount importance in the modern world because of its intrinsic value.

On a daily basis, huge volumes of personal data pass between individuals, companies, and governments. There is a real risk of their misuse, as the Cambridge Analytica scandal showed, for example. Cybercriminals can also make substantial gains from it, via account hacking, reselling data to other cybercriminals, identity theft, or attacking companies via phishing or CEO fraud (so-called “president scams”). For example, a real estate developer was recently robbed of several tens of millions of euros in France.

The need to protect data has never been so important.

States have quickly become aware of this issue and of the need to protect individuals from the abuses related to the exploitation of their data. In Europe, for example, the GDPR (the General Data Protection Regulation) was adopted in 2016, has applied since 2018, and is already well established in the daily activities of companies. In the rest of the world, regulations are constantly evolving and are a concern for nearly every country. Recently, California passed a consumer data privacy law, a U.S. equivalent of the GDPR. Even China has just legislated on this topic.

 

Privacy by Design: defining a key concept for data protection

While many legislations rely heavily on the notion of Privacy by Design, it was conceptualized by Ann Cavoukian in the late 1990s when she was the Information and Privacy Commissioner of the Province of Ontario in Canada. The essence of this idea is to include the issue of personal data protection right from the design of a computer system. 

In this sense, Privacy by Design lists seven fundamental principles:

#1 – Proactivity: any company must put in place the necessary provisions for data protection upstream, and must not rely on a reactive policy;

#2 – Personal data protection as a default setting: any system must take as a default setting the highest possible level of protection for the sensitive data of its users;

#3 – Privacy by design: privacy should be a systematically studied and considered aspect of the design and implementation of new functionality;

#4 – Full functionality: no compromise should be made with security protocols or with the user experience;

#5 – End-to-end security: the system must ensure the security of data throughout its lifecycle, from collection to destruction (including if the data is outsourced);

#6 – Visibility and transparency: the system and the company must document and communicate personal data protection procedures and actions taken in a clear, consistent and transparent manner;

#7 – Respect for user privacy: every design and implementation decision must be made with the user’s interest at the center.

 

The application of Privacy by Design at Zeenea

At Zeenea, and particularly because the company was created in 2017, we’ve built our product on the foundations of Privacy by Design.

The treatment of users’ personal data

First of all, we have anchored data protection at the heart of our architecture. Each customer’s data is segregated into a dedicated tenant, each encrypted with its own key. User authentication is managed through a specialized third-party system. We encourage identity federation among our customers, which allows them to maintain control over the data needed for user identification and authentication.

We have also embedded the concept of Privacy by Design in the design of our application. For example, we collect only the bare minimum of information, and all system outputs (logs, application errors, APIs) are anonymized.
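
As a purely illustrative sketch of the idea of anonymized outputs (not a description of Zeenea’s actual implementation), personal identifiers could be masked before they ever reach application logs:

```python
import logging
import re

# Pattern for one common kind of personal identifier: email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class AnonymizingFilter(logging.Filter):
    """Logging filter that masks email addresses before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_PATTERN.sub("<redacted-email>", str(record.msg))
        return True

logger = logging.getLogger("app")
logger.addFilter(AnonymizingFilter())
logging.basicConfig(level=logging.INFO)

# "jane.doe@example.com" is written to the log as "<redacted-email>".
logger.info("User jane.doe@example.com failed to authenticate")
```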

Processing customer business data

Since our main mission is to document data, our solution essentially contains metadata. By design, Zeenea does not extract any data from our customers’ systems. Indeed, the risk associated with metadata is intrinsically lower than with the data itself.

Nevertheless, Zeenea offers several features that provide information on the data present in client systems (statistics, sampling, etc.). Because of our architecture, the calculations are always done on the client’s infrastructure, as close as possible to the data and its security. And in compliance with principle #2 of Privacy by Design, we have set the protection of personal data as a default setting: all these features are disabled by default and can only be activated by the customer.

 

How our data catalog helps companies implement Privacy by Design

Our data catalog can help your company implement Privacy by Design, especially on the control and verification aspects. Taking the 7 principles described earlier, the data catalog can effectively participate in two of them: the visibility and transparency principle, and the end-to-end security principle. The data catalog also enables the automation of the identification of sensitive data.

Visibility and transparency via the data catalog

The objective of a data catalog is to centralize a company’s data assets, document them, and share them with as many people as possible. This centralization allows each employee to know what data is collected by the CRM, and the marketing and customer success teams to process this information in the acquisition and churn tracking reports.

Once this inventory has been established, the catalog can be used to document certain additional information that is necessary for the company’s proper functioning. This is notably the case for the sensitive or non-sensitive nature of the documented information, the rules of governance, the processing, or the access procedures that must be applied. 

In the context of a Privacy by Design approach, the data catalog can be used to add a business term corresponding to sensitive data (a social security number, a telephone number, etc.). This business term can then be easily associated with the tables or physical fields that contain the data, thus allowing its easy identification. This initiative contributes to the principle of visibility and transparency of Privacy by Design.
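
A minimal sketch of this linkage, using hypothetical term and field names rather than Zeenea’s actual model, would simply map each sensitive business term to the physical fields that implement it:

```python
# Hypothetical glossary of sensitive business terms and the physical fields linked to them.
sensitive_terms = {
    "Social Security Number": ["crm.customers.ssn", "hr.employees.national_id"],
    "Phone Number": ["crm.customers.phone"],
}

def fields_holding_sensitive_data(glossary: dict[str, list[str]]) -> set[str]:
    """Return every physical field linked to at least one sensitive business term."""
    return {field for fields in glossary.values() for field in fields}

# All fields that should inherit the governance rules attached to sensitive data.
print(fields_holding_sensitive_data(sensitive_terms))
```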

End-to-end security via the data catalog

The data catalog also provides data lineage capabilities. Automatic data lineage makes it possible to verify that the processes applied to data identified as sensitive comply with the company’s data governance. It is then simple, with the data catalog, to fill in the governance rules to be applied to sensitive data.

Moreover, the lineage allows us to follow the whole life cycle of the data, from its creation to its final use, including its transformations. This makes it easy to check that all the stages of this life cycle comply with the rules and correct any errors. 
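
As a minimal sketch of this idea, with a hypothetical lineage graph represented as a dictionary (not Zeenea’s internal model), walking downstream from a sensitive source lists every asset that should inherit its governance rules:

```python
# Hypothetical lineage: each asset maps to the assets built from it (downstream).
lineage = {
    "crm.customers": ["dwh.dim_customer"],
    "dwh.dim_customer": ["bi.churn_report", "ml.churn_features"],
    "bi.churn_report": [],
    "ml.churn_features": [],
}

def downstream_assets(asset: str, graph: dict[str, list[str]]) -> set[str]:
    """Collect every asset reachable downstream of `asset` in the lineage graph."""
    reached, to_visit = set(), [asset]
    while to_visit:
        current = to_visit.pop()
        for child in graph.get(current, []):
            if child not in reached:
                reached.add(child)
                to_visit.append(child)
    return reached

# Every asset derived from a sensitive source should carry its governance rules.
print(downstream_assets("crm.customers", lineage))
```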

The data catalog, via the data lineage, thus contributes to the principle of the end-to-end security of Privacy by Design.

With that said, at Zeenea we remain convinced that a data catalog is not a compliance solution, but rather a tool for raising teams’ awareness of sensitive data and the specific ways in which it must be handled.

Identifying sensitive data via the data catalog

In a rapidly changing data environment, the data catalog must reflect reality as much as possible in order to maintain the trust of its users. Without this, the entire adoption of the data catalog project is put into question. 

At Zeenea, we are firmly convinced that the data catalog must be automated as much as possible to be scalable and efficient. This starts with the inventory of available data. Our inventory is automated and propagates any modification made in the original (source) system of the data directly into the catalog. Thus, at any time, the customer has an exhaustive list of the data present in their systems.

And to help our customers identify which of the inventoried data deserves special treatment because of its sensitive nature, the automation does not stop at the inventory. We now offer a system that suggests tagging newly inventoried data that matches a sensitive data profile. This makes it easier to bring this data to the forefront and to spread the information faster and more easily throughout the company.

For more information on the technology used at Zeenea, download our eBook “The 5 technological breakthroughs of a Data Catalog“.

 

Conclusion

In the past few years, personal data has become a real concern for most consumers. More and more countries are setting up regulations to guarantee their citizens maximum protection. One of the major principles governing all these regulations is Privacy by Design.

At Zeenea, we have from the start placed personal data at the heart of our product: in our technical development, in the processing of our users’ data, and in how the data our clients process via our catalog is handled.

We believe that a data catalog can be a significant asset in the implementation and monitoring of Privacy by Design policies. We also heavily rely on automation and AI to bring many more improvements in the upcoming months: automatic construction of technical data lineage, improved detection of sensitive data in the catalog objects to better document them, quality control of processes applied to sensitive data, etc. The possibilities are numerous. 

To learn more about the advantages of the catalog in the management of your sensitive and personal data, don’t hesitate to schedule a meeting with one of our experts:

The 5 product values that strengthen Zeenea’s team cohesion & customer experience

To remain competitive, organizations must make decisions quickly, as the slightest mistake can lead to a waste of precious time in the race for success. Defining the company’s reason for being, its direction, and its strategy makes it possible to build a solid foundation for creating alignment, subsequently facilitating decisions that impact product development. Aligning all stakeholders in product development is a real challenge for Product Managers. Yet it is an essential mission to build a successful product and an obvious prerequisite to motivate teams, who need to know why they get up each morning to go to work.

 

The foundations of a shared product vision within the company

Various frameworks (NorthStar, OKR, etc.) have been developed over the last few years to enable companies and their product teams to lay these foundations, disseminate them within the organization, and build a roadmap that creates cohesion. These frameworks generally define a few key artifacts and have already given rise to a large body of literature. Although versions may differ from one framework to another, the following concepts are generally found: 

  • Vision: the dream, the true North of a team. The vision must be inspiring and create a common sense of purpose throughout the organization.
  • Mission: the organization’s primary objective; it must be measurable and achievable.
  • Objectives: measurable short- and medium-term milestones to accomplish the mission.
  • Roadmap: a source of shared truth that describes the vision, direction, priorities, and progress of a product over time.

With a clear and shared definition of these concepts across the company, product teams have a solid foundation for identifying priority issues and effectively ordering product backlogs.

Product values: the key to team buy-in and alignment over time

Although well defined at the beginning, the concepts described above can nevertheless be forgotten after a while or become obsolete! Indeed, the company and the product evolve, teams change, and consequently the product can lose its direction… Product teams must therefore continuously revisit and re-communicate these foundations for the alignment to last.

Indeed, product development is both a sprint and a marathon! One of the main difficulties for product teams is to maintain this alignment over time. In this respect, another concept in these frameworks is often under-exploited when it is not completely forgotten by organizations: product values. 

Jeff Weiner, Executive Chairman at LinkedIn, particularly emphasized the importance of defining company values through the Vision to Values framework. LinkedIn defines values as “The principles that guide the organization’s day-to-day decisions; a defining element of your culture“. For example “be honest and constructive“, “demand excellence“, etc.

Defining product values in addition to corporate values can be a great way for product teams to create this alignment over time and this is exactly what we do at Zeenea.

 

From corporate vision to product values: a focus on Zeenea Data Catalog

Organization & product consistency at Zeenea

At Zeenea, we have a shared vision – “Be the first step of any data journey” – and a clear mission – “To help data teams accelerate their initiatives by creating a smart & reliable data asset landscape at the enterprise level“.  

We position ourselves as a data catalog pure-player and we share the responsibility of a single product between several Product Managers. This is why we have organized ourselves into feature teams. This way, each development team can take charge of any new feature or evolution according to the company’s priorities, and carry it out from start to finish.

While we prioritize the backlog and delivery by defining and adapting our strategy and organization according to our objectives, three problems remain: 

  • How do we ensure that the product remains consistent over time when there are multiple pilots onboard the plane? 
  • How do we favor one approach over another? 
  • How do we ensure that a new feature is consistent with the rest of the application? 

Indeed, each product manager has their own sensibilities and their own background. And even when the problems are clearly identified, there are usually several ways to solve them. This is where product values come into play…

Zeenea’s product values

While the vision and the mission help us answer the “why?”, the product values allow us to remain aligned on the “how?”. They are a precious tool that challenges the different possible approaches to meeting customer needs. And each Product Manager can refer to these common values to make decisions, prioritize a feature or reject it, and ensure a unified and unique user experience across the product.

Thus, each new feature is built with the following 5 product values as guides:

Simplicity

This value is at the heart of our convictions. The objective of a Data Catalog is to democratize data access. To achieve this, facilitating catalog adoption for end users is key. Simplicity is clearly reflected in the way each functionality is proposed. Many applications end up looking like Christmas trees with colored buttons all over the place that no one knows how to use; others require weeks of training before the first button is clicked. The use of the Data Catalog should not be reserved to experts and should therefore be obvious and fluid regardless of the user’s objective. This value was reflected in our decision to create two interfaces for our Data Catalog: one dedicated to search and exploration, and the other for the management and monitoring of the catalog’s documentation. 

Empowering

Documentation tasks are often time-consuming and it can be difficult to motivate knowledgeable people to share and formalize their knowledge. In the same way, the product must encourage data consumers to be autonomous in their use of data. This is why we have chosen not to offer rigid validation workflows, but rather a system of accountability. This allows Data Stewards to be aware of the impacts of their modifications. Coupled with an alerting and auditing system after the fact, it ensures better autonomy while maintaining traceability in the event of a problem.

Reassuring

It is essential to allow end-users to trust the data they consume. The product must therefore reassure the user through the way it presents its information. Similarly, Data Stewards who maintain a large amount of data need to be reassured about the operations for which they are responsible: have I processed everything correctly? How can I be sure that there are no inconsistencies in the documentation? What will really happen if I click this button? What if it crashes? The product must create an environment where the user feels confident using the tool and its content. This value translates into preventive messages rather than error reports, a reassuring choice of wording, idempotency of import operations, etc.

Flexibility  

Each client has their own business context, history, governance rules, needs, etc. The data catalog must be able to adapt to any context to facilitate its adoption. Flexibility is an essential value to enable the catalog to adapt to all current technological contexts and to be a true repository of data at enterprise level. The product must therefore adapt to the user’s context and be as close as possible to their uses. Our flat and incremental modeling is based on this value, as opposed to the more rigid hierarchical models offered on the market.

Deep Tech 

This value is also very important in our development decisions. Technology is at the heart of our product and must serve the other values (notably simplicity and flexibility). Documenting, maintaining, and exploiting the value of enterprise-wide data assets cannot be done without the help of intelligent technology (automation, AI, etc.). The choice to base our search engine on a knowledge graph or our positioning in terms of connectivity are illustrations of this “deep tech” value at Zeenea.

 

Take away

Creating alignment around a product is a long-term task. It requires Product Managers – in synergy with all stakeholders – to define from the very beginning: the vision, the mission, and the objectives of the company. This enables product management teams to effectively prioritize the work of their teams. However, to ensure the coherence of a product over time, the definition and use of product values are essential. At Zeenea, our product values are simplicity, empowerment, reassurance, flexibility, and deep tech. They are reflected in the way we design and enhance our Data Catalog and allow us to ensure a better customer experience over time.

 

If you would like to learn more about our product, or to get more information about Data Catalog:

What makes a data catalog “smart”? #5 – User Experience

A data catalog harnesses enormous amounts of very diverse information – and its volume will grow exponentially. This will raise 2 major challenges: 

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be Smart in order to answer these 2 questions, with smart technological and conceptual features that go wider than the sole integration of AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

A data catalog should also be smart in the experience it offers to its different pools of users. Indeed, one of the main challenges with the deployment of a data catalog is its level of adoption from those it is meant for: data consumers. And user experience plays a major role in this adoption.

User experience within the data catalog

The underlying purpose of user experience is the identification of personas whose behavior and objectives we are looking to model in order to provide them with a slick and efficient graphic interface. Pinning down personas in a data catalog is challenging – it is a universal tool that provides added value for any company regardless of its size, across all sectors of activity anywhere in the world.

Rather than attempting to model personas that are hard to define, it’s possible to handle the situation by focusing on the issue of data cataloging adoption. Here, there are two user populations that stand out:

  • Metadata producers who feed the catalog and monitor the quality of its content – this population is generally referred to as Data Stewards;
  • Metadata consumers who use the catalog to meet their business needs – we will call them Users.

These two groups are not totally unrelated to each other of course: some Data Stewards will also be Users.

The challenges of enterprise-wide catalog adoption

The real value of a data catalog resides in large-scale adoption by a substantial pool of (meta) data consumers, not just the data management specialists.

The pool of data consumers is very diverse. It includes data experts (engineers, architects, data analysts, data scientists, etc.), business people (project managers, business unit managers, product managers, etc.), and compliance and risk managers. More generally, all operational managers are likely to leverage data to improve their performance.

Data Catalog adoption by Users is often slowed down for the following reasons:

  • Data catalog usage is sporadic: they will log on from time to time to obtain very specific answers to specific queries. They rarely have the time or patience to go through a learning curve on a tool they will only use periodically – weeks can go by between catalog usage.

  • Not everyone has the same stance on metadata. Some will focus more on technical metadata, others will focus heavily on the semantic challenges, and others might be more interested in the organizational and governance aspects.

  • Not everybody will understand the metamodel or the internal organization of the information within the catalog. They can quickly feel put off by an avalanche of concepts that feel irrelevant to their day-to-day needs.

The Smart Data Catalog attempts to jump these hurdles in order to accelerate catalog adoption. Here is how Zeenea meets these challenges.

How Zeenea facilitates catalog adoption

The first solution is the graphic interface. The Users’ learning curve needs to be as short as possible. Indeed, the User should be up and running without the need for any training. To make this possible, we made a number of choices.

The first choice was to provide two different interfaces, one for the Data Stewards and one for the Users:

Zeenea Studio: the management and monitoring tool for the catalog content – an expert tool solely for the Data Stewards.

Zeenea Explorer: for the Users –  it provides them with the simplest search and exploration experience possible.

Our approach is aligned with the user-friendly principles of marketplace solutions – the recognized specialists in catalog management (in the general sense). These solutions usually have two applications on offer. The first, a “back office” solution, enables the staff of the marketplace (or its partners) to feed the catalog in the most automated manner possible and control its content to ensure its quality. The second application, for the consumers, usually takes the form of an e-commerce website and enables end-users to find articles or explore the catalog. Zeenea Studio and Zeenea Explorer reflect these two roles.

The information is ranked in accordance with the role of the user within the organization

Our second choice is still at the experimental stage and consists in dynamically adapting the information hierarchy in the catalog according to User profiles.

This information hierarchy challenge is what differentiates a data catalog from a marketplace type catalog. Indeed, a data catalog’s information hierarchy depends on the operational role of the user. For some, the most relevant information in a dataset will be technical: location, security, formats, types, etc. Others will need to know the data semantics and their business lineage. Others still will want to know the processes and controls that drive data production – for compliance or operational considerations.

The Smart Data Catalog should be able to dynamically adjust the structure of the information to adapt to these different perspectives.

The last remaining challenge is the manner in which the information is organized in the catalog in the form of exploration paths by theme (something similar to shelving in a marketplace). It is difficult to find a structure that agrees with everybody. Some will explore the catalog along technical lines (systems, applications, technologies, etc.). Others will explore the catalog from a more functional perspective (business domains), others still from a semantic angle (through business glossaries, etc.).

The challenge of having everyone agree on a sole universal classification seems (to us) insurmountable. The Smart Data Catalog should be adaptable and should not ask Users to understand a classification that makes no sense to them. Ultimately, user experience is one of the most important success factors for a data catalog. 


For more information on how a Smart user experience enhances a Data Catalog, download our eBook:

“What is a Smart Data Catalog?”!

What makes a data catalog “smart”? #4 – The search engine

A data catalog harnesses enormous amounts of very diverse information – and its volume will grow exponentially. This will raise 2 major challenges: 

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be Smart in order to answer these 2 questions, with smart technological and conceptual features that go wider than the sole integration of AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

A powerful search engine for an efficient exploration

Given the enormous volumes of data involved in an enterprise catalog, we consider the search engine the principal mechanism through which users can explore the catalog. The search engine needs to be easy to use, powerful, and, most importantly, efficient – the results must meet user expectations. Google and Amazon have raised the bar very high in this respect and the search experience they offer has become a reference in the field. 

This second to none search experience can be summed up thus:

  • I write a few words in the search bar, often with the help of a suggestion system that offers frequent associations of terms to help me narrow down my search.

  • The near-instantaneous response provides results in a specific order and I fully expect to find the most relevant one on page one.

  • Should this not be the case, I can simply add terms to narrow the search down even further or use the available filters to cancel out the non-relevant results.

Alas, the best currently on offer on the data catalog market in terms of search capabilities seems to be limited to capable indexing, scoring, and filtering systems. This approach is satisfactory when the user has a specific idea of what they are looking for (high intent search) but can prove disappointing when the search is more exploratory (low intent search) or when the idea is simply to spontaneously suggest relevant results to a user (no intent).

In short, simple indexing is great for finding information whose characteristics are well known but falls short when the search is more exploratory. The results often include false positives, and the ranking is skewed toward exact matches.

A multidimensional search approach

We decided from the get-go that a simple indexation system would prove limited and would fall short of providing the most relevant results for the users. We, therefore, chose to isolate the search engine in a dedicated module on the platform and to turn it into a powerful innovation (and investment) zone.

We naturally took an interest in the work of Google’s founders on their PageRank algorithm. Google’s ranking takes into account several dozen aspects (called features), amongst which are the density of the relations between different graph objects (hypertext links in the case of web pages), the linguistic treatment of search terms, and the semantic analysis of the knowledge graph.

Of course, we do not have the means Google has, nor its expertise in terms of search result optimization. But we have integrated into our search engine several features that provide a high level of relevant results, and those features are permanently evolving.

We have integrated the following core features:

  • Standard, flat indexing of all the attributes of an object (name, description, and properties), weighted according to the type of property.
  • An NLP layer (Natural Language Processing) that takes into account the near misses (typing or spelling errors).
  • A semantic analysis layer that relies on the processing of the knowledge graph.
  • A personalization layer that currently relies on a simple user classification according to their uses, and will in the future be enriched by individual profiling.
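
As a deliberately simplified, purely illustrative sketch (not Zeenea’s actual algorithm), combining a weighted attribute match with tolerance for near misses could look like this, where the weights and the similarity threshold are arbitrary assumptions:

```python
import difflib

# Hypothetical per-attribute weights: a hit in the name counts more than in a description.
ATTRIBUTE_WEIGHTS = {"name": 3.0, "description": 1.5}

def score(query: str, asset: dict[str, str]) -> float:
    """Weighted score of an asset for a query, tolerating small typos."""
    total = 0.0
    for attribute, weight in ATTRIBUTE_WEIGHTS.items():
        for word in asset.get(attribute, "").lower().split():
            # difflib gives a cheap fuzzy similarity (1.0 = exact match).
            similarity = difflib.SequenceMatcher(None, query.lower(), word).ratio()
            if similarity > 0.8:  # near-miss threshold (assumption)
                total += weight * similarity
    return total

assets = [
    {"name": "customer_orders", "description": "orders placed by customers"},
    {"name": "suppliers", "description": "supplier master data"},
]
# The misspelled query "custmer" still ranks the customer-related asset first.
print(sorted(assets, key=lambda a: score("custmer", a), reverse=True)[0]["name"])
```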

 

Smart filtering to contextualize and limit search results

To complete the search engine, we also provide what we call a smart filtering system. Smart filtering is something we often find on e-commerce websites (such as Amazon, booking.com, etc.) and it consists in providing contextual filters to limit the search result. These filters work in the following way:

  • Only those properties that help reduce the list of results are offered in the list of filters – non-discriminating properties do not show up.
  • Each filter shows its impact – meaning the number of residual results once the filter has been applied.
  • Applying a filter refreshes the list of results instantaneously.
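
A minimal sketch of the facet-count logic behind such filters (illustrative data and function names): for each property, count how many of the current results carry each value, and hide any property that would not narrow the list down.

```python
from collections import Counter

# Current search results, each with a few filterable properties (hypothetical data).
results = [
    {"name": "orders", "type": "dataset", "technology": "snowflake", "domain": "sales"},
    {"name": "customers", "type": "dataset", "technology": "snowflake", "domain": "sales"},
    {"name": "tickets", "type": "dataset", "technology": "postgresql", "domain": "support"},
]

def facets(items, properties):
    """For each property, count residual results per value; hide non-discriminating ones."""
    counts = {prop: Counter(item[prop] for item in items) for prop in properties}
    # A property with a single value across all results would not narrow anything: hide it.
    return {prop: c for prop, c in counts.items() if len(c) > 1}

# "type" is hidden because every current result is a dataset; the other two properties
# remain, each value annotated with the number of results that would be left.
print(facets(results, ["type", "technology", "domain"]))
```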

With this combination of multi-dimensional search and smart filtering, we feel that we offer a superior search experience to any of our competitors. And our decoupled architecture enables us to explore new approaches continuously, and rapidly integrate those that seem efficient.


For more information on how a Smart search engine enhances a Data Catalog, download our eBook:

“What is a Smart Data Catalog?”!

What makes a data catalog “smart”? #2 – The Data Inventory

A data catalog harnesses enormous amounts of very diverse information – and its volume will grow exponentially. This will raise 2 major challenges: 

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be Smart in order to answer these 2 questions, with smart technological and conceptual features that go wider than the sole integration of AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

The second way to make a data catalog “smart“ is through its inventory. A data catalog is essentially a thorough inventory of information assets, enriched with metadata that helps harness the information as efficiently as possible. Setting up a data catalog therefore depends first of all on an inventory of the assets from the different systems.

Automating the inventory: the challenges

A declarative approach to building the inventory doesn’t strike us as particularly smart, however well thought out it may be. It involves a lot of work to launch and maintain the catalog, and in a fast-changing digital landscape the initial effort quickly becomes obsolete.

The first step in creating a smart inventory is of course to automate it. With a few exceptions, enterprise datasets are managed by specialized systems (distributed file systems, ERPs, relational databases, software packages, data warehouses, etc.). These systems manage, along with the data, all the metadata required for them to work properly. There is no need to recreate this information manually: you just need to connect to the different registries and synchronize the catalog content with the source systems.

In theory, this should be straightforward, but putting it into practice is actually rather difficult. The fact is, there is no universal standard that the different technologies conform to for accessing their metadata.

The essential role of connectivity to the system sources

A smart connectivity layer is a key part of the Smart Data Catalog. For a more detailed description of Zeenea’s connectivity technology, I recommend reading our previous ebook, the 5 technological breakthroughs of a next-generation catalog, but its main characteristics are:

  • Proprietary – we do not rely on third parties, so as to maintain a highly specialized extraction of the metadata.
  • Distributed – in order to maximize the reach of the catalog.
  • Open – anyone looking to enrich the catalog can develop their own connectors with ease.
  • Universal – it can synchronize any source of metadata.

This connectivity can not only read and synchronize the metadata contained in the source registries, it can also produce metadata.

This production of metadata requires more than simple access to the source system registries. It also requires access to the data itself, which will be analyzed by our scanners in order to enrich the catalog automatically.

To date, we produce 2 types of metadata:

  • Statistical analysis: to build a profile of the data – value distribution, rate of null values, top values, etc. (the nature of the metadata depends obviously on the native type of the data being analyzed);

  • Structural analysis: to determine the operational type of specific textual data (email, postal address, social security number, client code, etc. – the system is scalable and customizable).
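
A toy sketch of both kinds of profiling on a single column (illustrative only; the real scanners are far more elaborate):

```python
import re
from collections import Counter

values = ["alice@example.com", "bob@example.com", None, "carol@example.com"]

# Statistical analysis: null rate and most frequent values.
non_null = [v for v in values if v is not None]
profile = {
    "null_rate": 1 - len(non_null) / len(values),
    "top_values": Counter(non_null).most_common(3),
}

# Structural analysis: guess an operational type from the shape of the values.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
if all(EMAIL.match(v) for v in non_null):
    profile["operational_type"] = "email"

print(profile)
```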

The inventory mechanism must also be smart

Our inventory mechanism is also smart in several ways:

  • Dataset detection relies on extensive knowledge of the storage structures, particularly in a Big Data context. For example, an IoT dataset made up of thousands of files of time series measures can be identified as a unique dataset (the number of files and their location being only metadata).
  • The inventory is not integrated into the catalog by default to prevent the import of technical or temporary datasets that would be of little use (either because the data is unexploitable, or because it is duplicated data).

  • The selection process for the assets that should be imported into the catalog also benefits from some assistance – we strive to identify the most appropriate objects for integration in the catalog (with a variety of additional approaches to make this selection).
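
To illustrate the first point with a toy example (an assumption about how such grouping could be done, not a description of Zeenea’s detection logic), thousands of partitioned files sharing a common root can be collapsed into a single logical dataset:

```python
import re
from collections import defaultdict

# Hypothetical file paths discovered in a data lake.
paths = [
    "s3://iot/sensor_readings/date=2023-01-01/part-0001.parquet",
    "s3://iot/sensor_readings/date=2023-01-01/part-0002.parquet",
    "s3://iot/sensor_readings/date=2023-01-02/part-0001.parquet",
    "s3://crm/exports/customers.csv",
]

def group_into_datasets(file_paths: list[str]) -> dict[str, int]:
    """Collapse partitioned files into one dataset per logical root; count the files."""
    datasets = defaultdict(int)
    for path in file_paths:
        # Strip partition directories (key=value) and the file name itself.
        root = re.sub(r"/(\w+=[^/]+/)*[^/]+$", "", path)
        datasets[root] += 1
    return dict(datasets)

print(group_into_datasets(paths))
# {'s3://iot/sensor_readings': 3, 's3://crm/exports': 1}
```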

For more information on how Smart Data Inventorying enhances a Data Catalog, download our eBook:

“What is a Smart Data Catalog?”!

What makes a data catalog “smart”? #1 – Metamodeling

A data catalog harnesses enormous amounts of very diverse information – and its volume will grow exponentially. This will raise 2 major challenges: 

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be Smart in order to answer these 2 questions, with smart technological and conceptual features that go wider than the sole integration of AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

A universal and static metamodel cannot be smart

At an enterprise scale, the metadata required to harness information assets in any meaningful way can be considerable. Besides, metadata is specific to each organization, sometimes even to different populations within an organization. For example, a business analyst won’t necessarily seek the same information as an engineer or a product manager might.

Attempting to create a universal metamodel, therefore, does not seem very smart to us. Indeed, such a metamodel would have to adapt to a plethora of different situations, and will inevitably fall victim to one of the 3 pitfalls below:

  • Excessive simplicity which won’t cover all the use cases needed;
  • Excessive levels of abstraction with the potential to adapt to a number of contexts at the cost of arduous and time-consuming training – not an ideal situation for an enterprise-wide catalog deployment;
  • Insufficient abstraction, ultimately leading to a multiplicity of concrete concepts born out of a combination of notions from a variety of different contexts – many of which will be useless in any specific context, rendering the metamodel needlessly complicated and potentially incomprehensible.

In our view, smart metamodeling should ensure a metamodel that adapts to any context and can be enriched as use cases or maturity levels develop over time.

The organic approach to a metamodel

A metamodel captures a field of knowledge, and the formal structure of such a knowledge model is referred to as an ontology.

An ontology defines a range of object classes, their attributes, and the relationships between them. In a universal model, the ontology is static – the classes, the attributes, and the relations are predefined, with varying levels of abstraction and complexity.

Zeenea chose not to rely on a static ontology but rather on a scalable knowledge graph.

The metamodel is therefore voluntarily simple at the start – there are only a handful of types, representing the different classes of information assets (data sources, datasets, fields, dashboards), each with a few essential attributes (name, description, contacts).

This metamodel is fed automatically by the technical metadata extracted from the data sources, which varies depending on the technology in question (the technical metadata of a table in a data warehouse differs from the technical metadata of a file in a data lake).
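
A toy representation of such an organic metamodel as a small graph (an illustration under assumed names, not Zeenea’s internal structures): a few asset classes with essential attributes, which can later be enriched with new attributes without changing the model.

```python
from dataclasses import dataclass, field

@dataclass
class AssetType:
    """A class of information asset in the metamodel, with its attribute names."""
    name: str
    attributes: list[str] = field(
        default_factory=lambda: ["name", "description", "contacts"]
    )

@dataclass
class Relation:
    """A typed edge between two asset types in the knowledge graph."""
    source: str
    kind: str
    target: str

# A deliberately small starting metamodel...
asset_types = {t.name: t for t in (AssetType("data_source"), AssetType("dataset"),
                                   AssetType("field"), AssetType("dashboard"))}
relations = [Relation("data_source", "contains", "dataset"),
             Relation("dataset", "contains", "field"),
             Relation("dashboard", "uses", "dataset")]

# ...that can be enriched organically as new use cases appear.
asset_types["dataset"].attributes.append("confidentiality_level")
```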


For Zeenea, this organic metamodeling is the smartest way to handle the ontology issue in a data catalog. Indeed, it offers several advantages:

  • The metamodel can adapt to each context, often relying on a pre-existing model, integrating the in-house nomenclature and terminology without the need for a long and costly learning curve;
  • The metamodel does not need to be fully defined before using the data catalog – you will only need to focus on a few classes of objects and the few necessary attributes to cover the initial use cases. You can then load the model as catalog adoption progresses over time;
  • User feedback can be integrated progressively, improving catalog adoption, and as a result, ensuring return on investment for the metadata management.

Adding functional attributes to the metamodel in order to facilitate searching

There are considerable advantages to this metamodeling approach, but also one major inconvenience: since the metamodel is completely dynamic, it is difficult for the engine to understand the structure, and therefore difficult for it to help users feed the catalog and use the data (two core components of a Smart Data Catalog).

Part of the solution relates to the metamodel and the ontology attributes. Usually, metamodel attributes are defined by their technical types (date, number, character string, list of values, etc.). With Zeenea, the attribute type library does of course include these technical types.

But it also includes functional types – quality level, confidentiality level, personal data flag, etc. These functional types enable the Zeenea engine to better understand the ontology, refine its algorithms, and adapt the representation of the information.
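
To make the idea concrete, here is a hypothetical way of declaring attributes with both a technical and a functional type, so that an engine can treat a confidentiality level differently from an ordinary list of values (the names and structure are assumptions, not Zeenea’s schema):

```python
from dataclasses import dataclass
from enum import Enum

class TechnicalType(Enum):
    TEXT = "text"
    NUMBER = "number"
    DATE = "date"
    VALUE_LIST = "value_list"

class FunctionalType(Enum):
    QUALITY_LEVEL = "quality_level"
    CONFIDENTIALITY_LEVEL = "confidentiality_level"
    PERSONAL_DATA_FLAG = "personal_data_flag"
    NONE = "none"

@dataclass
class AttributeDefinition:
    name: str
    technical_type: TechnicalType
    functional_type: FunctionalType = FunctionalType.NONE

# The engine can now, for example, rank confidential assets differently
# or display a dedicated badge for this attribute.
confidentiality = AttributeDefinition(
    "Confidentiality", TechnicalType.VALUE_LIST, FunctionalType.CONFIDENTIALITY_LEVEL
)
```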


For more information on how Smart Metamodelling enhances a Data Catalog, download our eBook:

“What is a Smart Data Catalog?”!

The business glossary: a productivity lever for a data catalog

An organization needs to handle vast volumes of technical assets that often carry a lot of duplicate information in various systems. Documenting all these assets one by one is a near impossible challenge to overcome for most companies. 

With the help of automation, a certain amount of information is collected and this often provides detailed technical documentation of what is in the information system. Standard data catalog solutions then enable knowledgeable data users to complete this documentation, by adding classification attributes to describe in more detail the company’s technical ecosystem.

While this information is useful for more technically oriented users (engineers, architects, etc.), it often remains unclear to the growing number of enterprise data consumers, preventing them from exploiting and governing the data effectively.

In order to provide the necessary context for the use of this data, users need different types of information: organizational, statistical, compliance, etc. 

To be clear, the technical documentation must be accompanied by semantic information. This is the purpose of implementing a business glossary.

Building a common language with a business glossary

When business users talk about data, they usually refer to concepts such as customer address, sales, or the turnover of 2021. It’s unlikely they are referring to a table or a database schema, which they may not know or understand. A business glossary helps define these concepts and share these definitions amongst all employees.

The addition of semantic metadata thus meets several objectives:

  • Bridge the gap between business and technical users, by building a common language that allows them to collaborate effectively;
  • Align business users, especially those from different entities, on these definitions, which prevents ambiguities between related terms;
  • Enable all users to find the data they are looking for with greater ease, and provide the necessary context to understand and use it.

A good data catalog tool must therefore provide a solution that administers these business concepts, allows them to be linked to the technical assets that implement these concepts, and thus enable the use of the catalog enterprise-wide.
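As an illustration of this link between business concepts and technical assets, here is a minimal sketch with hypothetical names; it is not Zeenea's actual data model.

```python
# Sketch: a glossary term holds a shared definition and points to the
# technical assets (tables, files, reports) that implement the concept.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TechnicalAsset:
    asset_id: str    # e.g. a table, file, or dashboard identifier
    asset_type: str  # "table", "file", "report", ...


@dataclass
class GlossaryTerm:
    name: str                 # business concept, e.g. "Customer address"
    definition: str           # agreed-upon, shared definition
    implemented_by: List[TechnicalAsset] = field(default_factory=list)


turnover = GlossaryTerm(
    name="2021 turnover",
    definition="Total revenue recognized over fiscal year 2021, net of taxes.",
    implemented_by=[TechnicalAsset("dwh.finance.revenue_2021", "table")],
)

# A search on the business term can then surface the linked technical assets,
# giving business users context without requiring knowledge of the schema.
print(turnover.name, "->", [a.asset_id for a in turnover.implemented_by])
```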

Want to learn more about Business Glossaries?

If you would like to dive deeper into how a Business Glossary is an essential component of a data catalog, download our free eBook: “Business Glossary: An essential component of a Data Catalog for data fluent organizations”.

In this eBook, you will find: 

  • a complete description of the existing approaches, depending on the company's objective, when it comes to describing a domain of knowledge: lexicon, thesaurus, formal ontology;
  • a presentation of Zeenea’s graph-based business glossary approach, which offers the flexibility, simplicity and scalability necessary to cover the needs of data consumers.

    How to build an efficient permission management system for a data catalog


    An organization’s data catalog enhances all available data assets by relying on two types of information – on the one hand, purely technical information that is automatically synchronized from their sources; and on the other hand, business information that comes from the work of Data Stewards. The latter is updated manually and thus brings its share of risks to the entire organization.

    A permission management system is therefore essential in order to define and control the access rights of catalog users. In this article, we detail the fundamental characteristics and the possible approaches to build an efficient permission management system, as well as the solution implemented by Zeenea Data Catalog.

    Permission management system: an essential tool for the entire organization

    For data catalog users to trust in the information they are viewing, it is essential that the documentation of cataloged objects is relevant, of high quality and, above all, reliable. Your users must be able to easily find, understand and use the data assets at their disposal. 

    The origin of catalog information and automation 

    A data catalog generally integrates two types of information. On the one hand, there is purely technical information that comes directly from the data source. At Zeenea, this information is synchronized in a completely automated and continuous way between the data catalog and each data source, to guarantee its veracity and freshness. On the other hand, the catalog contains all the business or organizational documentation, which comes from the work of the Data Stewards. This information cannot be automated; it is updated manually by the company’s data management teams.

    A permission management system is a prerequisite for using a data catalog

    To manage this second category of information, the catalog must include access and input control mechanisms. Indeed, it is not desirable that any user of your organization’s data catalog can create, edit, import, export or even delete information without having been given prior authorization. A user-based permission management system is therefore a prerequisite; it plays the role of a security guard for the access rights of users.

     

    The 3 fundamental characteristics of a data catalog’s permission management system

    The implementation of an enterprise-wide permission management system is subject to a number of expectations that must be taken into account in its design. Among them, we have chosen in this article to focus on three fundamental characteristics of a permission management system: its level of granularity and flexibility, its readability and auditability, and its ease of administration.

    Granularity and flexibility

    First of all, a permission management system must have the right level of granularity and flexibility. Some actions should be available across the entire catalog for ease of use, while others should be restricted to certain parts of the catalog only. Some users will have global rights over all objects in the catalog, while others will be limited to editing only the perimeter that has been assigned to them. The permission management system must therefore allow for this range of possibilities, from global permissions down to a single object in the catalog.

    At Zeenea, for example, our clients are of all sizes, with very heterogeneous levels of maturity regarding data governance. Some are start-ups, others are large companies. Some have a data culture that is already well integrated into their processes, while others are only at the beginning of their data acculturation process. The permission management system must therefore be flexible enough to adapt to all types of organizations.

    Readability and auditability

    Second, a permission management system must be readable and easy to follow. During an audit, or a review of the system's permissions, an administrator who explores an object must be able to quickly determine who has the ability to modify it. Conversely, when an administrator looks at the details of a user's permission set, they must quickly be able to determine the scope that is assigned to that user and their authorized actions on it.

    This simply ensures that the right people have access to the right perimeters, and have the right level of permission for their role in the company. 

    Have you ever found yourself faced with a permission system so complex that it was impossible to understand why a user was allowed to access a piece of information – or, on the contrary, why they were not?

    Simplicity of administration

    Finally, a permission management system must be resilient to the catalog's ever-increasing volume. We know today that we live in a world of data: 2.5 exabytes of data were generated per day in 2020, and it is estimated that 463 exabytes of data will be generated per day in 2025. New projects, new products, new uses: companies must deal with the explosion of their data assets on a daily basis.

    To remain relevant, a data catalog must evolve with the company’s data. The permission management system must therefore be resilient to changes in content or even to the movement of employees within the organization.

     

    Different approaches to designing a data catalog permission management system

    There are different approaches to designing a data catalog permission management system, which more or less meet the main characteristics expected and mentioned above. We have chosen to detail three of them in this article.

    Crowdsourcing

    First, the crowdsourcing approach – where the collective is trusted to self-correct. A handful of administrators moderate the content and all users can contribute to the documentation. An auditing system usually complements this setup to make sure that no information is lost by mistake or malice. In this case, there is no control before documenting, but a collective correction afterwards. This is typically the system chosen by online encyclopedias such as Wikipedia. These systems depend on the number of contributors and their own knowledge to work well, as self-correction can only be effective through the collective.

    This system perfectly meets the need for readability – all users have the same level of rights, so there is no question about the access control of each user. It is also simple to administer – any new user has the same level of rights as everyone else, and any new object in the data catalog is accessible to everyone. On the other hand, there is no way to manage the granularity of rights. Everyone can do and see everything.

    Permission attached to the user 

    The second approach to designing the permission management system is using solutions where the scope is attached to the user’s profile. When a user is created in the data catalog, the administrators assign a perimeter that defines the resources that they will be able to see and modify. In this case, all controls are done upstream and a user cannot access a resource inadvertently. This is the type of system used by an OS such as Windows for example.

    This system has the advantage of being very secure – there is no risk that a new resource will be visible or modifiable by people who do not have the right to do so. This approach also meets the need for readability: for each user, all the accessible resources are easy to find. The expected level of granularity is also good, since it is possible to grant access resource by resource.

    On the other hand, administration is more complex – each time a new resource is added to the catalog, it must be added to the perimeters of the relevant users. It is possible to overcome this limitation by creating dynamic scopes. To do this, you can define rules that assign resources to users, for example, all PDF files will be accessible to a given user. But contradictory rules can easily appear, complicating the readability of the system.

    Permission attached to the resource 

    The last major approach to designing a data catalog’s permission management system is to use solutions where the authorized actions are attached to the resource to be modified. For each resource, the possible permissions are defined user by user. Thus it is the resource that has its own permission set. By looking at the resource, it is then possible to know immediately who can view or edit it. This is for example the type of system of a UNIX-like OS.

    The need for readability is perfectly fulfilled – an administrator can immediately see the permissions of different users when viewing the resource. The same goes for the need for granularity – this approach allows permissions to be given at the most macro level through an inheritance system, or at the most micro level directly on the resource. Finally, in terms of ease of administration, it is necessary to attach each new user to the various resources, which is potentially tedious. However, there are group systems that can mitigate this complexity.
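As an illustration of this resource-attached approach, here is a minimal sketch under simplified assumptions, showing permissions carried by the resource itself, with inheritance from a parent resource and user groups to ease administration. Names are illustrative, not a product implementation.

```python
# Sketch: each resource carries its own permission set; a check walks up the
# inheritance chain, and groups avoid attaching every new user one by one.
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Resource:
    name: str
    parent: Optional["Resource"] = None
    # principal (user or group) -> allowed actions, e.g. {"read", "edit"}
    permissions: Dict[str, Set[str]] = field(default_factory=dict)

    def allowed(self, user: str, action: str,
                groups: Optional[Dict[str, Set[str]]] = None) -> bool:
        groups = groups or {}
        principals = {user} | {g for g, members in groups.items() if user in members}
        node = self
        while node is not None:  # walk up the inheritance chain
            for principal in principals:
                if action in node.permissions.get(principal, set()):
                    return True
            node = node.parent
        return False


catalog = Resource("catalog", permissions={"stewards": {"read", "edit"}})
glossary = Resource("glossary", parent=catalog)
groups = {"stewards": {"alice"}}

print(glossary.allowed("alice", "edit", groups))  # True, inherited from the catalog level
print(glossary.allowed("bob", "edit", groups))    # False
```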

     

    The Zeenea Data Catalog permission management model: simple, readable and flexible

    Among these approaches, let’s detail the one chosen by Zeenea and how it is applied.

    The resource approach was preferred

    Let’s summarize the various advantages and disadvantages of each of the approaches discussed above. In both resource and user-level permission management systems, the need for granularity is well addressed – these systems allow for resource-by-resource permission to be assigned. In contrast, in the case of crowdsourcing, the basic philosophy is that anyone can access anything. Readability is clearly better in crowdsourcing systems or in systems where permissions are attached to the resource. It remains adequate in systems where permissions are attached to the user, but often at the expense of simplicity of administration. Finally, the simplicity of administration is very much optimized for the crowdsourcing approach and depends on what you are going to modify the most – the resource or the users.

    Since the need for granularity is not met in the crowdsourcing approach, we eliminated it. We were then left with two options: resource-based permission or user-based permission models. Since the readability is a bit better with resource-based permission, and since the content of the catalog will evolve faster than the number of users, the user-based permission option seemed the least relevant.

    The option we have chosen at Zeenea was therefore the third one: user permissions are attached to the resource.

    How the Zeenea Data Catalog permission management system works

    In Zeenea Data Catalog, it is possible to define for each user if they have the right to manipulate the objects of the whole catalog, one or several types of objects, or only those of their perimeter. This allows for the finest granularity, but also for more global roles. For example, “super-stewards” could have permission to act on entire parts of the catalog, such as the glossary.

    We then associate a list of Curators with each object in the catalog, i.e., those responsible for documenting that object. Thus, simply by exploring the details of the object, one can immediately know who to contact to correct or complete the documentation, or to answer a question about it. The system is therefore readable and easy to understand. The users’ scope of action is precisely determined through a granular system, right down to the object in the catalog.

    When a new user is added to the catalog, their scope of action must then be defined. For the moment, this configuration is done through the bulk editing of objects. In order to simplify management even further, it will soon be possible to define specific groups of users, so that when a new collaborator arrives there is no longer any need to add them by name to each object in their scope. Instead, they simply need to be added to the group, and their scope will be automatically assigned to them.

    Finally, we have voluntarily chosen not to implement a documentation validation workflow in the catalog. We believe that team accountability is one of the keys to the success of a data catalog adoption. This is why the only control we put in place is the one that determines the user’s rights and scope. Once these two elements have been determined, the people responsible for the documentation are free to act! The system is completed with an event log on modifications to allow complete auditability, as well as a discussion system on the objects. It allows everyone to suggest changes or report errors on the documentation.

    If you would like to learn more about our permission management model, or get more information about our Data Catalog, contact us.

     

    Breaking down Data Lineage: typologies and granularity


    As a concept, Data Lineage seems universal: whatever the sector of activity, any stakeholder in a data-driven organization needs to know the origin (upstream lineage) and the destination (downstream lineage) of the data they are handling or interpreting. And this need has important underlying motives.

    For a Data Catalog vendor, the ability to manage Data Lineage is crucial to its offer. As is often the case, however, behind a simple and universal question lies a world of complexity that is difficult to grasp. This complexity is partially linked to the heterogeneity of the answers, which vary from one stakeholder to another within the company.

    In this article, we will explain our approach to breaking down data lineage according to the nature of the information sought and its granularity.

     

    The typology of Data Lineage: seeking the origin of data

    There are many possible answers as to the origin of any given data. Some will want to know the exact formula or semantics of the data. Others will want to know which system(s), application(s), machine(s), or factory it comes from. Some will be interested in the business or operational processes that produced the data. Some will be interested in the entire upstream and downstream technical processing chain. It's difficult to sort through this maze of considerations!

    A layer approach

    To structure lineage information, we suggest emulating what is practiced in the field of geo-mapping by distinguishing several superimposable layers. We can identify three:

    • The physical layer, which includes the objects of the information system – applications, systems, databases, data sets, integration or transformation programs, etc.
    • The business layer, which contains the organizational elements – domains, business processes or activities, entities, managers, controls, committees, etc.
    • The semantic layer, which deals with the meaning of the data – calculation formulas, definitions, ontologies, etc.

      A focus on the physical layer

      The physical layer is the basic canvas on which all the other layers can be anchored. This approach is again similar to what is practiced in geo-mapping: above the physical map, it is possible to superimpose other layers carrying specific information.

      The physical layer represents the technical dimension of the lineage; it is materialized by tangible technical artifacts – databases, file systems, integration middleware, BI tools, scripts and programs, etc. In theory, the structure of the physical lineage can be extracted from these systems, and then largely automated, which is not generally the case for the other layers.

      The following seems fundamental: for this bottom-up approach to work, it is necessary that the physical lineage be complete.

      This does not mean that the lineage of all physical objects must be available, but for the objects that do have lineage, this lineage must be complete. There are two reasons for this. The first is that a partial (and therefore false) lineage risks misleading the person who consults it, jeopardizing the adoption of the catalog. The second is that the physical layer serves as an anchor for the other layers, which means any shortcomings in its lineage will be propagated.

      In addition to this layer-by-layer representation, let’s address another fundamental aspect of lineage: its granularity.

       

      Granularity in Data Lineage

      When it comes to lineage granularity, we identify 4 distinct levels: values, fields (or columns), datasets and applications.

      The values can be addressed quickly. Their purpose is to track all the steps taken to calculate any particular data (we’re referring to specific values, not the definition of any specific data). For mark-to-model pricing applications, for example, the price lineage must include all raw data (timestamp, vendor, value), the values derived from this raw data as well as the versions of all algorithms used in the calculation.

      Regulatory requirements exist in many fields (banking, finance, insurance, healthcare, pharmaceutical, IoT, etc.), but usually in a very localized way. They are clearly out of the reach of a data catalog, in which it is difficult to imagine managing every data value! Meeting these requirements calls for either a specialized software package or a specific development.

      The other three levels deal with metadata, and are clearly in the remit of a data catalog. Let’s detail them quickly.

      The field level is the most detailed level. It consists of tracing all the steps (at the physical, business or semantic level) for an item of information in a dataset (table or file), a report, a dashboard, etc., that enable the field in question to be populated.

      At the dataset level, the lineage is no longer defined for each field but at the level of the field container, which can be a table in a database, a file in a data lake, an API, etc. On this level, the steps that allow us to populate the dataset as a whole are represented, typically from other datasets (we also find on this level other artifacts such as reports, dashboards, ML models or even algorithms).

      Finally, the application level enables the documentation of the lineage macroscopically, focusing on high-level logical elements in the information system. The term “application” is used here in a generic way to designate a functional grouping of several datasets.

      It is of course possible to imagine other levels beyond those 3 (grouping applications into business domains, for example), but increasing the complexity is more a matter of flow mapping than lineage.

      Finally, it is important to keep in mind that each level is intertwined with the level above it. This means the lineage at the higher level can be worked out from the lineage at the lower level (if I know the lineage of all the fields of a dataset, then I can infer the lineage of that dataset).
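To illustrate this intertwining, here is a minimal sketch (names are illustrative) showing how dataset-level lineage could be inferred from field-level lineage by projecting each field onto the dataset that contains it.

```python
# Sketch: field-level lineage maps "dataset.field" to its upstream source fields;
# dataset-level lineage is obtained by keeping only the dataset part of each name.
from collections import defaultdict

field_lineage = {
    "sales_report.revenue": {"orders.amount", "orders.currency"},
    "sales_report.customer": {"crm_customers.id"},
}


def infer_dataset_lineage(field_lineage):
    dataset_lineage = defaultdict(set)
    for target_field, sources in field_lineage.items():
        target_dataset = target_field.split(".")[0]
        for source_field in sources:
            dataset_lineage[target_dataset].add(source_field.split(".")[0])
    return dict(dataset_lineage)


print(infer_dataset_lineage(field_lineage))
# e.g. {'sales_report': {'orders', 'crm_customers'}}
```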

      We hope that this breakdown of data lineage will help you better understand it for your organization. In a future article, we will share our approach so that each business can derive maximum value from Lineage thanks to our typology / granularity / business matrix. 

      To learn more about Data Lineage best practices, download our eBook: All you’ve ever wanted to know about Data Lineage!


      How pivoting to a SaaS model allowed 320 production releases in 6 months


      After starting out as an on-premise data catalog solution, Zeenea made the decision to switch to a fully SaaS solution. A year and a half later, more than three hundred production releases have been carried out over the last six months, an average of almost three per day! We explain here the reasons that pushed us to make this pivot, the organization put in place to execute it, as well as the added value for our customers.

       

      Zeenea’s beginnings: an on-prem data catalog

      When Zeenea was created in 2017, it was an on-premise solution, meaning that the architecture was physically present within our clients' companies. This choice was made in response to two major issues: first, the security of a solution that accesses the customer's data systems is essential and must be guaranteed; second, most of our customers' information systems relied on on-premise database management systems, which could not be accessed outside of these companies' internal networks.

      This approach, however, was a constraint on the expansion and evolution of Zeenea. The first reason was that it required a lot of customer support for deployments. The second was that several versions could be in production at different customers simultaneously. It was also complicated to deploy urgent fixes. Finally, newly developed product value only reached customers at a late stage.

       

      The strategic pivot to a 100% SaaS data catalog

      Faced with these potential obstacles to the development of our data catalog, we naturally decided at the end of 2019 to make the switch to a fully SaaS solution. A year and a half later, we have just completed more than three hundred production releases over the past six months, an average of almost three per day. Here’s how we did it.

      First, we addressed the initial security issue. We integrated security into our cloud practices right from the start of the project, and have in fact launched a security certification process in this regard (SOC2 and soon ISO27001). 

      Then, we extracted from our architecture the only component that had to remain on-premise: the Zeenea scanner. From a technological point of view, we set up a multi-tenant SaaS architecture by splitting our historical monolith into several application components.

      However, the biggest challenge did not lie in the technical aspects, but in the cultural and organizational aspects…

       

      The keys to our success: organization and acculturation to the SaaS model

      We have built and consolidated our SaaS culture, mainly by orienting our recruitments towards experienced profiles in this field, and by organizing knowledge sharing efficiently.

      To illustrate the cultural aspect, we distinguish, for example, finished developments from complete developments. At Zeenea, a development is considered finished when it is integrated into the code base, without any known bugs, with a level of security and engineering that conforms to the level of requirements that we set for ourselves. A development is considered complete when it can be made available to our customers, so that the developed functionalities form a usable and coherent whole. 

      To support this distinction, we have implemented a feature toggle mechanism to manage the activation of fully developed features: a development is systematically put into production as soon as it is finished, and then activated for our customers once it is complete.
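As a minimal sketch of the kind of feature toggle mechanism described above (hypothetical names, not Zeenea's actual implementation): finished code ships to production immediately but stays dormant until the feature is complete and activated for a given customer (tenant).

```python
# Sketch: a per-tenant feature toggle registry. Finished code is deployed but
# only becomes visible once the feature is activated for that tenant.
from typing import Dict, Set


class FeatureToggles:
    def __init__(self) -> None:
        # feature name -> set of tenant ids for which the feature is activated
        self._activations: Dict[str, Set[str]] = {}

    def activate(self, feature: str, tenant_id: str) -> None:
        self._activations.setdefault(feature, set()).add(tenant_id)

    def is_enabled(self, feature: str, tenant_id: str) -> bool:
        return tenant_id in self._activations.get(feature, set())


toggles = FeatureToggles()
toggles.activate("new-lineage-view", "tenant-42")

if toggles.is_enabled("new-lineage-view", "tenant-42"):
    print("serve the new, complete feature")
else:
    print("code is deployed ('finished') but stays invisible to this tenant")
```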

      In terms of organization, we have set up Feature Teams: each team works on a given feature, on all its components. As soon as a feature is complete, it is delivered. Other features are delivered incomplete, deactivated, but finished.

       

      The SaaS model and added value for our customers 

      The first to benefit from the agility of the SaaS model are obviously Zeenea’s customers. The functionalities are available more quickly, that is to say as soon as they are complete. Moreover, the deployment of a new functionality can be done at their convenience within two months after the feature toggle is made available. This allows for easy integration into the customer’s context, notably by integrating their user constraints. Finally, this ability to activate features allows us to demonstrate the features in advance, or even in some cases to activate them in beta testing for our customers.

      All this is obviously combined with the traditional advantages of a SaaS solution: automatic and frequent updates of minor evolutions or corrections, access to the solution from any browser, the absence of infrastructure at our customers’ sites allowing rapid scalability, etc. 

      While the path to pivot from an on-premise model to a SaaS application has had many challenges, we are proud today to have met the challenge of implementing continuous deployment and to bring more and more added value to our customers.

      We are always looking for new talents to join our features teams, so if you want to join the Bureau des Légendes, the IT Crowd or the ACDC team, contact us.

      Data Curation: essential for enhancing your data assets


      Having large volumes of data isn’t enough: it’s what you make of it that counts! To make the most out of your data, you need to distill a real data culture within your company. The foundation of this culture is data curation.

      90% of the world’s data has been created in the last two years. With the exponential growth of connected devices, companies will be confronted with the unfortunate reality that our ability to create data will far surpass our ability to manage and exploit it.

      And it’s not going to get any better! According to estimates published in Statista’s Digital Economy Compass 2020, the annual volume of digital data created globally has increased more than 20-fold over the past decade and will surpass the 50 zettabyte mark by 2021!

      In this context, it is not surprising that most companies are currently only able to analyze 12% of the data they have at their disposal! Because, behind the collection, storage and security of data, there is above all the business value that can be derived from it.

      This is the challenge addressed by the concept of Data Curation: the essential step to exploit the potential of an organization’s abundant data assets. 

       

      The definition of Data Curation

      According to the definition given by the INIST (Institut de l’Information Scientifique et Technique), which is attached to the CNRS, 

      “Curation refers to all the activities and operations necessary for the active management of digital research data, throughout their life cycle. The objective is to make them accessible, shareable and reusable in a sustainable way. Three stakeholders can be identified in the data life cycle: the creators, most often researchers, the “curators” and the users.”

      In other words, data curation is the task of identifying, within a data catalog, the data that can be valorized and exploited, and then making it available to the users most likely to draw the best lessons from it.

      To set up an efficient and relevant Data Curation, you need to start with a precise mapping of the available data. This initial mapping is the basis for pragmatic and operational data governance. 

      Once the rules of governance have been established, all attention must turn to the data user. Data is a raw material that only has worth if it is properly valued, and this valuation must be thought of as a response to the user's needs. It is the user who is at the origin of the data curation project.

      Data curation is thus an iterative and continuous process for data exploitation, distinct from the tasks essential to data governance (from quality management to data protection and even data life cycle management).

      Data Curation: essential prerequisites, undeniable benefits

      Data Curation opens the way to the rapid and large-scale development of a data culture within an organization.

      The creation of a data management and curation strategy allows you to take stock of the data produced. It is then possible to select the most relevant data and enrich it with the metadata necessary to understand and reuse it, including by business users.

      Everyone in the company can then base their choices, decisions, strategies and methods on the systematic use of data, without having to have specific skills.

      The objective: to create the conditions for the systematic use of data as a basis for any project or approach, rather than limiting its use to Data Science or data expert teams.

      To effectively deploy your data curation strategy, you must therefore rely on elements that are essential to the proper management of your data assets. The heart of the system is not limited to data catalogs!

      While data catalogs are essential and flow directly from your data mapping, metadata governance plays an even more crucial role. Metadata makes it easier for users to interact with data portfolios in natural language.

      With data curation, get into a data-driven dynamic for good!

      The 7 lies of Data Catalog Providers – #7 – A Data Catalog is complex…but isn’t complicated!


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog is complex…but isn’t complicated

      This closing statement seems to us a fitting summary of all of the above, and it will serve as our conclusion.

      We have seen too many Data Catalog initiatives morph into endless data governance projects that try to solve a laundry list of issues – ignoring those easily solved by a Data Catalog.

      Once you have removed the extra baggage, the deployment of a Data Catalog only takes a few days, rather than months, to produce value.

      The services rendered by a Data Catalog are simple. In its leanest form, a Data Catalog presents as a search bar, in which any user can type in a few keywords (or even pose a question in natural language) and obtain a list of results with the first 5 elements being the most relevant, thus providing them with all the information they need to use the data (just like a web search engine, or an online retailer).

      This ease of use is crucial to guarantee adoption by the data teams. On the user front, the Data Catalog should be a simple affair with a clean design. Like any other search or recommendations engine, however, the underlying complexity is substantial.

      The good news for the customer is that this complexity is nothing for you to worry about: it's on us.

      Zeenea has invested enormously in the structure of the information (building a knowledge graph), in automation, and in the search and recommendations engine. This complexity isn't visible, but it is what constitutes the value of a Data Catalog.

      The obsession for simplicity is at the heart of our values. Each functionality we choose to add to the product has to tick one of the two boxes below:

      • Does this functionality help deploy the catalog faster in the organization?
      • Does this functionality enable the data teams to find the information more quickly in order to get on with their projects?

      If neither of these questions is answered with a yes, the functionality is discarded.

      The result is that you can connect Zeenea to your operational systems, configure and feed your first metamodel, and open the catalog to the end users within a matter of days.

       

      Of course, going forward, you’ll need to complete the metamodel, integrate other sources, etc. But the value creation is instant.

       

      Take Away

      In line with the search for simplicity, the Data Catalog need not be an expensive solution.

       

      This is true for the implementation costs – a Data Catalog deployment does not require thousands of work hours.

      We offer a deployment program of 3 to 6 weeks that includes onboarding, integration, and the creation of a first metamodel for 3,000 euros. The same goes for the software costs: forget 6-figure bills!

      Our starting price is 18,000 Euros per year, for 5 data stewards, 50 data consumers, and 3 types of connectors.

       

       

      Download our eBook: The 7 lies of Data Catalog Providers for more!

      The 7 lies of Data Catalog Providers – #6 A Data Catalog must rely on automation!


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog must rely on automation!

      Some Data Catalog vendors, who hail from the world of data mapping, have developed the rhetoric that automation is a secondary topic, which can be addressed at a later stage.

      They will tell you that a few manual file imports suffice, along with a generous user community collaborating on their tool to feed and use the catalog. A little arithmetic is enough to understand why this approach is doomed to failure in a data-centric organization.

      An active Data Lake, even a modest one, quickly hoovers up, in its different layers, hundreds or even thousands of datasets. To these can be added datasets from other systems (application databases, various APIs, CRMs, ERPs, NoSQL stores, etc.) which we usually want to integrate into the catalog.

      The orders of magnitude quickly go beyond thousands, sometimes tens of thousands of datasets. Each dataset contains dozens of fields. Datasets and fields alone represent several hundreds of thousands of objects (we could also include other assets: ML models, dashboards, reports, etc). In order for the catalog to be useful, inventorying those objects isn’t enough.

      You also need to attach to them all the properties (metadata) that will enable end users to find, understand, and exploit these assets. There are several types of metadata: technical information, business classification, semantics, security, sensitivity, quality, norms, uses, popularity, contacts, etc. Here again, for each asset, there are dozens of properties.

      Back to the arithmetic: overall, we are dealing with millions of attributes that need to be maintained.
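The arithmetic can be made concrete with illustrative orders of magnitude; the numbers below are assumptions for the sake of the example, not measurements.

```python
# Back-of-envelope estimate: even modest assumptions land in the millions of
# attributes to keep up to date.
datasets = 10_000            # datasets inventoried across the data lake and other systems
fields_per_dataset = 30      # "dozens of fields" per dataset
properties_per_object = 20   # "dozens of properties" (metadata) per asset

objects = datasets * (1 + fields_per_dataset)   # datasets plus their fields
attributes = objects * properties_per_object

print(f"{objects:,} objects, {attributes:,} attributes to maintain")
# 310,000 objects, 6,200,000 attributes to maintain
```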

      Such volumes alone should disqualify any temptation to choose the manual approach. But there is more. The stock of informational assets isn’t static. It is constantly growing. In a data-centric organization, datasets are created daily, others are moved or changed.

      The Data Catalog needs to reflect these changes.

       

      Otherwise, its content will be permanently obsolete and the end users will reject it. Who is going to trust a Data Catalog that is incomplete and wrong? If you feel that your organization can absorb the load and keep your catalog up to date, that's wonderful. Otherwise, we would suggest you assess as early as possible the level of automation provided by the different solutions you are looking at.

       

      What can we automate in a Data Catalog?

      In terms of automation, the most important capacity is the inventory.

      A Data Catalog should be able to regularly scan all your data sources and automatically update the asset inventory (datasets, structures, and technical metadata at a minimum) to reflect the day-to-day reality of the hosting systems.

      Believe us: a Data Catalog that cannot connect to your data sources will quickly become useless, because its content will always be in doubt.

       

      Once the inventory is completed, the next challenge is to automate the metamodel feed.

      Here, beyond the technical metadata, complete automation seems a little hard to imagine. It is still possible to significantly reduce the necessary workload for the maintenance of the metamodel. The value of certain properties can be determined by simply applying rules at the time of the integration of the objects in the catalog.

      It is also possible to suggest property values using more or less sophisticated algorithms (semantic analysis, pattern matching, etc.).

      Lastly, it’s often possible to feed part of the catalog by integrating the systems that produce or contain metadata. This can apply for instance for quality measurement, for lineage information, for business ontologies, etc.

      For this approach to work, the Data Catalog must be open and offer a complete set of APIs that allow the metadata to be updated from other systems.
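As a minimal sketch of the idea of automating part of the metamodel feed, the snippet below applies simple rules when objects enter the catalog, one setting a value outright and one producing a suggestion via pattern matching. The rules and field names are hypothetical illustrations, not Zeenea's actual engine or API.

```python
# Sketch: rule-based metadata enrichment at ingestion time. Each rule pairs a
# predicate on the ingested asset with the properties to apply or suggest.
import re
from typing import Callable, Dict, List, Tuple

Rule = Tuple[Callable[[dict], bool], Dict[str, str]]

rules: List[Rule] = [
    (lambda a: a["source"] == "finance_dwh", {"domain": "Finance"}),
    (lambda a: re.search(r"(email|phone|address)", a["name"], re.I) is not None,
     {"sensitivity": "personal data (suggested)"}),
]


def enrich(asset: dict) -> dict:
    """Apply every matching rule's properties to the ingested asset."""
    for predicate, properties in rules:
        if predicate(asset):
            asset.setdefault("properties", {}).update(properties)
    return asset


print(enrich({"name": "customer_email", "source": "finance_dwh"}))
# {'name': 'customer_email', 'source': 'finance_dwh',
#  'properties': {'domain': 'Finance', 'sensitivity': 'personal data (suggested)'}}
```

The same mechanism can be driven from outside the catalog through its APIs, for instance to push quality scores or lineage information produced by other systems.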

       

      Take Away

      A Data Catalog handles millions of pieces of information in a constantly shifting landscape.

      Maintaining this information manually is virtually impossible, or extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.

      Download our eBook: The 7 lies of Data Catalog Providers for more!

      Pricing your Enterprise Data Catalog: What should it really cost?


      For the last couple of years, the Zeenea international sales team has been in contact with prospective clients the world over, presenting Zeenea, running product demos, submitting RFPs (mostly against the big guns in the industry), advising CDOs on data management strategies, accompanying existing customers with their catalog deployment, helping address technical challenges when they arise and always discussing pricing…

      For anyone who has been actively looking at the data catalog market, pricing will feel like a complex affair that varies widely depending on use cases, delivery models (SaaS, on-prem, annual subscription, etc.), number of users, and so on.

      This blog post seeks to demonstrate that putting a price tag on a data catalog should not really be a complex affair. While the technology that makes a catalog tick can be quite impressive (especially if it's powered by a knowledge graph), its ultimate purpose is to help data users access, curate, and leverage information that already exists. Nothing more, nothing less.

      Below, we’ll relate a couple of fairly typical examples of Zeenea’s approach to pricing which resulted in customer wins for us, and money saved for the customer.

       

      Avoid the quagmire of maximum cost/minimal adoption.

      One telling example of the financial repercussions of mismanaged data catalog adoption occurred with a large French bank (Zeenea is a Paris-based startup).

      This bank had initially chosen a well known data catalog provider to help with the management, governance and quality of their data. Like all financial institutions, this one was subject to BCBS 239 and compliance was therefore a key issue. The catalog provider, whose platform boasts a very sophisticated predefined metamodel with tons of features, therefore set out to help the customer organize its data governance around BCBS 239. Unfortunately, the project quickly began to turn sour for a couple of reasons:  

      1-  The “out of the box” metamodel, it turned out, didn’t readily fit with the organization of the data teams. The predictable consequence of this mismatch was having to allocate far more internal resources than planned to handle their compliance initiative.

      2- The focus on Data Governance didn’t really deal with the needs of business teams looking for easy access to their datasets. This unhappy situation had a negative impact on the overall adoption of the data catalog and end users simply stopped using it. 

      By the time the client stopped the initiative, there were a very limited number of compliance experts actually using the catalog for a bill of a few million euros per year…Suffice it to say that the purse holders were not entirely satisfied with the cost-benefit ratio…

      The lesson here is that in order to successfully implement a data catalog long term, one needs early and durable user adoption. A simple, well defined, use case with a manageable number of stakeholders will go a long way to achieving that whilst ensuring costs are kept within reasonable limits. This is precisely what Zeenea did with this customer.

      The bill was reduced tenfold and catalog adoption of our Explorer platform tripled.

      Don’t crack a walnut with a (costly) sledgehammer.

       

      Another customer (an SME from the UK in the retail sector), whose data landscape consists of a BI tool, PostgreSQL, an ETL platform, and Power BI, reached out to us having already received quotes from other well-established data catalog providers.

      Their use case was straightforward: The data engineering team needed to clean up their data lake after years of neglect and the data users in the group, analytics folk mostly, wanted to access data assets more easily. 

      I’m paraphrasing but the overriding sentiment was that the solutions they had looked at before consulting Zeenea, whilst undoubtedly useful for larger organisations with complex data landscapes, data governance and compliance imperatives, had too many features which were irrelevant to their actual requirements.

      The customer was eventually sent a 6 figure quote which the CFO promptly discarded.

      Today this mismatch is the most common one we come across, especially amongst small data teams with more straightforward use cases for a data catalog. Unlike the French bank mentioned above, most SMEs cannot afford to spend, let alone lose, vast amounts of money on a failed data experiment.

       

      So how much should a data catalog really cost…?

      Early in 2020, we decided to take a different approach to pricing, one that was coherent with our “Start small, Scale fast” approach to data catalog adoption. This pricing model, presented as a “Basic Data Discovery”, was designed to encourage data teams to start their data cataloging journey with a straightforward use case and roll out the catalog to other users and projects incrementally, thus ensuring maximum catalog adoption and, crucially, keeping a handle on costs.

      Our basic Data Discovery conditions are simple (and easy to deliver on) – This offer includes the POC, 2 connectors, 2 data stewards, 20 data explorers and of course customer support.

       

      Provide us with…

      • A Use Case for the POC.
      • The Data sources you need to connect to (up to 2 for the Basic Data Discovery).
      • The number of Data Stewards needed (up to 2 for the Basic Data Discovery).
      • The number of Data Explorers needed (up to 20 for the Basic Data Discovery). 

       

      …and Zeenea will provide you with a quote your CFO can depend on. It really is that simple.*

        

      Click here for more information, or to request a POC.

       

      *These conditions were not chosen at random. In our experience, data teams looking to roll out a cataloging solution seldom choose a use case requiring more than 2 data connectors and a handful of data stewards, and we don’t necessarily recommend that they do.

      The 7 lies of Data Catalog Providers – #5 A Data Catalog is not a Business Modeling Solution!


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog is NOT a Business Modeling Solution

      Some organizations, usually large ones, have invested for years in the modeling of their business processes and information architecture.

      They have developed several layers of models (conceptual, logical, physical) and have put in place an organization that helps the maintenance and sharing of these models with specific populations (business experts and IT people mostly).

      We do not question the value of these models. They play a key role in IT landscape planning, schema blueprints, and IS management, as well as in regulatory compliance. But we seriously doubt that these modeling tools can provide a decent Data Catalog.

      There is also a market phenomenon at play here: certain historical business modeling players are looking to widen the scope of their offer by positioning themselves on the Data Catalog market. After all, they do already manage a great deal of information on physical architecture, business classifications, glossaries, ontologies, information lineage, processes and roles, etc. But we can identify two major flaws in their approach.

      The first is organic. By their nature, modeling tools produce top-down models to outline the information in an IS. However accurate it may be, a model remains a model: a simplified representation of reality.

      They are very useful communication tools in a variety of domains, but they are not an exact reflection of the day-to-day operational reality which, for us, is crucial to keeping the promises of a Data Catalog (enabling teams to find data, and to understand and know how to use the datasets).

      The second flaw? It is not user-friendly.

      A modeling tool is complex and handles a large number of abstract concepts, which entail a steep learning curve. It's a tool for experts.

      We could consider improving user friendliness of course to open it up to a wider audience. But the built-in complexity of the information won’t go away.

      Understanding the information provided by these tools requires a solid understanding of modeling principles (object classes, logical levels, nomenclatures, etc). It is quite a challenge for data teams and a challenge that seems difficult to justify from an operational perspective.

      The truth is, modeling tools that have been turned into Data Catalogs face significant adoption issues with the teams (they have to make huge efforts to learn how to use the tool, only to not find what they are looking for).

      A prospective client recently presented us with a metamodel they had built and asked us whether it was possible to implement it in Zeenea. Derived from their business models, the metamodel had several dozen classes of objects and thousands of attributes. To their question, the official answer was yes (the Zeenea metamodel is very flexible). But instead, we tried to dissuade them from taking that path: A metamodel that sophisticated ran the risk, in our opinion, of losing the end users, and turning the Data Catalog project into a failure…

      Should we therefore abandon business models when putting a Data Catalog in place? Absolutely not.

      It must, however, be remembered that business models are there to handle some issues, and the Data Catalog other issues. Some information contained within the models help structure the catalog and enrich its content in a very useful way (for instance responsibilities, classifications, and of course business glossaries).

      The best approach is therefore, in our view, to conceive the catalog metamodel by focusing exclusively on the added value to the data teams (always with the same underlying question: does this information help find, localize, understand, and correctly use the data?), and then integrating the modeling tool and the Data Catalog in order to automate the supply of certain elements of the metamodel already present in the business model.

       

      Take Away

       As useful and complete as they may be, business models are still just models: they are an imperfect reflection of the operational reality of the systems and therefore they struggle to provide a useful Data Catalog.

      Modeling tools, as well as business models, are too complex and too abstract to be adopted by data teams. Our recommendation is that you define the metamodel of your catalog with a view to answering the questions of the data teams and supply some aspects of the metamodel with the business model.

      Download our eBook: The 7 lies of Data Catalog Providers for more!

      The 7 lies of Data Catalog Providers – #4 A Data Catalog is not a Query Solution!


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog is NOT a Query Solution

       

      Here is another oddity of the Data Catalog market. Several vendors, whose initial aim was to allow users to query simultaneously several data sources, have “pivoted” towards a Data Catalog positioning on the market.

      There is a reason for them to pivot.

      The emergence of Data Lakes and Big Data has cornered them in a technological cul-de-sac that has weakened the market segment they were initially in.

      A Data Lake is typically segmented into several layers. The “raw” layer integrates data without transformation, in formats that are more or less structured and in great quantities. A second layer, which we'll call “clean”, will contain roughly the same data but in normalized formats, after a clean-up. After that, there can be one or several “business” layers ready for use: a data warehouse and visualization tool for analytics, a Spark cluster for data science, a storage system for commercial distribution, etc. Within these layers, data is transformed, aggregated, and optimized for use, along with the tools supporting this use (data visualization tools, notebooks, massive processing, etc.).

       

      In this landscape, a universal self-service query tool isn’t suitable.

       

      It is of course possible to set up an SQL interpretation layer on top of the “clean” layer (like Hive) but query execution remains a domain for specialists. The volumes of data are huge and rarely indexed. 

      Allowing users to define their own queries is very risky: On on-prem systems, they run the risk of collapsing the cluster by running a very expensive query. And on the Cloud, the bill could run very high indeed. Not to mention security and data sensitivity issues.

       

      As for the “business” layers, they are generally coupled with more specialized solutions (such as a combination of Snowflake and Tableau for analytics) that offer very complete and secured tooling, offering great performance for self-service queries. With their market space shrinking like snow in the sun, some multi-source query vendors have pivoted towards Data Catalogs.

      Their pitch is now to convince customers that the ability to execute queries makes their solution the Rolls-Royce of Data Catalogs (In order to justify their six-figure pricing). We would invite you to think twice about it…

       

      Take Away

      On a modern data architecture, the capacity to execute queries from a Data Catalog isn’t just unnecessary, it’s also very risky (performance, cost, security, etc.).

      Data teams already have their own tools to execute queries on data, and if they haven’t, it may be a good idea to equip them. Integrating data access issues in the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.

      Download our eBook: The 7 lies of Data Catalog Providers for more!

      What is Data Mesh?


      In this new era of information, new terms are used in organizations working with data: Data Management Platform, Data Quality, Data Lake, Data warehouse… Behind each of these words we find specificities, technical solutions, etc. With Data Mesh, you go further by reconciling technical and functional management. Let’s decipher.

      Did you say: “Data Mesh”? Don’t be embarrassed if you’re not familiar with the concept. The term wasn’t used until 2019 as a response to the growing number of data sources and the need for business agility. 

      The Data Mesh model is based on the principle of a decentralized or distributed architecture exploiting a literal mesh of data.

      While a Data Lake can be thought of as a storage space for raw data, and the Data Warehouse is designed as a platform for collecting and analyzing heterogeneous data, Data Mesh responds to a different use case. 

      On paper, a Data Warehouse and Data Mesh have a lot in common, especially when it comes to their main purposes, which is to provide permanent, real-time access to the most up-to-date information possible. But Data Mesh goes further. The freshness of the information is only one element of the system.

      Because it is part of a distributed model, Data Mesh is designed to address each business line in your company with the key information that it concerns.

      To meet this challenge, Data Mesh is based on the creation of data domains. 

      The advantages? Your teams are more autonomous through local data management, a decentralization of your enterprise in order to aggregate more and more data, and finally, more control of the overall organization of your data assets.

       

      Data Mesh: between logic and organization

If a Data Lake is ultimately a single reservoir for all your data, Data Mesh is the opposite. Forget the monolithic dimension of a Data Lake. Data is a living, evolving asset, a tool for understanding your market and your ecosystem, and an instrument of knowledge.

Therefore, in order to embrace the concept of data meshing, you need to think differently about data. How? By laying the foundations for a multi-domain organization. Each type of data has its own use, its own audience, and its own way of being exploited. From then on, all the business areas of your company base their actions and decisions on the data that is genuinely useful to them in accomplishing their missions. The data used by marketing is not the same as the data used by sales or your production teams.

      The implementation of a Data Catalog is therefore the essential prerequisite for the creation of a Data Mesh. Without a clear vision of your data’s governance, it will be difficult to initiate your company’s transformation. Data quality is also a central element. But ultimately, Data Mesh will help you by decentralizing the responsibility for data to the domain level and by delivering high-quality transformed data.

      The Challenges

Does adopting Data Mesh seem out of reach because the project is both complex and technical? No cause for panic! Data Mesh, beyond its technicality, its requirements, and the rigor that goes with it, is above all a new paradigm. It must lead all the stakeholders in your organization to think of data as a product addressed to the business.

      In other words, by moving towards a Data Mesh model, the technical infrastructure of the data environment is centralized, while the operational management of the data is decentralized and entrusted to the business.

With Data Mesh, you create the conditions for an acculturation to data across all your teams, so that every employee can base their daily actions on data.

       

      The Data Mesh paradox

Data Mesh is meant to put data at the service of the business. This means that your teams must be able to access it easily, at any time, and to manipulate it so that it becomes the basis of their daily activities.

      But in order to preserve the quality of your data, or to guarantee compliance with governance rules, change management is crucial and the definition of each person’s prerogatives is decisive. When deploying Data Mesh, you will have to lay a sound foundation in the organization. 

      On the one hand, free access to data for each employee (what we call functional governance). On the other hand, management and administration, in other words, technical governance in the hands of the Data teams.

      Decompartmentalizing uses by compartmentalizing roles, that’s the paradox of Data Mesh! 

      The 7 lies of Data Catalog Providers – #3 A Data Catalog is not a Compliance Solution!


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog is NOT a Compliance Solution

       

      As with governance, regulatory compliance is a crucial issue for any data-centric organization.

There is a plethora of data handling regulations spanning all sectors of activity and countries. On the subject of personal data alone, GDPR is mandatory across all EU countries, but each Member State has a lot of wiggle room on how it’s implemented, and most States have a large arsenal of legislation to complete, reinforce, and adapt it (Germany alone, for instance, has several dozen regulations related to personal data across different sectors of activity).

In the US, there are hundreds of laws and regulations across States and sectors of activity (with varying degrees of adherence). And here we are only referring to personal data… Rules and regulations also exist for financial data, medical data, biometric data, banking data, risk data, insurance data, etc. Put simply, every organization has some regulation it has to comply with.

       

      So what does compliance mean in this case?

      The vast majority of regulatory audits center on the following:

• The ability to provide complete and up-to-date documentation on the procedures and controls put in place in order to meet the standards,
• The ability to prove that the procedures described in the documentation are actually applied in the field,
• The ability to supervise all the measures deployed, with a view towards continuous improvement.

A Data Catalog is neither a procedures library nor an evidence consolidation system, and even less a process supervision solution.

It strikes us as obvious that assigning those responsibilities to a Data Catalog will make it considerably less simple to use (standards are too obscure for most people) and will jeopardize adoption by those most likely to benefit from it (the data teams).

      Should we therefore forget about Data Catalogs in our quest for compliance?

       

No, of course not. Again, in terms of compliance, it would be much wiser to use the Data Catalog to improve the data literacy of the data teams, and to tag the data appropriately, thus enabling the teams to quickly identify any standard or procedure they need to adhere to before using the data. The Catalog can even help place these tags using a variety of approaches; for example, it can automatically detect sensitive or personal data.

That said, even with the help of ML, detection will never work perfectly (the notion of “personal data” defined by GDPR, for instance, is much broader and harder to detect than North American PII). The Catalog’s ability to manage these tags is therefore critical.
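
As an illustration of this kind of automatic tagging, here is a simple, rule-based sketch that flags columns likely to contain personal data based on a sample of values. The patterns, thresholds, and tag names are hypothetical, and this is not a description of any vendor’s actual detection engine.

```python
# Illustrative, rule-based detection of likely personal data in column samples.
# Real catalogs combine such heuristics with ML; this is only a sketch.
import re

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$"),
    "phone_fr": re.compile(r"^(\+33|0)[1-9](\d{2}){4}$"),
}

def suggest_tags(column_name: str, sample_values: list[str]) -> set[str]:
    tags = set()
    for label, pattern in PATTERNS.items():
        matches = sum(1 for value in sample_values if pattern.match(value))
        if sample_values and matches / len(sample_values) > 0.8:
            tags.add(f"personal_data:{label}")
    if "birth" in column_name.lower():
        tags.add("personal_data:date_of_birth")
    return tags

print(suggest_tags("contact_email", ["jane@acme.com", "john@acme.com"]))
# {'personal_data:email'}
```

A human curator would still need to confirm or correct these suggestions, which is exactly why the Catalog’s tag management capabilities matter.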

       

      Take Away

      Regulatory compliance is above all a matter of documentation and proof and has no place in a Data Catalog.

However, the Data Catalog can help identify (more or less automatically) data that is subject to regulations, and it plays a key role in raising the data teams’ awareness of the importance of those regulations.

      Download our eBook: The 7 lies of Data Catalog Providers for more!

      The 7 lies of Data Catalog Providers – #2 A Data Catalog is NOT a Data Quality Management Solution


      The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

       These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

      The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

      The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

      A Data Catalog is NOT a Data Quality Management (DQM) Solution

       

We at Zeenea do not underestimate the importance of data quality in successfully delivering a data project, quite the contrary. It just seems absurd to us to put it in the hands of a solution which, by its very nature, cannot perform the controls at the right time.

      Let us explain.

There is a very elementary rule to quality control, a rule that can be applied in virtually any domain where quality is an issue, be it an industrial production chain, software development, or the cuisine of a 5-star restaurant: the sooner a problem is detected, the less it costs to correct.

To illustrate the point, a car manufacturer does not wait until a new vehicle is fully built, when all the production costs have already been incurred and fixing a defect would cost the most, to test its battery. No. Each part is closely inspected, each production step is tested, defective parts are removed before ever entering the production line, and the entire production chain can be halted if quality issues are detected at any stage. Quality issues are corrected at the earliest possible stage of the production process, where the fix is the least costly and the most durable.

       

“In a modern data organization, data production rests on the same principles. We are dealing with an assembly chain whose aim is to feed uses with high added value. Quality control and correction must happen at each step. The nature and level of the controls will depend on what the data is used for.”

       

If you are handling data, you obviously have pipelines at your disposal to feed your use cases. These pipelines can involve dozens of steps: data acquisition, data cleaning, various transformations, mixing of data sources, etc.

In order to build these pipelines, you probably have a number of technologies at play, anything from in-house scripts to costly ETLs and exotic middleware tools. It is within those pipelines that you need to insert and steer your quality controls, as early as possible, adapting them to what is at stake for the end product. Measuring data quality levels only at the end of the chain isn’t just absurd, it’s totally inefficient.
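
As a minimal illustration of this “control as early as possible” principle, here is a sketch of a pipeline step that validates raw data before any downstream transformation. The checks, column names, and file path are hypothetical; a real pipeline would plug such controls into whatever orchestration and DQM tooling it already uses.

```python
# Minimal sketch: quality checks placed at the earliest pipeline step,
# so bad data is rejected before any costly downstream transformation.
import pandas as pd

def check_raw_orders(df: pd.DataFrame) -> None:
    # Fail fast, with controls adapted to what is at stake downstream.
    if df["order_id"].isna().any():
        raise ValueError("Null order_id found at ingestion - stopping the pipeline")
    if (df["amount"] < 0).any():
        raise ValueError("Negative amount found at ingestion - stopping the pipeline")

def run_pipeline(raw_path: str) -> pd.DataFrame:
    raw = pd.read_csv(raw_path)
    check_raw_orders(raw)  # control as early as possible, on the raw input
    cleaned = raw.dropna(subset=["customer_id"])
    # ... further transformations, aggregations, joins ...
    return cleaned
```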

It is therefore difficult to see how a Data Catalog (whose purpose is to inventory and document all potentially usable datasets in order to facilitate data discovery and usage) can be a useful tool to measure and manage quality.

A Data Catalog operates on available datasets, on any system that contains data, and should be as minimally invasive as possible in order to be deployed quickly throughout the organization.

A DQM solution works on the data feeds (the pipelines), focuses on production data, and is, by design, intrusive and time-consuming to deploy. We cannot think of any software architecture that can tackle both issues without compromising the quality of one or the other.

       

Data Catalog vendors promising to solve your data quality issues are, in our opinion, making a promise they will struggle to keep, and it seems unlikely they can go beyond a “salesy” demo.

       

      As for DQM vendors (who also often sell ETLs), their solutions are often too complex and costly to deploy as credible Data Catalogs.

The good news is that the orthogonal nature of data quality and data cataloging makes it easy for specialized solutions in each domain to coexist without encroaching on each other’s territory.

Indeed, while a Data Catalog isn’t intended for quality control, it can exploit information about the quality of the datasets it contains, which obviously provides many benefits.

The Data Catalog can use this metadata, for example, to share the information (and any alerts it may identify) with data consumers. It can also use it to adjust its search and recommendation engine and thus steer users towards higher-quality datasets.

      And both solutions can be integrated at little cost with a couple of APIs here and there.
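
As an illustration, here is a hypothetical sketch of such an integration, in which a DQM tool pushes a quality score onto the corresponding dataset entry of a Data Catalog through a REST call. The endpoint, payload, and authentication scheme are placeholders, not any real product’s API.

```python
# Hypothetical integration: push the latest quality score produced by a DQM
# tool onto the corresponding dataset entry in a Data Catalog via a REST API.
# The URL, payload shape, and authentication are placeholders, not a real API.
import requests

def push_quality_score(catalog_url: str, dataset_id: str, score: float, token: str) -> None:
    response = requests.patch(
        f"{catalog_url}/api/datasets/{dataset_id}/metadata",
        json={"quality_score": score, "source": "dqm-tool"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()

# Example usage (placeholder values):
# push_quality_score("https://catalog.example.com", "sales.orders", 0.97, "API_TOKEN")
```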

       

      Take Away

Data quality needs to be assessed as early as possible in the data pipelines.

The role of the Data Catalog is not to perform quality control but to share the results of these controls as widely as possible. By their nature, Data Catalogs are poor DQM solutions, and DQM solutions are mediocre and overly complex Data Catalogs.

      An integration between a DQM solution and a Data Catalog is very straightforward and is the most pragmatic approach.

      Download our eBook: The 7 lies of Data Catalog Providers for more!