HCLSoftware Completes Acquisition of Metadata Management Software Provider Zeenea

by Anthony Corbeaux | Sep 12, 2024 | News & events

Acquisition will enable Actian, a division of HCLSoftware, to offer customers a complete data ecosystem.

SANTA CLARA, Calif., September 12, 2024 – HCLSoftware, the software business division of HCLTech, today announced that it completed the acquisition of Zeenea, an innovator in data catalog and governance solutions based in Paris, France. The acquisition of Zeenea enables Actian, a division of HCLSoftware, to offer a unified data intelligence and governance solution that empowers customers to seamlessly discover, govern, and maximize the value of their data assets. It also further extends Actian’s presence, workforce, and customer base in Europe.

“To become data-driven, organizations of all sizes need data governance to ensure the most effective and efficient use of quality data throughout its life cycle,” said Marc Potter, CEO at Actian. “With Zeenea as part of our portfolio, Actian offers customers a complete data ecosystem – making Actian a one-stop shop for all things data. Together, we will help our customers propel their GenAI and analytics initiatives forward by boosting confidence in data preparation and enhancing data readiness.”

According to ISG Research, 85% of enterprises believe that investment in generative AI technology in the next 24 months is critical. Data governance and data quality work in tandem to ensure the data feeding the GenAI solutions is accurate, complete, fit-for-purpose, and used according to governance policies.

Zeenea is recognized for its cloud-native Data Discovery Platform with universal connectivity that supports metadata management applications from search and exploration to data catalog, lineage, governance, compliance and enterprise data marketplace. Powered by an adaptive knowledge graph, Zeenea enables organizations to democratize data access and generate a 360-degree view of their assets, including the relationships between them.

About Actian

Actian makes data easy. We deliver a complete data solution that simplifies how people connect, manage, govern and analyze data. We transform business by enabling customers to make confident, intelligent, data-driven decisions that accelerate their organization’s growth. Our data platform integrates seamlessly, performs reliably, and delivers industry-leading speeds at an affordable cost. Actian is a division of HCLSoftware.

About HCLSoftware

HCLSoftware is the software business division of HCLTech, serving more than 7,000 organizations in 130 countries in five key areas: Data and Analytics; Business and Industry Applications (including Commerce, MarTech Automation); Intelligent Operations; Total Experience; and Cybersecurity.

https://www.hcl-software.com/

For further details, please contact:

Danielle Lee, Actian – Danielle.Lee@actian.com

Jeremy McNeive, HCLSoftware – jeremy.mcneive@hcl.com

Harnessing the Power of AI in Data Cataloging

by Zeenea Software | Jul 8, 2024 | Data Catalog

In today’s era of expansive data volumes, AI stands at the forefront of revolutionizing how organizations manage and extract value from diverse data sources. Effective data management becomes paramount as businesses grapple with the challenge of navigating vast amounts of information. At the heart of these strategies lies data cataloging—an essential tool that has evolved significantly with the integration of AI, with promises of efficiency, accuracy, and actionable insights. Let’s see how in this article.

The benefits of AI in data cataloging

 

AI revolutionizes data cataloging by automating and enhancing traditionally manual processes, thereby accelerating efficiency and improving data accuracy across various functions:

Automated metadata generation

 

AI algorithms autonomously generate metadata by analyzing and interpreting data assets. This includes identifying data types, relationships, and usage patterns. Machine learning models infer implicit metadata, ensuring comprehensive catalog coverage. Automated metadata generation reduces the burden on data stewards and ensures consistency and completeness in catalog entries. This capability is especially valuable in environments with rapidly expanding data volumes, where manual metadata creation is no longer practical.
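To make this idea concrete, here is a minimal sketch in Python of the kind of inference an automated cataloger might perform on a data sample. The column names, sample values, and derived fields are illustrative assumptions, not a description of any particular product:

```python
import pandas as pd

def infer_column_metadata(df: pd.DataFrame) -> list[dict]:
    """Derive basic technical metadata from a sample of a dataset."""
    entries = []
    for col in df.columns:
        series = df[col]
        entries.append({
            "name": col,
            "inferred_type": str(series.dtype),  # e.g. int64, float64, object
            "null_ratio": round(float(series.isna().mean()), 3),
            "distinct_values": int(series.nunique()),
        })
    return entries

# A small, made-up sample standing in for a profiled data source.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["FR", "FR", "DE", "US"],
    "amount": [10.5, 22.0, None, 7.9],
})

for entry in infer_column_metadata(sample):
    print(entry)
```

A real system would add relationship detection and usage statistics on top of this kind of per-column profiling.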

Simplified data classification and tagging

 

AI facilitates precise data classification and tagging using natural language processing (NLP) techniques. By understanding contextual nuances and semantics, AI enhances categorization accuracy, which is particularly beneficial for unstructured data formats such as text and multimedia. Advanced AI models can learn from historical tagging decisions and user feedback to improve classification accuracy. This capability simplifies data discovery processes and enhances data governance by consistently and correctly categorizing data.
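As a deliberately simplified illustration, the sketch below stands in for an NLP classifier with plain keyword matching; the tag names and keyword lists are assumptions chosen for the example, and a production system would learn them from historical tagging decisions:

```python
# Hypothetical keyword lists standing in for a trained NLP classifier.
TAG_KEYWORDS = {
    "finance": {"invoice", "payment", "revenue", "ledger"},
    "customer": {"customer", "client", "subscriber"},
    "logistics": {"shipment", "warehouse", "delivery"},
}

def suggest_tags(description: str) -> list[str]:
    """Suggest catalog tags from a free-text asset description."""
    words = {word.strip(".,").lower() for word in description.split()}
    return [tag for tag, keywords in TAG_KEYWORDS.items() if words & keywords]

print(suggest_tags("Monthly invoice and payment history per customer"))
# ['finance', 'customer']
```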

Enhanced Search capabilities

 

AI-powered data catalogs feature advanced search capabilities that enable swift and targeted data retrieval. AI recommends relevant data assets and related information by understanding user queries and intent. Through techniques such as relevance scoring and query understanding, AI ensures that users can quickly locate the most pertinent data for their needs, thereby accelerating insight generation and reducing time spent on data discovery tasks.
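A rough feel for relevance scoring can be given with TF-IDF and cosine similarity (using scikit-learn here); the catalog entries and the query are invented for the example, and a real catalog would combine many more signals such as popularity and lineage:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog entries (titles and descriptions).
documents = [
    "Customer orders dataset with daily order amounts and delivery status",
    "HR payroll table containing salaries and employee identifiers",
    "Marketing campaign performance metrics by channel and week",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

def search(query: str, top_k: int = 2):
    """Rank catalog entries by cosine similarity to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:top_k]

for score, doc in search("weekly marketing performance"):
    print(f"{score:.2f}  {doc}")
```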

Robust Data lineage and governance

 

AI is crucial in tracking data lineage by tracing its origins, transformations, and usage history. This capability ensures robust data governance and compliance with regulatory standards. Real-time lineage updates provide a transparent view of data provenance, enabling organizations to maintain data integrity and traceability throughout its lifecycle. AI-driven lineage tracking is essential in environments where data flows through complex pipelines and undergoes multiple transformations, ensuring all data usage is documented and auditable.
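At its simplest, lineage can be represented as a graph of assets and their upstream sources. The sketch below runs an upstream (provenance) traversal over a hypothetical set of edges; real lineage is usually harvested automatically from pipelines rather than declared by hand:

```python
# Hypothetical lineage edges: each asset maps to its direct upstream sources.
UPSTREAM = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}

def trace_upstream(asset: str) -> set[str]:
    """Return every asset that directly or indirectly feeds the given asset."""
    seen: set[str] = set()
    stack = list(UPSTREAM.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(UPSTREAM.get(current, []))
    return seen

print(sorted(trace_upstream("sales_dashboard")))
# ['crm_export', 'customers_clean', 'orders_clean', 'orders_raw', 'sales_mart']
```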

Intelligent Recommendations

 

AI-driven recommendations empower users by suggesting optimal data sources for analyses and identifying potential data quality issues. These insights derive from historical data usage patterns. Machine learning algorithms analyze past user behaviors and data access patterns to recommend datasets that are likely to be relevant or valuable for specific analytical tasks. By proactively guiding users toward high-quality data and minimizing the risk of using outdated or inaccurate information, AI enhances the overall effectiveness of data-driven operations.

Anomaly Detection

 

AI-powered continuous monitoring detects anomalies indicative of data quality issues or security threats. Early anomaly detection facilitates timely corrective actions, safeguarding data integrity and reliability. AI-powered anomaly detection algorithms utilize statistical analysis and machine learning techniques to identify deviations from expected data patterns.

This capability is critical in detecting data breaches, erroneous data entries, or system failures that could compromise data quality or pose security risks. By alerting data stewards to potential issues in real-time, AI enables proactive management of data anomalies, thereby mitigating risks and ensuring data consistency and reliability.
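As a minimal example of the statistical side of this, the sketch below flags a daily table load whose row count deviates sharply from recent history. The figures and the z-score threshold are illustrative assumptions; production systems typically use more robust models:

```python
from statistics import mean, stdev

def zscore_anomalies(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of values that deviate strongly from the sample mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Daily row counts for a table; the last load is suspiciously small.
row_counts = [10_250, 10_340, 10_290, 10_310, 10_275, 120]
print(zscore_anomalies(row_counts, threshold=2.0))  # [5]
```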

The challenges and considerations of AI in data cataloging

 

Despite its advantages, AI-enhanced data cataloging presents challenges requiring careful consideration and mitigation strategies.

Data Privacy and Security

 

Protecting sensitive information requires robust security measures and compliance with data protection regulations such as GDPR. AI systems must ensure data anonymization, encryption, and access control to safeguard against unauthorized access or data breaches.

Scalability

 

Implementing AI at scale demands substantial computational resources and scalable infrastructure capable of handling large volumes of data. Organizations must invest in robust IT frameworks and cloud-based solutions to support AI-driven data cataloging initiatives effectively.

Data Integration

 

Harmonizing data from disparate sources into a cohesive catalog remains complex, necessitating robust integration frameworks and data governance practices. AI can facilitate data integration by automating data mapping and transformation processes. However, organizations must ensure compatibility and consistency across heterogeneous data sources.

 

In conclusion, AI’s integration into data cataloging represents a transformative leap in data management, significantly enhancing efficiency and accuracy. AI automates critical processes and provides intelligent insights to empower organizations to exploit their data assets fully in their data catalog. Furthermore, overcoming data privacy and security challenges is essential for successfully integrating AI. As AI technology advances, its role in data cataloging will increasingly drive innovation and strategic decision-making across industries.

[SERIES] Data Shopping Part 2 – The Zeenea Data Shopping Experience

by Zeenea Software | Jun 24, 2024 | Data Catalog, Data Mesh

Just as shopping for goods online involves selecting items, adding them to a cart, and choosing delivery and payment options, the process of acquiring data within organizations has evolved in a similar manner. In the age of data products and data mesh, internal data marketplaces enable business users to search for, discover, and access data for their use cases.

In this series of articles, get an excerpt from our Practical Guide to Data Mesh and discover all there is to know about data shopping as well as Zeenea’s Data Shopping experience in its Enterprise Data Marketplace:

  1. How to shop for data products
  2. The Zeenea Data Shopping experience

 

—–

 

In our previous article, we discussed the concept of data shopping within an internal data marketplace, addressing elements such as data product delivery and access management. In this article, we will explore the reason behind Zeenea’s decision to extend its data shopping experience beyond internal boundaries, as well as how our interface, Zeenea Studio, enables the analysis of the overall performance of your data products.

Data Product Shopping in Zeenea

 

In our previous article, we discussed the complexities of access rights management for data products due to the inherent risks of data consumption. In a decentralized data mesh, the data product owner assesses risks, grants access, and enforces policies based on the data’s sensitivity, the requester’s role, location, and purpose. This may involve data transformation or additional formalities, with delivery ranging from read-only access to fine-grained controls.

In a data marketplace, consumers trigger a workflow by submitting access requests, which data owners evaluate and determine access rules for, sometimes with expert input. For Zeenea’s marketplace, we have chosen not to integrate this workflow directly into the solution but rather to interface with external solutions.

The idea is to offer a uniform experience for triggering an access request, while accepting that the processing of this request may differ greatly from one environment to another, or even from one domain to another within the same organization. This principle is inherited from classical marketplaces: most of them offer a single experience for making a purchase but connect to other systems for the operational implementation of delivery, the modalities of which can vary widely depending on the product and the seller.

This decoupling between the shopping experience and the operational implementation of delivery seems essential to us for several reasons.

The main reason is the extreme variability of the processes involved. Some organizations already have operational workflows, relying on a larger solution (data access requests are integrated into a general access request process, supported, for example, by a ticketing tool such as ServiceNow or Jira). Others have dedicated solutions supporting a high level of automation but whose deployment is not yet widespread. Still others rely on the capabilities of their data platform, and some on nothing at all – access is obtained through direct requests to the data owner, who handles them without a formal process. This variability is evident from one organization to another, but also within the same organization – structurally, when different domains use different technologies, or temporally, when the organization decides to invest in a more efficient or secure system and must gradually migrate access management to this new system.

Decoupling, therefore, allows offering a consistent experience to the consumer while adapting to the variability of operational methods.

For a data marketplace customer, the shopping experience is very simple. Once they have identified the data product(s) of interest, they trigger an access request by providing the following information:

  1. Who they are – this information is already available.
  2. Which data product they want to access – this information is also already available, along with the metadata needed for decision-making.
  3. What they intend to use the data for – this is crucial since it drives risk management and compliance requirements.

With Zeenea, once the access request is submitted, it is processed in another system, and its status can be tracked from the marketplace – this is the direct equivalent of order tracking found on e-commerce sites.
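Because the request boils down to these three pieces of information handed off to an external system, the integration can be pictured very simply. The endpoint, field names, and ticket format below are hypothetical, not Zeenea’s actual API:

```python
import json
import urllib.request

def submit_access_request(requester: str, data_product: str, purpose: str,
                          workflow_url: str) -> str:
    """Hand an access request to an external workflow tool and return the
    ticket identifier used afterwards for status tracking."""
    payload = {
        "requester": requester,        # who is asking
        "data_product": data_product,  # what they want to access
        "intended_use": purpose,       # why - drives risk and compliance review
    }
    request = urllib.request.Request(
        workflow_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["ticket_id"]

# Usage against a hypothetical ticketing endpoint:
# ticket = submit_access_request(
#     "jane.doe", "sales/orders-v2", "churn analysis",
#     "https://tickets.example.com/api/requests",
# )
```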

From the consumer’s perspective, the data marketplace provides a catalog of data products (and other digital products) and a simple, universal system for gaining access to these products.

For the producer, the marketplace plays a fundamental role in managing their product portfolio.

Enhance Data Product performance with Zeenea Studio

 

As mentioned earlier, in addition to the e-commerce system, which is intended for consumers, a classical marketplace also offers tools dedicated to sellers, allowing them to supervise their products, respond to buyer inquiries, and monitor the economic performance of their offerings. It also provides tools intended for marketplace managers to analyze the overall performance of products and sellers.

Zeenea’s Enterprise Data Marketplace integrates these capabilities into a dedicated back-office tool, Zeenea Studio. It allows for managing the production, consolidation, and organization of metadata in a private catalog and deciding which objects will be placed in the marketplace – which is a searchable space accessible to the widest audience.

These activities primarily fall under the production process – metadata are produced and organized together with the data products. However, it also allows for monitoring the use of each data product, notably by providing a list of all its consumers and the uses associated with them.

This consumer tracking helps establish the two pillars of data mesh governance:

  • Compliance and risk management – by conducting regular reviews, certifications, and impact analyses during data product changes.
  • Performance management – the number of consumers, as well as the nature of their uses, are the main indicators of a data product’s value. Indeed, a data product that is not consumed has no value.

As a support tool for domains to control the compliance of their products and their performance, Zeenea’s Enterprise Data Marketplace also offers comprehensive analysis capabilities of the mesh – the lineage of data products, scoring, and evaluation of their performance, control of overall compliance and risks, regulatory reporting elements, etc.

This is the magic of the federated graph, which allows for exploiting information at all scales and provides a comprehensive representation of the entire data landscape.

The Practical Guide to Data Mesh: Setting up and Supervising an enterprise-wide Data Mesh

 

Written by Guillaume Bodet, co-founder & CPTO at Zeenea, our guide was designed to arm you with practical strategies for implementing data mesh in your organization, helping you:

✅ Start your data mesh journey with a focused pilot project
✅ Discover efficient methods for scaling up your data mesh
✅ Acknowledge the pivotal role an internal marketplace plays in facilitating the effective consumption of data products
✅ Learn how Zeenea emerges as a robust supervision system, orchestrating an enterprise-wide data mesh

Get the ebook
[SERIES] Building a Marketplace for Data Mesh Part 3: Feeding the Marketplace via domain-specific data catalogs

by Zeenea Software | Jun 10, 2024 | Data Catalog, Data Mesh

Over the past decade, data catalogs have emerged as important pillars in the landscape of data-driven initiatives. However, many vendors on the market fall short of expectations with lengthy timelines, complex and costly projects, bureaucratic data governance models, poor user adoption rates, and low-value creation. This discrepancy extends beyond metadata management projects, reflecting a broader failure at the data management level.

Given these shortcomings, a new concept is gaining popularity: the internal marketplace, or what we call the Enterprise Data Marketplace (EDM) at Zeenea.

In this series of articles, get an excerpt from our Practical Guide to Data Mesh where we explain the value of internal data marketplaces for data product production and consumption, how an EDM supports data mesh exploitation on a larger scale, and how they go hand-in-hand with a data catalog solution:

  1. Facilitating data product consumption through metadata
  2. Setting up an enterprise-level marketplace
  3. Feeding the marketplace via domain-specific data catalogs

—–

 

Structuring data management around domains and data products is an organizational transformation that does not change the operational reality of most organizations: data is available in large quantities, from numerous sources, evolves rapidly, and its control is complex.

Data catalogs traditionally serve to inventory all available data and manage a set of metadata to ensure control and establish governance practices.

Data mesh does not eliminate this complexity: it allows certain data, managed as data products, to be distinguished and intended for sharing and use beyond the domain to which they belong. But each domain is also responsible for managing its internal data, the data that will be used to develop robust and high-value data products – its proprietary data, in other words.

Metadata management in the context of an internal marketplace fed by domain-specific catalogs

 

In the data mesh, the need for a Data Catalog does not disappear, quite the contrary: each domain should have a catalog allowing it to efficiently manage its proprietary data, support domain governance, and accelerate the development of robust and high-value data products. Metadata management is thus done at two levels:

  • At the domain level – in the form of a catalog allowing the documentation and organization of the domain’s data universe. Since each domain’s catalog is private to that domain, it is not necessary for all domains to use the same solution.
  • At the mesh level – in the form of a marketplace in which the data products shared by all domains are registered; the marketplace is naturally common to all domains.

With a dedicated marketplace component, the general architecture for metadata management is as follows:

[Figure: general architecture for metadata management]

In this architecture, each domain has its own catalog – which may or may not rely on a shared solution – instantiated separately for each domain so that it can organize its data most effectively and avoid the pitfalls of a universal metadata organization.

The marketplace is a dedicated component, offering simplified ergonomics, into which each domain deploys the metadata (or even the data) of its data products. This approach requires close integration of the different modules (a minimal synchronization sketch follows the list below):

  • Domain catalogs must be integrated with the marketplace to avoid duplicating efforts in producing certain metadata – especially lineage, but also data dictionaries (schema), or even business definitions that will be present in both systems.
  • Domain catalogs potentially need to be integrated with each other – to share and synchronize certain information, primarily the business glossary but also some shared repositories.
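As an illustration of the first integration point, here is a minimal sketch that publishes the marketplace-facing subset of a domain catalog record into a shared index. The record structure and field names are assumptions made for the example, not an actual Zeenea schema:

```python
from dataclasses import dataclass

@dataclass
class DataProductEntry:
    """Marketplace-facing subset of a domain catalog record (illustrative)."""
    domain: str
    name: str
    description: str
    glossary_terms: list[str]
    schema_fields: list[str]

def publish_to_marketplace(entry: DataProductEntry, marketplace: dict) -> None:
    """Upsert a shared data product into the cross-domain marketplace index,
    so its schema and glossary links are not re-documented by hand."""
    key = f"{entry.domain}/{entry.name}"
    marketplace[key] = {
        "description": entry.description,
        "glossary_terms": entry.glossary_terms,
        "schema": entry.schema_fields,
    }

marketplace_index: dict = {}
publish_to_marketplace(
    DataProductEntry(
        domain="sales",
        name="orders-v2",
        description="Cleansed order lines, refreshed daily",
        glossary_terms=["Order", "Net revenue"],
        schema_fields=["order_id", "customer_id", "net_amount"],
    ),
    marketplace_index,
)
print(marketplace_index)
```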

Data catalog vs EDM capabilities

 

When we look at the respective capabilities of an Enterprise Data Marketplace and a Data Catalog, we realize that these capabilities are very similar:

[Figure: Data Catalog vs. Enterprise Data Marketplace capabilities]

In the end, on a strictly functional level, their capabilities are very similar. What distinguishes a modern Data Catalog from an EDM comes down to two things:

 

  • Their scope – The Data Catalog is intended to cover all data, whereas the marketplace is limited to the objects shared by domains (data products and other domain analytics products).

 

  • Their user experience – The Data Catalog is often a fairly complex tool, designed to support governance processes globally – it focuses on data stewardship workflows. The marketplace, on the other hand, typically offers very simple ergonomics, heavily inspired by that of an e-commerce platform, and provides an experience centered on consumption – data shopping.

The Practical Guide to Data Mesh: Setting up and Supervising an enterprise-wide Data Mesh

 

Written by Guillaume Bodet, co-founder & CPTO at Zeenea, our guide was designed to arm you with practical strategies for implementing data mesh in your organization, helping you:

✅ Start your data mesh journey with a focused pilot project
✅ Discover efficient methods for scaling up your data mesh
✅ Acknowledge the pivotal role an internal marketplace plays in facilitating the effective consumption of data products
✅ Learn how Zeenea emerges as a robust supervision system, orchestrating an enterprise-wide data mesh

Get the ebook
[SERIES] Building a Marketplace for Data Mesh Part 2: Setting up an enterprise-level marketplace

by Zeenea Software | Jun 3, 2024 | Data Catalog, Data Mesh

Over the past decade, data catalogs have emerged as important pillars in the landscape of data-driven initiatives. However, many vendors on the market fall short of expectations with lengthy timelines, complex and costly projects, bureaucratic data governance models, poor user adoption rates, and low-value creation. This discrepancy extends beyond metadata management projects, reflecting a broader failure at the data management level.

Given these shortcomings, a new concept is gaining popularity: the internal marketplace, or what we call the Enterprise Data Marketplace (EDM) at Zeenea.

In this series of articles, get an excerpt from our Practical Guide to Data Mesh where we explain the value of internal data marketplaces for data product production and consumption, how an EDM supports data mesh exploitation on a larger scale, and how they go hand-in-hand with a data catalog solution:

  1. Facilitating data product consumption through metadata
  2. Setting up an enterprise-level marketplace
  3. Feeding the marketplace via domain-specific data catalogs

—–

 

As described in our previous article, an Enterprise Data Marketplace is a simple system in which consumers can search the data product offering for one or more products eligible for a specific use case, review the information related to these products, and then order them. The order materializes as access opening, physical data delivery, or even a request for data product evolution to cover the new use case.

Three main options for setting up an internal data marketplace

 

When establishing an internal data marketplace, organizations typically consider three primary approaches:

Develop it

 

This approach involves building a custom data marketplace tailored to the organization’s unique requirements. While offering the potential for a finely tuned user experience, this option often entails significant time and financial investment.

Integrate a solution from the market

 

Alternatively, organizations can opt for pre-existing solutions available in the market. Originally designed for data commercialization or external data exchange, these solutions can be repurposed for internal use. However, they may require customization to align with internal workflows and security standards.

Use existing systems

 

Some organizations choose to leverage their current infrastructure by repurposing tools such as data catalogs and corporate wikis. While this approach may offer familiarity and integration with existing workflows, it might lack the specialized features of dedicated data marketplace solutions.

The drawbacks of commercial marketplaces

 

Although they generally offer a satisfying user experience and native support for the data product concept, commercial marketplaces have significant drawbacks: highly focused on transactional aspects (distribution, licensing, contracting, purchase or subscription, payment, etc.), they are often poorly integrated with internal data platforms and access control tools. They generally require data to be distributed by the marketplace, meaning they constitute a new infrastructure component onto which data must be transferred and shared (such a system is sometimes called a Data Sharing Platform).

Zeenea’s Enterprise Data Marketplace

 

In a pragmatic approach, it is not desirable to introduce a new infrastructure component to deploy a data mesh – it seems highly preferable to leverage existing capabilities as much as possible.

Therefore, at Zeenea, we’ve evolved our data discovery platform and data catalog to offer a unique solution, one that mirrors the data mesh at the metadata level to continually adapt to the organization’s evolving data platform architecture. This Enterprise Data Marketplace (EDM) integrates a cross-domain marketplace with private data catalogs tailored to each domain’s needs.

This approach, which we detail in the next article of our series, is made possible by what has long distinguished Zeenea from most other data catalog and metadata management platform vendors: an evolving knowledge graph.

In our final article, discover how an internal data marketplace, paired with domain-specific catalogs, provides a comprehensive data mesh supervision system.

The Practical Guide to Data Mesh: Setting up and Supervising an enterprise-wide Data Mesh

 

Written by Guillaume Bodet, co-founder & CPTO at Zeenea, our guide was designed to arm you with practical strategies for implementing data mesh in your organization, helping you:

✅ Start your data mesh journey with a focused pilot project
✅ Discover efficient methods for scaling up your data mesh
✅ Acknowledge the pivotal role an internal marketplace plays in facilitating the effective consumption of data products
✅ Learn how Zeenea emerges as a robust supervision system, orchestrating an enterprise-wide data mesh

Get the ebook
[SERIES] Building a Marketplace for data mesh Part 1: Facilitating data product consumption through metadata

by Zeenea Software | May 28, 2024 | Data Catalog, Data Mesh

Over the past decade, data catalogs have emerged as important pillars in the landscape of data-driven initiatives. However, many vendors on the market fall short of expectations with lengthy timelines, complex and costly projects, bureaucratic data governance models, poor user adoption rates, and low-value creation. This discrepancy extends beyond metadata management projects, reflecting a broader failure at the data management level.

Given these shortcomings, a new concept is gaining popularity: the internal marketplace, or what we call the Enterprise Data Marketplace (EDM) at Zeenea.

In this series of articles, get an excerpt from our Practical Guide to Data Mesh where we explain the value of internal data marketplaces for data product production and consumption, how an EDM supports data mesh exploitation on a larger scale, and how they go hand-in-hand with a data catalog solution:

  1. Facilitating data product consumption through metadata
  2. Setting up an enterprise-level marketplace
  3. Feeding the marketplace via domain-specific data catalogs

—–

 

Before diving into the internal marketplace, let’s quickly go back to the notion of a data product, which we believe is the cornerstone of the data mesh and the first step in transforming data management.

 

Sharing and exploiting data products through metadata

 

As mentioned in our previous series on data mesh, a data product is a governed, reusable, scalable dataset offering guarantees of data quality and of compliance with various regulations and internal rules. Note that this definition is quite restrictive – it excludes other types of products such as machine learning algorithms, models, or dashboards.

While these artifacts should be managed as products, they are not data products – they belong to a broader family that could be very generally termed “Analytics Products”, of which data products are one subset.

In practice, an operational data product consists of two things:

  • Data – materialized on a centralized or decentralized data platform, guaranteeing data addressing, interoperability, and access security.
  • Metadata – providing all the necessary information for sharing and using the data.

Metadata ensures consumers have all the information they need to use the product.

It typically covers the following aspects:

Schema – providing the technical structure of the data product, data classification, samples, and their origin (lineage).

Governance – identifying the product owner(s), its successive versions, its possible deprecation, etc.

Semantics – providing a clear definition of the exposed information, ideally linked to the organization’s business glossary and comprehensive documentation of the data product.

Contract – defining quality guarantees, consumption modalities (protocols and security), potential usage restrictions, redistribution rules, etc.
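These four aspects can be pictured as a single descriptor deployed alongside the data product. The sketch below is a hypothetical example of what such a descriptor might contain; the field names and values are illustrative and do not follow any particular standard:

```python
data_product_descriptor = {
    "name": "customer-360",
    "schema": {
        "fields": [
            {"name": "customer_id", "type": "string", "classification": "identifier"},
            {"name": "lifetime_value", "type": "decimal", "classification": "metric"},
        ],
        "lineage": ["crm.contacts", "billing.invoices"],
    },
    "governance": {
        "owner": "customer-domain-team",
        "version": "2.1.0",
        "deprecated": False,
    },
    "semantics": {
        "definition": "One row per active customer with consolidated value metrics",
        "glossary_terms": ["Customer", "Lifetime value"],
    },
    "contract": {
        "freshness_sla_hours": 24,
        "access_protocols": ["jdbc", "rest"],
        "usage_restrictions": ["no redistribution outside the organization"],
    },
}

print(data_product_descriptor["governance"]["owner"])
```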

In the data mesh logic, these metadata are managed by the product team and are deployed according to the same lifecycle as data and pipelines. There remains a fundamental question: where can metadata be deployed?

Using a data marketplace to deploy metadata

 

Most organizations already have a metadata management system, usually in the form of a Data Catalog.

But data catalogs, in their current form, have major drawbacks:

 

Lack of data product support

They don’t always support the notion of a data product – it must be more or less emulated with other concepts.

Complex to use

They are complex to use – designed to catalog a large number of assets with sometimes very fine granularity, they often suffer from a lack of adoption beyond centralized data management teams.

Rigid organization

They mostly impose a rigid and unique organization of data, decided and designed centrally – which fails to reflect the variety of different domains or the organization’s evolution as the data mesh expands.

Limited search capabilities

Their search capabilities are often limited, particularly for exploratory aspects – it’s often necessary to know what you’re looking for to be able to find it.

Lack of simplicity

The experience they offer sometimes lacks the simplicity users aspire to – search with a few keywords, identify the appropriate data product, and then trigger the operational process of an access request or data delivery.

The internal marketplace, or Enterprise Data Marketplace (EDM), is therefore a new concept gaining popularity in data mesh circles. Like a general-purpose marketplace, the EDM aims to provide a shopping experience for data consumers. It is thus an essential component to ensure the exploitation of the data mesh on a larger scale – it allows data consumers to have a simple and effective system to search for and access data products from various domains.

 

In our next article, learn the different ways to set up an internal data marketplace, and how it is essential for data mesh exploitation.

The Practical Guide to Data Mesh: Setting up and Supervising an enterprise-wide Data Mesh

 

Written by Guillaume Bodet, co-founder & CPTO at Zeenea, our guide was designed to arm you with practical strategies for implementing data mesh in your organization, helping you:

✅ Start your data mesh journey with a focused pilot project
✅ Discover efficient methods for scaling up your data mesh
✅ Acknowledge the pivotal role an internal marketplace plays in facilitating the effective consumption of data products
✅ Learn how Zeenea emerges as a robust supervision system, orchestrating an enterprise-wide data mesh

Get the ebook
Data Mesh 101: Best Practices for Metadata Management

by Zeenea Software | Jan 14, 2024 | Data Mesh, Metadata Management

In the ever-evolving landscape of data management, organizations are shifting towards new innovative approaches to tackle the complexities of their data landscapes. One such notable trend gaining substantial momentum is the concept of Data Mesh – a decentralized approach to data architecture, emphasizing autonomous, domain-oriented data products.

As we embark on this journey of decentralized data, let’s dig into the vital role of metadata and the importance of effectively managing it in the context of Data Mesh.

The role of Metadata

 

Metadata, often referred to as ‘data about data,’ plays a fundamental role in shaping a functional data ecosystem. It extends beyond the simple task of describing datasets; rather, it involves understanding the data’s origins, quality, transformations, etc. The different types of metadata include:

  • Technical Metadata: focuses on the technical aspects of data, such as data formats, schema, data lineage, and storage details.
  • Business Metadata: Business metadata revolves around the business context of data. It includes information about data ownership, business rules, data definitions, and any other details that help align data assets with business objectives.
  • Operational Metadata: Operational metadata provides insights into the day-to-day operations related to data. This includes information about data processing workflows, data refresh schedules, and any operational dependencies.
  • Collaborative Metadata: Collaborative metadata captures information about user interactions, annotations, and comments related to data assets.

In the decentralized framework of Data Mesh, these different types of metadata serve as the link bridging data domains. As data moves among different teams, metadata becomes the guide, helping everyone navigate the diverse data landscape. Metadata therefore acts as a valuable aid by providing insights into the structure and content of data assets, and it facilitates data discovery by making it easier for users to discern and locate the specific data that aligns with their needs.

Additionally, metadata forms the basis for data governance, providing a framework for enforcing quality standards, security protocols, and compliance measures uniformly across diverse domains. It plays a critical role in access control and ensures that users are not only informed but also adhere to the defined access policies.

Challenges of Managing Metadata in Data Mesh

 

One significant challenge stems from the decentralized nature of a Data Mesh. In a traditional centralized data architecture, metadata management is often handled by a dedicated team or department, ensuring consistency and standardization. However, in a Data Mesh, each domain team is responsible for managing its own metadata. This decentralized approach can lead to variations in metadata practices across different domains, making it challenging to maintain uniform standards and enforce data governance policies consistently.

The diversity of data sources and domains within a Data Mesh is another notable challenge in metadata management. Different domains may use various tools, schemas, and structures for organizing and describing their data. Managing metadata across these diverse sources requires establishing common metadata standards and ensuring compatibility, which can be a complex and time-consuming task. The heterogeneity of data sources adds a layer of intricacy to the creation of a cohesive and standardized metadata framework.

Ensuring consistency and quality across metadata is an ongoing challenge in a Data Mesh environment. With multiple domain teams independently managing their metadata, maintaining uniformity becomes a constant effort – inconsistencies in metadata can lead to misunderstandings, misinterpretations, and errors in data analysis.

Best Practices for Managing Metadata in Data Mesh

 

To overcome these challenges, here are some best practices for managing metadata for your organization.

First, establishing clear and standardized metadata definitions across diverse domains is essential for ensuring consistency, interoperability, and a shared understanding of data elements. Clear definitions provide a common language and framework that ensures consistency in how data is described and understood across the organization.

Furthermore, standardized metadata definitions play a pivotal role in data governance. They provide a basis for uniformly enforcing data quality standards, security protocols, and compliance measures across diverse domains. This ensures that data is not only described consistently but also adheres to organizational policies and regulatory requirements, contributing to a robust and compliant data ecosystem.

However, it’s equally important to empower domain teams with ownership of their metadata. This decentralized approach fosters a sense of responsibility and expertise among those who know the data best. By giving domain teams control over their metadata, organizations leverage their specific knowledge to ensure accuracy, consistency, and trustworthiness across all data domains. This approach promotes adaptability within individual domains, contributing to a more reliable and informed data management strategy.

This dual strategy allows for both centralized governance, ensuring organization-wide standards, and decentralized ownership, promoting agility and domain-specific knowledge within the landscape of a Data Mesh.

The Guide to Understanding the Difference Between a Business Glossary, a Data Catalog, and a Data Dictionary

by Zeenea Software | May 21, 2023 | Data Catalog, Metadata Management

You’ve put data at the center of your company’s business strategy, but the amount of data you have to handle is exploding. You therefore need not only 360° visibility over your data portfolio but also a vision of the uses that can be made of it.

To do this, you can combine the actions and benefits of three essential pillars: the data catalog, the data dictionary, and the business glossary. Read this article to discover more.

Producing data is great. Gaining business knowledge from it is even better! Because the successful implementation of a data culture is a top priority of your business strategy, you need to transform the available information into operational tools for decision-making. By bridging data and business, you will give your company (and your teams!) a new impetus.

But to achieve this, you must rely on three essential pillars: a data catalog, a data dictionary, and a business glossary. These three tools will help you organize and improve your data management strategy – and although they are related, they are actually quite different!

What is a data catalog and what are its benefits?

 

A data catalog is a detailed inventory that lists all information from all of your organization’s data sources. Once unified in the catalog, it is more accessible, understandable, and actionable by your teams. A data catalog can collect and inventory several types of information such as datasets and their associated fields, data processes, visualizations, glossary objects (see section below), or even custom information specific to your company.

The data catalog plays a crucial role in your data strategy because it allows you to efficiently get an overview of your data, its quality, and availability, as well as its associated metadata such as its definition, associated contacts, provenance, format, etc. Another main advantage of using a data catalog is that it encourages the collaboration and sharing of data in all departments of your organization. It allows your teams to work together to identify, understand, and use data more effectively.

Finally, by centralizing the available information, a data catalog allows you to maintain a high level of data quality by ensuring that data is correctly identified, classified, documented, and maintained.

Why implement a business glossary?

 

A business glossary is an essential component that helps establish a common understanding of business terms and definitions used in the organization. Its role: to facilitate communication and reduce errors or misunderstandings related to the use of your organization’s terms. It can include technical or financial definitions, procedures, or any other subject relevant to your organization!

By having a business glossary, you will almost mechanically improve data quality by ensuring that data is clearly defined and understood. It helps by reducing data entry errors, standardizing data formats, and increasing the reliability and accuracy of data.

Furthermore, the business glossary also helps you better manage regulatory compliance by standardizing the terms and definitions used in compliance reports and documents.

Finally, your Business Glossary contributes to faster and more reliable decision-making by providing a common knowledge base for all stakeholders in the decision-making chain.

What are the differences with a data dictionary?

 

A data dictionary is a third tool that will help you strengthen and boost your data strategy. This data management tool provides detailed information about the data used in your business based on a set of metadata. This metadata describes the data, its structure, its format, its meaning, its owner, and its use.

This description helps your employees, those who use data on a daily basis, to understand the data and to use it appropriately. A data dictionary is also a key tool for data quality management, as it allows you to monitor data quality by identifying errors and inconsistencies. It also facilitates data reuse, providing information about existing data and its meaning, making it easy to integrate into new applications or projects.

Want to give your data strategy a boost? Combining a business glossary, a data catalog, and a data dictionary will give you a complete and consistent view of the data and business terms used in your company.

Metadata management vs. master data management: the differences and similarities

by Zeenea Software | Apr 3, 2023 | Metadata Management

In order to stand out from the competition, offer an adapted customer experience, reinforce innovation, and improve internal processes or production flows, companies are strongly relying on data. Many organizations are looking for ways to better leverage this massive resource and ensure rigorous data governance. In this article, discover the differences as well as the similarities between two concepts that are essential to a data-driven approach: metadata management and master data management.

According to a study entitled “The Strategic Role of Data Governance and its Evolution,” conducted by the Enterprise Strategy Group (ESG) at the end of 2022, companies are seeing their data volumes double every two years. On average, the organizations that were surveyed reported managing 3 petabytes of data, about two-thirds of which is unstructured data. The survey data also shows an average annual increase of 40%. And 32% of respondents even report a yearly increase of more than 50%!

In the context of exponentially increasing data volumes, companies must face a major challenge: ensure optimal data and metadata management, at the risk of exposing themselves to an explosion of costs related to errors. According to recent Gartner estimates, poor data quality costs companies in all industries nearly $13 billion annually. Metadata management and master data management (MDM) provide organizations with critical processes to gain the knowledge they need to meet their market challenges while limiting their exposure to the risk of excess costs.

The definitions of metadata management & master data management

 

First, let’s define each term clearly. Metadata management corresponds to the set of practices and tools that enable the management of metadata of an information system in an efficient and consistent way. As such, metadata management aims to guarantee the quality, relevance, and accessibility of metadata, as well as their compliance with data norms and standards.

Master data management (MDM) brings together all the techniques and processes that enable reference data to be managed in a centralized, consistent, and reliable manner. This reference data, also known as “master data”, is critical information, absolutely essential to the company’s activity. It can be information about customers, suppliers, products, operating and production sites, or data about employees. The purpose of master data management is to build a single repository for this reference data. This repository is then used by the different applications and systems of the company and guarantees access to reliable and consistent data.

What are the differences between metadata management and master data management?

 

Although both concepts are related to data management, metadata management and master data management (MDM) serve different company objectives and take distinct approaches.

While metadata management is primarily concerned with the management of information that describes data, its context, and its use, MDM focuses on the management of business-critical master data. These two different scopes make metadata management and master data management two complementary disciplines for your data strategy. Where metadata management focuses on the description and use of data, MDM focuses on the management and harmonization of business-critical master data.

What do master data management and metadata management have in common?

 

The first thing master data management and metadata management have in common is that they both contribute to the efficiency and success of your data-driven projects. Both aim to guarantee the quality, relevance, and consistency of data. MDM and metadata management also both require dedicated processes and tools. Finally, both disciplines integrate and contribute to a broader data governance approach.

Combined, they allow you to be more agile, more efficient, and more responsible at the same time!

5 essential Zeenea features for a five-star Data Stewardship Program

by Zeenea Software | Jan 27, 2023 | Data Catalog, Data Compliance, Data governance, Metadata Management

You have data – and lots of it. However, it is messy, incomplete, and scattered into several different platforms, databases, and even spreadsheets. On top of this, some of your information is inaccessible, or worse – accessible to the wrong people. And as the go-to data experts of the company, Data Stewards must be able to identify the who, what, when, where, and why of their data to build a reliable stewardship program.

Unfortunately, Data Stewards face a major roadblock to success – the lack of tools to support their role. When dealing with large volumes of data, maintaining data documentation, managing enterprise metadata, and tackling quality & governance issues can be quite challenging.

This is where Zeenea steps in. Our data discovery platform – and its smart and automated metadata management features – facilitates the lives of Data Stewards. Discover 5 of them in this article.

Feature 1: Universal connectivity

Automatically extract and inventory metadata from your data sources

As mentioned above, a lot of enterprise data is spread across many different information sources, making it difficult, even impossible, for Data Stewards to manage and control their data landscape. Zeenea provides a next-generation data cataloging solution that centralizes and unifies all enterprise metadata into a single source of truth. Our platform’s wide range of native connectors automatically retrieves and collects metadata through our APIs and scanners.

[Screenshot: importing connections in Zeenea Studio]

Feature 2: A Flexible & Adaptable Metamodel

Automate data documentation

Documenting information can be extremely time-consuming, with sometimes thousands of properties, fields, and other important metadata that need to be implemented for business teams to fully understand and have the necessary context on the data they are consulting.
Zeenea provides a flexible and adaptable way to build metamodel templates for pre-configured objects (datasets, fields, data processes, etc.) and an unlimited number of custom objects (procedures, rules, KPIs, regulations, etc.).

Import or create your documentation templates by simply dragging and dropping your existing properties, tags, and other custom metadata into your templates. Made a mistake in your template? No problem! Add, remove, or modify your properties and sections as you please – your items are automatically updated after you’ve finished editing them.

[Screenshot: editing a dataset template in Zeenea Studio]
After you’ve defined your templates, easily visualize all the assets that make up your metamodel, as well as their relationships, with our dynamic diagram. Our user-friendly design shows the details of each type of object – their sections and their properties – and updates automatically after each template change. You can also zoom in or out on the object of your choice and export an image of your metamodel.
[Screenshot: the metamodel graph in Zeenea Studio]
Do the same for your Glossary information! We separated the Physical & Logical metamodel from the Glossary metamodel so Data Stewards and other contributors can easily define and find their Business Glossary assets. Using the same process as the Physical & Logical metamodel, create or import semantic objects, organize them in hierarchies, and configure the way your glossary items are mapped with technical assets with our flexible templates.
[Screenshot: the glossary metamodel in Zeenea Studio]

Feature 3: Automatic Data Lineage

Trace your data transformations

In order for Data Stewards to build accurate and trustworthy compliance reports, data lineage capabilities are essential. Many software vendors offer lineage capabilities, but few truly master them. Via a visual and easy-to-interpret lineage graph, Zeenea offers your users the possibility to navigate through the lifecycle of their data. Click on any item to get an overview of its documentation, its relations to other assets, and its metadata, for a 360° view of your catalog items.

[Screenshot: data lineage in Zeenea Studio]

Feature 4: Smart suggestions

Quickly identify personal data

With the GDPR, the California Consumer Privacy Act, and other regulations on the security and privacy of individuals’ information, it can be a hassle to go through every existing set of information to make sure personal data is flagged as such. To ensure your information is always correctly labeled, Zeenea analyzes similarities with existing personal data and suggests which fields to tag as “personal data”. Data Stewards can accept, ignore, or delete suggestions directly from their dashboard.

[Screenshot: field suggestions]
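The idea behind such suggestions can be illustrated very roughly: flag fields whose names resemble those commonly associated with personal data, and let a steward confirm or dismiss each one. The patterns below are invented for the example and are not Zeenea’s actual detection logic, which works on similarities with already-tagged data:

```python
import re

# Illustrative name patterns for fields that often hold personal data.
PERSONAL_FIELD_PATTERNS = [
    r"e[-_]?mail", r"phone", r"birth", r"ssn",
    r"first[-_]?name", r"last[-_]?name",
]

def suggest_personal_fields(field_names: list[str]) -> list[str]:
    """Return field names that look like candidate personal data,
    for a data steward to confirm or dismiss."""
    return [
        name for name in field_names
        if any(re.search(pattern, name.lower()) for pattern in PERSONAL_FIELD_PATTERNS)
    ]

print(suggest_personal_fields(["customer_email", "order_total", "birth_date", "LastName"]))
# ['customer_email', 'birth_date', 'LastName']
```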

Feature 5: An effective permission sets model

Ensure the right people are accessing the right data

For organizations with various types of users accessing their data landscape, it doesn’t make sense to give everyone full access to modify anything and everything, especially when dealing with sensitive or personal information. For this reason, Zeenea designed an effective permission sets model that allows Data Stewards to increase efficiency for the organization and reduce the risk of errors. Assign read-only, editing, and admin rights in all or different parts of the catalog to not only ensure a secure catalog but also save time when data consumers need to find an asset’s referent.

[Screenshot: permission sets administration]
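Conceptually, a permission set maps a role to an access level per catalog scope. The sketch below shows one way such a check could look; the roles, scopes, and levels are assumptions for illustration only:

```python
from enum import Enum

class Permission(Enum):
    READ = 1
    EDIT = 2
    ADMIN = 3

# Hypothetical permission sets: each role gets a level per catalog scope.
PERMISSION_SETS = {
    "data_steward": {"sales": Permission.ADMIN, "hr": Permission.READ},
    "analyst": {"sales": Permission.READ},
}

def can_edit(role: str, scope: str) -> bool:
    """Check whether a role may modify documentation in a given catalog scope."""
    level = PERMISSION_SETS.get(role, {}).get(scope)
    return level is not None and level.value >= Permission.EDIT.value

print(can_edit("data_steward", "sales"))  # True
print(can_edit("analyst", "sales"))       # False
```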

Ready to start your data stewardship program with Zeenea?

If you’re interested in Zeenea’s features for your data documentation & stewardship needs, contact us for a 30-minute personalized demo with one of our data experts.

GET A DEMO

What is the difference between Data Fabric and Data Mesh?

by Zeenea Software | Nov 3, 2022 | Data Inspiration, Data Mesh

At first, organizations were focused on collecting their enterprise data. Now, the challenge is to leverage knowledge out of the data to bring intelligent insights for better decision-making. Numerous technologies and solutions promise to make the most of your data. Among them, we find Data Fabric and Data Mesh. While these concepts may seem similar, there are fundamental differences between these two approaches. Here are some explanations.

It is no secret that the immense volumes of data collected each day have many benefits for organizations. It can bring valuable customer insights so companies can personalize their offers and differentiate themselves from their competitors, for example. However, the growing number of digital uses creates an abundance of information that can be hard to exploit without a solid data structure.

According to Gartner’s forecasts, by 2024, more than 25% of data management solution vendors will provide full data fabric support through a combination of their own and partner products, compared to less than 5% today.

In this context, there are several avenues that can be explored, but two stand out the most: Data Fabric and Data Mesh.

What is a Data Fabric?

The concept of a Data Fabric was introduced by Gartner back in 2019. The research firm describes a Data Fabric as the combined use of multiple existing technologies to enable metadata-driven implementation and augmented design.

In other words, a Data Fabric is an environment in which data and metadata are continuously analyzed for ongoing enrichment and optimal value. But beware! A Data Fabric is not a finished product or solution – it is a scalable environment that relies on a combination of different solutions or applications interacting with each other to refine the data.

A Data Fabric relies on APIs and “no code” technologies that create synergies between various applications and services. These solutions make it possible to transform data and extract its full value throughout its life cycle.

What is Data Mesh?

The concept of Data Mesh was introduced by Zhamak Dehghani of Thoughtworks in 2018. It is a new approach to data architecture and a new mode of organization, based on a mesh of data domains. Data Mesh relies on the creation of a multi-domain data structure: data is mapped, identified, and reorganized according to its use, its target audience, or its potential exploitation. Data Mesh rests on fundamental principles such as domain data ownership, self-service, and interoperability. These principles enable decentralized data management. The advantage? Creating interactions between disparate data domains to generate ever more intelligence.

The key differences between Data Fabric and Data Mesh

To fully understand the differences between Data Fabric and Data Mesh, let’s start by discussing what brings them together. In both cases, there is no such thing as a “ready-to-use” solution.

Where a Data Fabric is based on an ecosystem of various data software solutions, Data Mesh is a way of organizing and governing data. With Data Mesh, data is stored in a decentralized manner within its respective domains. Each node has local storage and computing power, and no single point of control is required for operation.

With a Data Fabric, on the other hand, data access is centralized with clusters of high-speed servers for networking and high-performance resource sharing. There are also differences in terms of data architecture. For example, Data Mesh introduces an organizational perspective, independent of specific technologies. Its architecture follows a domain-centric design and product-centric thinking.

Although they have different rationales, Data Mesh and Data Fabric serve the same company objectives of making the most of your data assets. In this sense, despite their differences, they should not be considered opposites but rather complementary.

The traps to avoid for a successful data catalog project –  Project Leadership

The traps to avoid for a successful data catalog project – Project Leadership

by Zeenea Software | Sep 29, 2022 | Data Catalog, Metadata Management

Metadata management is an important component in a data management project and it requires more than just the data catalog solution, however connected it may be.

A data catalog tool will of course reduce the workload but won’t in and of itself guarantee the success of the project.

In this series of articles, discover the pitfalls and preconceived ideas that should be avoided when rolling out an enterprise-wide data catalog project. The traps described in this series revolve around 4 central themes that are crucial to the success of the initiative:

  1. Data culture within the organization
  2. Internal project sponsorship
  3. Project leadership
  4. Technical integration of the Data Catalog

—

As with all projects, a metadata management initiative has to be properly steered to meet its objectives within the best possible time frame and budget. It is important, however, that the steering of the project doesn’t fall into some of the ruts we describe below.

The quantity of metadata should never become more important than its quality

The purpose of the data catalog is to document the company’s data assets. When the project starts, the absence of information often leads to the same tendency: adding lots of information.

A good data catalog however isn’t characterized by the quantity of objects, but rather by the quality and coherence of the information. These characteristics will require close supervision in identifying priorities both in terms of the perimeters covered and in terms of the information selected for inclusion.

While this may cause frustration, it will very quickly prove effective and crucial to the project’s success. Indeed, users will rightly consider the data catalog a source of truth, in the same way a dictionary is for a language. It is always better to offer, starting with a targeted audience, selected, quality content, thus providing an experience that will bring people back to the tool for future searches. It will ultimately be difficult to keep users interested if their first exposure is a failure.

A data catalog won’t be filled spontaneously, even when it is open to users

The data catalog is open to many users, of which some (sometimes many) have knowledge of the assets in question. That said, spontaneous and regular updating of the data from the start is exceedingly rare.

The reality is quite different: the project needs active support and guidance at the beginning, but also throughout its life.

Both the quality and the quantity of the information have to be supervised, and it is just as important to raise awareness among the contributors, stay present, and educate them. Managing contributions can also be achieved by creating virtuous processes that enable control and invite correction and enrichment of the catalog.

It’s impossible to set all the objectives of the data catalog project from the start without making them evolve

The data catalog has to meet the expectations of many users, all with many requirements.

It is therefore unreasonable to assume that you already have the complete list of expectations at the start of the project, just as it would be naive to believe that this list will remain fixed and immutable. It is, therefore, the role of the Data Office to continuously collect and analyze requirements, interpret them accurately, prioritize them, and transform them into appropriate content.

Generally, requirements evolve according to parameters that are not established at the outset. For instance, the enterprise’s and staff’s maturity with regard to data management will grow over time, as will the development of use cases around data, not to mention changes in data-related regulations.

All these parameters can have a strong impact on the content the data catalog will have to cover, both in terms of scope and in terms of the nature of the information provided for the assets it contains.

The 10 Traps to Avoid for a Successful Data Catalog Project

To learn more about the traps to avoid when starting a data cataloging initiative, download our free eBook!

READ THE EBOOK
10 Traps To Avoid For A Successful Data Catalog Project Mockup

The traps to avoid for a successful data catalog project – Internal sponsorship

The traps to avoid for a successful data catalog project – Internal sponsorship

by Zeenea Software | Sep 29, 2022 | Data Catalog, Metadata Management

Metadata management is an important component in a data management project and it requires more than just the data catalog solution, however connected it may be.

A data catalog tool will of course reduce the workload but won’t in and of itself guarantee the success of the project.

In this series of articles, discover the pitfalls and preconceived ideas that should be avoided when rolling out an enterprise-wide data catalog project. The traps described in this series revolve around 4 central themes that are crucial to the success of the initiative:

  1. Data culture within the organization
  2. Internal project sponsorship
  3. Project leadership
  4. Technical integration of the Data Catalog

—

Metadata management projects inevitably lead to multiple changes that impact the organization and/or the responsibilities of employees. Managerial changes will then be necessary, and they cannot happen without the initiative being backed at the highest levels.

A data catalog project cannot succeed without the internal support of management

In a metadata management initiative, some employees will inherit new responsibilities and new directives on top of their current duties. The initiative is often steered by a dedicated, transverse team that orchestrates the project and facilitates its execution. That said, the people being asked to contribute are often not actually managed by that team and work in another department.

Without managerial relays within the different teams and a common discourse, the initiative remains somewhat fragile. At the very first obstacle, it can even be scuppered because the necessary decisions were never officially taken.

The best approach will depend mostly on the internal organization of your company. It is strongly advisable however to write the objectives down in order to make them official, nudge the work of the contributors of the initiative in the right direction, and steer the results.

A data catalog project requires an initial investment above all

It is common to carry out a census of all the information at the start of a metadata management project before feeding the data catalog. This information usually comes from existing documentation, but can also come from colleagues who have their own insights into a specific element.

The first step is to centralize and secure this metadata by inputting it into the data catalog.

The catalog has to provide a simple way to centralize all this information and share it with everybody. As a connected data catalog, Zeenea provides various mechanisms to do this: it can automatically bring up metadata from master systems and free contributors from having to re-enter the information.

Moreover, connectivity serves another purpose: making sure the catalog is kept up to date and aligned with the master systems. This applies to the metadata that is automatically synchronized, but also to the metadata contributed by employees: by nature, an information system is alive. Data evolves, as does the associated documentation. Keeping the documentation up to date is therefore critical to ensure its freshness.

>> Discover Zeenea <<

The 10 Traps to Avoid for a Successful Data Catalog Project

To learn more about the traps to avoid when starting a data cataloging initiative, download our free eBook!

READ THE EBOOK
10 Traps To Avoid For A Successful Data Catalog Project Mockup

How does a data catalog help companies implement successful Data Stewardship programs?

How does a data catalog help companies implement successful Data Stewardship programs?

by Zeenea Software | Jul 6, 2022 | Data Catalog, Metadata Management

By implementing a data stewardship program in your organization, you ensure not only the quality of your data but also that it can be used easily and effectively by all your employees. As a key player in data governance and management, the Data Steward needs specific tools, the first of which is the data catalog. 

The role of data in companies is becoming increasingly strategic, and not just for large organizations! Indeed, to define business strategies, manage distribution, or organize production, the exploitation of data constitutes a major competitive advantage. To deliver its full potential, data must be reliable, of high quality, and perfectly organized. These characteristics are linked to a discipline: Data Stewardship. 

The Data Steward, also known as the Master of Data, acts as the guarantor of optimal data exploitation. How? By centralizing all data, regardless of its source, in an environment that is accessible to all business lines in a simple, intuitive, and operational manner. A Data Stewardship program is based on a rigorous methodology, a global vision of available data, and an ambition to rationalize data in order to develop a strong data culture. However, vision, understanding, and methodology do not exempt the Data Steward from relying on the right tools to accomplish their missions: a data catalog is one of the essential tools for a successful Data Stewardship project.  

 

A data catalog’s objectives

A data catalog exploits metadata – data about data – to create a searchable repository of all enterprise information assets. This metadata, collected from various data sources (Big Data, cloud services, Excel sheets, etc.), is automatically scanned so catalog users can search for their data and get information such as the availability, freshness, and quality of a data asset. A data catalog centralizes and unifies the collected metadata so it can be shared with IT teams and business functions. This unified view of data allows organizations to:

  • Sustain a data culture,
  • Accelerate data discovery,
  • Build agile data governance,
  • Maximize the value of data, 
  • Produce better and faster,
  • Ensure good control over data.

 

The benefits of a data catalog for Data Stewards

From importing new data sources to tracking information updates, a data catalog’s ability to automatically track and monitor metadata in real time allows Data Stewards to gain efficiency. A data catalog provides 360° visibility into your data, from its origin to all of its transformations over time. There are four key benefits to using a data catalog as part of a Data Stewardship program:

Benefit #1: Maintain up-to-date documentation

Your data is constantly in motion: it is collected, valued, exploited, enriched… To have a perfect understanding of your data assets, you need up-to-date documentation of your data sources and how they are used. A data catalog is designed to do just that.

Zeenea’s advantage: Our catalog automatically retrieves and collects metadata through our APIs and scanners to always ensure that your data is up-to-date. View your data’s origins and transformations over time with our smart lineage capabilities.

 

Benefit #2: Ensure data quality

The primary purpose of a data catalog is to keep a clear view of your data via metadata: definitions, structures, sources, uses, procedures to follow… By nature, metadata management through a data catalog helps guarantee the quality of your data.

Zeenea’s advantage: Our data catalog enables your Data Stewards to build flexible metamodel templates for predefined and custom item types. Simply drag & drop your properties, tags, and other fields into your documentation templates for all your catalog items.
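To illustrate the kind of structure such a template could take – the format and field names below are hypothetical and do not reflect Zeenea’s internal schema – a dataset documentation template might look like this:

```python
# Hypothetical metamodel template for a "dataset" item type, expressed as a
# plain dictionary: properties, tags, and responsibilities a Data Steward
# might assemble. Field names are illustrative only.
DATASET_TEMPLATE = {
    "item_type": "dataset",
    "properties": [
        {"name": "business_description", "type": "rich_text", "required": True},
        {"name": "refresh_frequency", "type": "select", "values": ["daily", "weekly", "monthly"]},
        {"name": "quality_score", "type": "number", "required": False},
    ],
    "tags": ["personal_data", "finance", "gdpr_relevant"],
    "responsibilities": ["data_owner", "data_steward"],
}

def missing_required_fields(documentation: dict, template: dict) -> list:
    """List required properties from the template that the documentation leaves empty."""
    required = [p["name"] for p in template["properties"] if p.get("required")]
    return [name for name in required if not documentation.get(name)]

print(missing_required_fields({"business_description": ""}, DATASET_TEMPLATE))  # ['business_description']
```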

 

Benefit #3: Comply with data regulations

Compliance with data regulations is a crucial issue in a Data Stewardship program. A data catalog, through its ability to organize data and centralize it in a clear, healthy, and readable environment, helps to comply with these regulatory requirements.

Zeenea’s advantage: Through machine learning capabilities, our Data Catalog speeds up time-consuming tasks by analyzing similarities between existing personal data. It provides smart recommendations by identifying and giving suggestions to tag personal data.

 

Benefit #4: Monitor data lifecycle

Between governance, quality, and security, your Data Stewardship project involves monitoring the lifecycle of your data in real time. The data catalog responds to this challenge by giving you the ability to monitor all activities affecting your data.

Zeenea’s advantage: our data catalog provides Data Stewards with a dashboard that tracks and monitors metadata activity. Check the completion levels of your documentation, the most frequently accessed and searched for catalog items, the connectivity status of your catalog, and get smart recommendations on the sensitivity level and additional properties to add to your fields.

Organization, knowledge, transparency, scalability… a data catalog is tailored to accompany your Data Stewardship project!

Start a Data Stewardship program with Zeenea

Zeenea Data Catalog provides a metadata management solution that enables Data Stewards to overcome the challenges associated with handling increasingly large volumes of data. Our solution helps organizations maximize the value of their data by reducing the time spent on complex and time-consuming documentation tasks, and by breaking data silos to increase enterprise data knowledge.

Contact us now for a free and personalized demonstration with one of our experts:

contact us

What makes a data catalog “smart”? #3 – Metadata Management

What makes a data catalog “smart”? #3 – Metadata Management

by Zeenea Software | Feb 16, 2022 | Data Catalog, Metadata Management

A data catalog harnesses enormous amounts of very diverse information – and its volume will grow exponentially. This will raise 2 major challenges: 

  • How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
  • How to find the most relevant datasets for any specific use case?

At Zeenea, we think that a data catalog should be smart in order to answer these 2 questions, with smart technological and conceptual features that go well beyond the mere integration of AI algorithms.

In this respect we have identified 5 areas in which a data catalog can be “Smart” – most of which do not involve machine learning:

  1. Metamodeling
  2. The data inventory
  3. Metadata management
  4. The search engine
  5. User experience

—

It is in the field of metadata management that the notion of the Smart Data Catalog is most commonly associated with algorithms, machine learning, and AI.

How is metadata management automated?

Metadata management is the discipline that consists of valuing – that is, assigning values to – the metamodel attributes for the inventoried assets. The workload required is usually proportional to the number of attributes in the metamodel and the number of assets in the catalog.

The role of the Smart Data Catalog is to automate this activity as much as possible, or at the very least to help the human operators (Data Stewards) do so in order to ensure greater productivity and reliability.

As seen in our last article, a smart connectivity layer automates part of the metadata, but this automation is restricted to a limited subset of the metamodel – mostly technical metadata. A complete metamodel, even a modest one, also has dozens of attributes that cannot be extracted from the source systems’ registries (because they are not there to begin with).

To solve this equation, several approaches are possible:

Pattern recognition

The most direct approach consists of identifying patterns in the catalog in order to suggest metadata values for new assets.

Put simply, a pattern will include all the metadata of an asset and the metadata of its relations with other assets or other catalog entities. Pattern recognition is typically done with the help of machine learning algorithms.

The difficulty in implementing this approach lies precisely in qualifying the information assets in a numerical form that can feed the algorithms and select the relevant patterns. A simple structural analysis is not enough: two datasets can contain identical data in different structures. Relying on the identity of the data isn’t efficient either: two datasets can contain identical information but with different values – for example, 2020 client invoicing in one dataset and 2021 client invoicing in the other.

In order to solve this problem, Zeenea relies on a technology called fingerprinting. To build the fingerprint, we extract 2 types of features from our clients’ data:

  • A group of features adapted to the numerical data (mostly statistical indicators)
  • Data emanating from word embedding models (word vectorization) for the textual data.

Fingerprinting is at the heart of our intelligent algorithms.
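To make the fingerprinting idea more tangible, here is a minimal sketch that derives statistical features for numeric columns and a vector representation for textual columns. Zeenea’s actual feature set is not public; the features below, and the hashing trick standing in for a real word-embedding model, are purely illustrative.

```python
# Minimal fingerprinting sketch: statistical indicators for numeric columns,
# a fixed-size vector for textual columns. Purely illustrative feature choices.
import numpy as np
import pandas as pd

def text_vector(values: pd.Series, dim: int = 16) -> np.ndarray:
    """Stand-in for word embeddings: hash tokens into a fixed-size vector."""
    vec = np.zeros(dim)
    for token in " ".join(values.astype(str)).lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def column_fingerprint(col: pd.Series) -> np.ndarray:
    if pd.api.types.is_numeric_dtype(col):
        # Statistical indicators for numeric data
        return np.array([col.mean(), col.std(), col.min(), col.max(), col.nunique()])
    return text_vector(col.dropna())

df = pd.DataFrame({"amount": [10.0, 12.5, 9.9], "label": ["invoice", "invoice", "credit note"]})
fingerprints = {name: column_fingerprint(df[name]) for name in df.columns}
print({name: fp.round(2) for name, fp in fingerprints.items()})
```

Two datasets holding the same kind of information (say, client invoicing for two different years) would then produce close fingerprints even though their values differ, which is exactly what the pattern-recognition step needs.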

The other embedded approaches in a suggestion engine

While pattern recognition is indeed an efficient approach for suggesting the metadata of a new asset in a catalog, it rests on an important prerequisite: in order to recognize a pattern, there has to be one to recognize. In other words, this only works if there are a number of assets in the catalog (which is obviously not the case at the start of a project).

And it’s precisely in these initial phases of a catalog project that the metadata management load is the highest. It is, therefore, crucial to include other approaches likely to help the Data Stewards in these initial phases, when a catalog is more or less empty…

The Zeenea suggestion engine, which provides intelligent algorithms to assist the management of the metadata, also provides other approaches (which we enrich regularly). 

Here are some of these approaches:

  • Structural similarity detection 
  • Fingerprint similarity detection
  • Name approximation 

This suggestion engine, which analyzes the catalog content in order to determine the probable values of the metadata from the assets that have been integrated, is an everlasting subject of experimentation. We regularly add new approaches, sometimes very simple and sometimes much more sophisticated. In our architecture, it is a dedicated service whose performances improve as the catalog grows and as we enrich our algorithms.
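As a rough illustration of two of the approaches listed above – fingerprint similarity and name approximation – the sketch below uses cosine similarity and a simple edit-distance-style ratio. These are simplified stand-ins, not the actual suggestion engine.

```python
# Simplified stand-ins for two suggestion approaches: fingerprint similarity
# (cosine) and name approximation (string ratio).
import numpy as np
from difflib import SequenceMatcher

def fingerprint_similarity(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Cosine similarity between two column fingerprints of the same length."""
    denom = np.linalg.norm(fp_a) * np.linalg.norm(fp_b)
    return float(fp_a @ fp_b / denom) if denom else 0.0

def name_approximation(name_a: str, name_b: str) -> float:
    """Rough name similarity, e.g. 'customer_id' vs 'CustomerID'."""
    normalize = lambda s: s.replace("_", "").lower()
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()

print(name_approximation("customer_id", "CustomerID"))                               # 1.0
print(round(fingerprint_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])), 2))  # 0.71
```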

Zeenea has chosen lead time as the main metric for measuring the productivity of Data Stewards (which is the ultimate objective of smart metadata management). Lead time is a notion that stems from lean management and which, in a data catalog context, measures the time elapsed between the moment an asset is inventoried and the moment all its metadata has been valued.
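A small illustration of the metric, with made-up timestamps:

```python
# Lead time as described above: delay between inventorying an asset and
# completing its metadata. Timestamps are invented for the illustration.
from datetime import datetime

inventoried_at = datetime(2022, 2, 1, 9, 0)
fully_documented_at = datetime(2022, 2, 4, 17, 30)

lead_time = fully_documented_at - inventoried_at
print(f"Lead time: {lead_time.days} days, {lead_time.seconds // 3600} hours")  # 3 days, 8 hours
```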

    post-wp-smart-data-catalog-en

    For more information on how Smart metadata management enhances a Data Catalog, download our eBook:

    “What is a Smart Data Catalog?”!

    Download the ebook

    Exploiting the value of Data Lineage in the organization: A user-centric approach

    Exploiting the value of Data Lineage in the organization: A user-centric approach

    by Zeenea Software | Nov 14, 2021 | Data Catalog

    In our previous article, we broke down Data Lineage by presenting the different lineage typologies (physical layer, business layer and semantic layer) and the different levels of granularity (values, fields, datasets, application). 

    In this article, we will present our matrix to help you concentrate your efforts and resources where the value of Data Lineage is strongest for your different teams.

     

    Our business centered matrix

    To fully understand Zeenea’s approach to data lineage, which is centered on the business teams within the company, please read our article on the breakdown of data lineage.

     

    The different business profiles in the organization

    We have categorized the populations who wish to leverage the value of Data Lineage in an organization into 4 broad categories:

    • IT: The engineers and architects responsible for developing and maintaining the infrastructure, flows and data applications.
    • Analytics: The teams in charge of analyzing data, building indicators, dashboards, reports, etc.
    • Business: All the people in charge of conceiving and working on the uses and functional applications around the data – project managers, product managers, business analysts, etc.
    • Compliance: The teams responsible for regulatory compliance, security, internal control, etc.

     

    Added value of Data Lineage according to the business profile

    The following matrix summarizes the value added of Data Lineage for the different combinations of typology, granularity and business profile.

    data-lineage-layers-EN-zeenea

     

    This matrix – bearing in mind that the lineage of an upper level can be deduced from the lineage of the level below – could tempt one to set lineage management at the field level as the objective: it is at this level that the benefits are the most obvious, and from it that lineage can be produced automatically for the levels above.

    Of course, things are not that simple!

    While there are many benefits to field-to-field lineage, it has one major drawback: its cost. Whatever lineage layer is being considered, the production and maintenance cost will depend mainly on two variables: the volume (the number of objects taken into account and the number of links between them) and the ability to automate the retrieval and updating of this information.

    On both these aspects, field-to-field lineage clearly presents the most unfavorable profile…

    The limits of field-to-field lineage: huge volumes of information

    Concerning the volume, it is easy to understand that the number of materialized fields in an information system, even a modest-sized one, easily reaches tens of thousands, if not hundreds of thousands or even millions. Maintaining the lineage information manually on such a volume of objects is not feasible. The only feasible solution is therefore automation on a large scale.

    Limited automation capabilities

    In theory, field-to-field technical lineage can be automated by inspecting the different processing stages, from the initial capture of the data to its final uses. In practice, this automation comes up against the very great heterogeneity of data integration and processing solutions. Some vendors offer solutions to perform these operations.

    We confess: we don’t believe in those solutions, and for two reasons. First, reverse engineering is a delicate operation and its reliability cannot be 100% guaranteed. And secondly, the range of solutions and languages used in data pipelines is too vast, and the constant innovation in this field makes it difficult for a commercial solution to guarantee full coverage of all the technologies implemented in a given environment.

    Field-by-field granularity is attractive, but out of reach in practice.

    Our approach for optimized Data Lineage

    The pivot: the physical layer at the dataset level

    If we go back to the matrix presented above, it appears that the value of lineage at the dataset level is very close to that of field-to-field lineage.

    For IT, business and analytical profiles, the value is in most cases very similar. The main difference arises with compliance. For most standards, the lineage documentation requirement relates to fields. But compliance does not apply to all data in the organization, only those that are considered critical data elements (CDE).

    There are different types of CDEs – personal data, sensitive data, risk data, etc. But they have the advantage of constituting only a minute percentage of all the data, often a few dozen or a few hundred fields whose downstream or upstream lineage must be provided.

    Going forward, here is the general approach we favor for the physical layer (a short code sketch follows the list):

    • Focus the effort on the lineage at the dataset level and strive for the most advanced automation possible.

    • Associate datasets (and other physical objects on the same level) with the applications to which they are attached. This operation is generally easy to automate, globally stable over time, and can, at worst, be managed manually in the catalog.

    • Fill in locally with field-to-field lineage focusing on the CDEs – this can be automated (if possible), but can also rely on periodic review processes which are commonplace in regulatory frameworks.
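The sketch below illustrates this approach with made-up dataset names: dataset-level lineage forms the backbone, and field-to-field links are recorded only for critical data elements.

```python
# Illustrative structures only: dataset-level lineage as the backbone, with
# field-to-field links kept solely for critical data elements (CDEs).
dataset_lineage = {
    # downstream dataset -> upstream datasets it is derived from
    "dwh.customer_360": ["crm.contacts", "billing.invoices"],
    "reporting.churn_dashboard": ["dwh.customer_360"],
}

cde_field_lineage = {
    # field-level links maintained only for CDEs such as personal data
    "dwh.customer_360.email": ["crm.contacts.email_address"],
}

def upstream_datasets(dataset: str, lineage: dict) -> set:
    """Walk the dataset-level graph to list every upstream source."""
    seen, stack = set(), [dataset]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(upstream_datasets("reporting.churn_dashboard", dataset_lineage))
# {'dwh.customer_360', 'crm.contacts', 'billing.invoices'}
```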

    Business and semantic layers of lineage

    As for the other layers (business and semantic), the approach is significantly different. Indeed, in this area, automation is hardly possible. Therefore: business lineage and semantic lineage will probably have to be managed manually.

    For business lineage, I propose a top-down approach. This means that the first task should be devoted to defining the business lineage at the application level. The datasets and fields contained in the applications will inherit this business lineage. We should also be able to define the business lineage at a finer level, but only when a use case justifies it.

    For the semantic layer, things are a little different. Indeed, a specific effort is necessary to build the glossary. This (modeling) effort will be more or less important depending on the size of your data landscape, and the prior existence of models that can be imported or integrated into the catalog. 

    The natural anchor point of the semantic model on the physical layer of the lineage is at the field level. But again, automation is impractical – you probably don’t have a system that systematically references the meaning of each field in all your systems.

    The association between the fields of the physical layer and the definitions of the semantic layer will therefore have to be done manually, which again represents a time-consuming task if you want to do it thoroughly.

    Conclusion

    Data Lineage is a complex concept, which can be broken down into several layers (physical, business, and semantic) and several levels of granularity (value, field, dataset, application).

    The value of the lineage can also be represented in the form of a matrix that is very dependent on use cases, and the populations that exploit it. The cost of production and maintenance of lineage information is a function of the automation capacity and the volume of objects at the level considered.

    To learn more about Data Lineage best practices, download our eBook: All you’ve ever wanted to know about Data Lineage!

    data-lineage-white-paper-mockup-en
    Download

    The Data catalog: an essential solution for metadata management

    The Data catalog: an essential solution for metadata management

    by Zeenea Software | Sep 6, 2021 | Data Catalog, Metadata Management

    Does your company produce or use more and more data? To better classify, manage, and give meaning to your data, there must be order. By putting in place rigorous metadata management, with the help of a data catalog, you gain relevance and efficiency.

    Companies are producing more and more data. To the point where processing and exploitation capacities can be undermined, not because of a lack of knowledge, but rather because of a lack of organization. When data volumes explode, data management becomes more complex. 

    To put it all in order, metadata management becomes a central issue. 

     

    What is metadata and how to manage it?

    Metadata is used to describe the information contained in data: source, type, time, date, size, … The range of metadata that can be attached to data is vast.

    Without metadata, your data is decontextualized: it loses its meaning and becomes difficult to classify, order, and value. And because metadata is so plentiful and disparate, you must be able to master this mountain of information.

    Metadata management is becoming an essential practice to ensure that it is up-to-date, accurate and accessible. To meet the challenge of optimal metadata management, it is essential to rely on a Data Catalog. 

     

    Data Catalog: What is it for?

    A data catalog is a bit like the index of a gigantic encyclopedia. Because the data you collect and manage on a daily basis is diverse by nature, it must be classified and clearly identified. Otherwise, your data portfolio would become an unfathomable mess from which you would not derive any added value.

    At Zeenea, we define a data catalog as:

    A detailed inventory of all of an organization’s data assets and their metadata, designed to help data professionals quickly find the most appropriate information for any business and analytical purpose.

     

    A Data Catalog is a pillar of metadata management through the following features:

     

    Data Dictionary

    Each piece of data collected or used is described in such a way that it can be put into perspective with others. This metadata thesaurus is a pillar of efficient and pragmatic exploitation of your data catalog. By referencing all of your company’s data in a Data Dictionary, the Data Catalog helps optimize accessibility to information even if the user does not have access to the software concerned. 

    Metadata Registry

    A dynamic metadata repository intervenes at every level, from the dataset down to the data itself. For each element, this metadata registry can include a business and technical description, give you information on its owners, carry quality indicators, or even help create a taxonomy (properties, tags, etc.) for your items.
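As an illustration – the field names are invented, not a product schema – a single registry entry could carry the information described above:

```python
# Illustrative metadata registry entry for one dataset: business and technical
# descriptions, owners, quality indicators, and taxonomy. Field names are made up.
registry_entry = {
    "asset": "sales.orders_2021",
    "business_description": "All confirmed customer orders for fiscal year 2021.",
    "technical_description": "Parquet files partitioned by order_date, refreshed nightly.",
    "owners": {"data_owner": "head-of-sales", "data_steward": "jane.doe"},
    "quality": {"completeness": 0.98, "last_checked": "2021-09-01"},
    "tags": ["sales", "finance", "gdpr_relevant"],
}
```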

    The Data Search Engine

    Your data catalog will allow you to access your data through its integrated search features. All the metadata entered in the registry can be searched from the data catalog search engine. Searches can be sorted and filtered at all levels.

     

    Data Catalog and Metadata: the two pillars of data excellence!

    There is no point in opposing the data catalog and the concept of metadata management: they simply go hand in hand.

    A Data Catalog is an indispensable repository for standardizing all the metadata that is likely to be shared in your company. This repository contributes to a detailed understanding and documentation of all your data assets.

    But beware! The integration of a data catalog is a project that requires rigor and method. To begin this project and unleash your data potential, start by conducting a complete audit of your data and proceed in an iterative manner. 

    Download your free metamodel template!

    metamodel template toolkit
    download

    The Data Catalog is a major lever to reinforce the management of your company’s metadata.

    It will absolutely guarantee the proper use of your data!

    Marquez: the metadata discovery solution at WeWork

    Marquez: the metadata discovery solution at WeWork

    by Zeenea Software | Dec 10, 2020 | Data Inspiration, Metadata Management

    Created in 2010, WeWork is a global office and workspace leasing company. Its objective is to provide space for teams of any size – startups, SMEs, and major corporations – to collaborate. To achieve this, WeWork’s offering can be broken down into three categories:

     

      • Space: To provide companies with optimal space, WeWork must supply the appropriate infrastructure, from booking rooms for interviews or one-on-ones to entire buildings for large corporations. It must also make sure spaces are equipped with the appropriate facilities, such as kitchens for lunch and coffee breaks, bathrooms, etc.
      • Community: Via WeWork’s internal application, members can connect with one another, whether locally within their own WeWork space or globally. For example, if a company needs feedback on a project from specific profiles (such as a developer or UX designer), it can ask any member directly via the application, regardless of their location.
      • Services: WeWork also provides their members with full IT services if there are any problems as well as other services such as payroll services, utility services, etc.

    In 2020, WeWork represents:

    • More than 600,000 memberships,
    • Locations in 127 cities across 33 countries,
    • 850 offices worldwide,
    • $1.82 billion in revenue generated.

    It is clear that WeWork works with all sorts of data about its staff and customers, whether individuals or companies. The firm therefore needed a platform where its data experts could view, collect, aggregate, and visualize the metadata of its data ecosystem. This need led to the creation of Marquez.

    This article will focus on WeWork’s implementation of Marquez mainly through free & accessible documentation provided on various websites, to illustrate the importance of having an enterprise-wide metadata platform in order to truly become data-driven.  

     

    Why manage & utilize metadata?  

    In his talk “A Metadata Service for Data Abstraction, Data Lineage & Event-based Triggers” at the Data Council back in 2018, Willy Lulciuc, Software Engineer for the Marquez project at WeWork explained that metadata is crucial for three reasons:

    Ensuring data quality: when data has no context, it is hard for data citizens to trust their data assets: are there fields missing? Is the documentation up to date? Who is the data owner and are they still the owner? These questions are answered through the use of metadata.

    Understanding Data lineage: knowing your data’s origins and transformations are key to being able to truly know what stages your data went through over time.

    Democratization of datasets: According to Willy Lulciuc, democratizing data in the enterprise is critical! Having a central portal or UI available for users to be able to search for and explore their datasets is one of the most important ways companies can truly create a self-service data culture. 

    marquez-why-manage-and-utilize-metadata

    To sum up: creating a healthy data ecosystem! Willy explains that being able to manage and utilize metadata creates a sustainable data culture where individuals no longer need to ask for help to find and work with the data they need. In his slide, he goes through three different categories that make up a healthy data ecosystem:

    1. Being a self-service ecosystem, where data and business users can discover the data and metadata they need and explore the enterprise’s data assets when they don’t know exactly what they are searching for. Providing data with context gives all users and data citizens the ability to work effectively on their data use cases.
    2. Being self-sufficient, by giving data users the freedom to experiment with their datasets and the flexibility to work on every aspect of them, whether input or output datasets, for example.
    3. And finally, being accountable: instead of relying on certain individuals or groups, a healthy data ecosystem allows all employees to take responsibility for their own data. Each user is responsible for knowing their data and its costs (is this data producing enough value?), and for keeping their data’s documentation up to date in order to build trust around their datasets.
    marquez-a-healthy-data-ecosystem

    Room booking pipeline before

    As mentioned above, utilizing metadata is crucial for data users to be able to find the data they need. In his presentation, Willy shared a real situation to prove metadata is essential: WeWork’s data pipeline for booking a room. 

     For a “WeWorker”, the steps are as follows:

    1. Find a location (the example was a building complex in San Francisco)
    2. Choose the appropriate room size (usually based on the number of attendees – in this case, a room that could seat 1 to 4 people)
    3. Choose the date for when the booking will take place
    4. Decide on the time slot the room is booked for as well as the duration of the meeting
    5. Confirm the booking

    Now that we have an example of how the booking pipeline works, Willy demonstrates how a typical data team would operate when pulling data on WeWork’s bookings. In this case, the exercise was to find the building that held the most room bookings and extract that data to send over to management. The steps he stated were the following (a rough code sketch of such a job follows the list):

    • Read the room bookings from a data source (usually unknown), 
    • Sum up all of the room bookings and return the top locations, 
    • Once the top location is calculated, the next step is to write it into some output data source,
    • Run the job once an hour,
    • Process the data through .csv files and store it somewhere.
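A sketch of such a job, with made-up file and column names (the hourly run would be handled by an external scheduler):

```python
# Sketch of the job described above: read room bookings, find the building
# with the most bookings, write the result out. File names and columns are
# invented for the illustration.
import pandas as pd

def top_booking_location(input_csv: str, output_csv: str) -> str:
    bookings = pd.read_csv(input_csv)  # assumed columns: booking_id, building, room, start_time
    counts = bookings.groupby("building").size().sort_values(ascending=False)
    counts.head(1).to_csv(output_csv, header=["bookings"])
    return counts.index[0]

# top_booking_location("room_bookings.csv", "top_location.csv")
```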

    However, Willy noted that even though these steps seem good enough, problems usually occur along the way. He goes over three types of issues encountered during the job process:

    1. Where can I find the job input’s dataset?
    2. Does the dataset have an owner? Who is it? 
    3. How often is the dataset updated? 

    Most of these questions are difficult to answer, and jobs end up failing… Without being sure of and trusting this information, it can be hard to present numbers to management! These sorts of problems are what led WeWork to develop Marquez.

    What is Marquez?

    Willy defines the platform as an “open-sourced solution for the aggregation, collection, and visualization of metadata of [WeWork’s] data ecosystem”. Indeed, Marquez is a modular system and was designed as a highly scalable, highly extensible platform-agnostic solution for metadata management. It consists of the following components:

    Metadata Repository: Stores all job and dataset metadata, including a complete history of job runs and job-level statistics (i.e. total runs, average runtimes, success/failures, etc).

    Metadata API: RESTful API enabling a diverse set of clients to begin collecting metadata around dataset production and consumption.

    Metadata UI: Used for dataset discovery, connecting multiple datasets and exploring their dependency graph.

    Marquez’s design

    Marquez provides language-specific clients that implement the Metadata API. This enables a  diverse set of data processing applications to build a metadata collection. In their initial release, they provided support for both Java and Python. 

    The Metadata API extracts information around the production and consumption of datasets. It’s a stateless layer responsible for specifying both metadata persistence and aggregation. The API allows clients to collect and/or obtain dataset information to/from the Metadata Repository.
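As an illustration, collecting dataset metadata through a Marquez-style REST API from Python could look like the sketch below. The endpoint path and payload shape are assumptions made for the example and should be checked against the actual Marquez API documentation.

```python
# Illustrative only: registering a dataset against a Marquez-style Metadata API.
# The endpoint path and payload shape are assumptions, not a guaranteed spec.
import requests

MARQUEZ_URL = "http://localhost:5000/api/v1"  # assumed local Marquez instance

def register_dataset(namespace: str, dataset: str, fields: list) -> dict:
    payload = {
        "type": "DB_TABLE",
        "physicalName": dataset,
        "sourceName": "analytics_db",
        "fields": fields,
        "description": "Room bookings per building, refreshed hourly.",
    }
    response = requests.put(f"{MARQUEZ_URL}/namespaces/{namespace}/datasets/{dataset}", json=payload)
    response.raise_for_status()
    return response.json()

# register_dataset("wework", "room_bookings", [{"name": "building", "type": "VARCHAR"}])
```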

    Metadata needs to be collected, organized, and stored in a way to allow for rich exploratory queries via the Metadata UI. The Metadata Repository serves as a catalog of dataset information encapsulated and cleanly abstracted away by the Metadata API.

    According to Willy, what makes a very strong data ecosystem is the ability to search for information and datasets. Datasets in Marquez are indexed and ranked by a search engine, based on a keyword or phrase as well as on the documentation of a dataset: the more context a dataset has, the more likely it is to appear first in the search results. Examples of a dataset’s documentation are its description, owner, schema, tags, etc.

    You can see more detail of Marquez’s data model in the presentation itself here → https://www.youtube.com/watch?v=dRaRKob-lRQ&ab_channel=DataCouncil 

    marquez-data-model

    The future of data management at WeWork

    Two years after the project started, Marquez has proven to be a big help for the leasing giant. Their long-term roadmap focuses on the solution’s UI, with more visualizations and graphical representations to provide simpler and more engaging ways for users to interact with their data.

    They also provide various online communities via their Github page, as well as groups on LinkedIn for those who are interested in Marquez to ask questions, get advice or even report issues on the current Marquez version. 

    Sources

    A Metadata Service for Data Abstraction, Data Lineage & Event-based Triggers, WeWork. Youtube: https://www.youtube.com/watch?v=dRaRKob-lRQ&ab_channel=DataCouncil

    29 Stunning WeWork Statistics – The New Era Of Coworking, TechJury.com: https://techjury.net/blog/wework-statistics/

    Marquez: Collect, aggregate, and visualize a data ecosystem’s metadata, https://marquezproject.github.io/marquez/

    Marquez: An Open Source Metadata Service for ML Platforms Willy Lulciuc

    What is Data Literacy? Tips on becoming data literate.

    What is Data Literacy? Tips on becoming data literate.

    by Zeenea Software | Oct 28, 2020 | Metadata Management

    Data literacy has been a trending topic for a few years, and it is known that it is a vital skill for enterprises seeking to fully transform their organizations and become data-driven. 

    While technology can be a point of failure if not handled properly, it is often not the most important roadblock to progress. In fact, in Gartner’s annual Chief Data Officer survey, the top roadblocks to success were cultural factors – people, skills, and data literacy.

    However, many of these enterprises still struggle to understand what data literacy truly is, or how to reshape their culture into a data-literate one.

    In its 2020 survey, New Vantage Partners observed that:

    “Companies continue to focus on the supply side for data and technology, instead of increasing demand for them by business executives and employees. It’s a technology push rather than a pull from humans who want to make more data-based decisions, develop more intelligent business processes, or embed data and analytics into more products and services.”

    In this article, we’d like to shed light on what data literacy is, why it is important for your enterprise, and tips on how to become a data literate organization.  

    The definition of data literacy

    Just as literacy means having “the ability to read for knowledge, write coherently and think critically about printed material”, data literacy is the ability to consume data for knowledge, produce it coherently, and think critically about it.

    In 2019, Gartner defined data literacy as: “the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied — and the ability to describe the use case, application and resulting value.”

    So, based on these definitions, we can conclude that data literate people can, among other things:

    • make analyses using data, 
    • use data to communicate ideas for new services, products, workflows or even strategies,
    • understand dashboards (visualizations for example),
    • make data-based decisions rather than based on intuition

    In summary, being data literate signifies having the set of skills to be able to effectively use data individually and collaboratively. 

    Why is data literacy important?

    Gartner expects that, by 2020, 80% of organizations will initiate deliberate competency development in the field of data literacy to overcome extreme deficiencies. By 2020, 50% of organizations will lack sufficient AI and data literacy skills to achieve business value.

    The increasing volume and variety of data that businesses are flooded with on a daily basis require employees to employ higher order skills such as critical thinking, problem-solving, computational, and analytical thinking using data. And as organizations become more data-driven, poor data literacy will become an inhibitor to growth. In fact, in their survey “The Human Impact of Data Literacy”, Accenture found that: 

    • 75% of employees are uncomfortable when working with data.
    • 1/3 of employees have taken a sick day from work due to headaches working with data.
    • A lack of data literacy costs employers five days of lost productivity per employee each year, translating to billions of dollars.

     Furthermore, a Deloitte survey conducted in 2019 found that 67% of executives are not comfortable accessing or using data resources.

    Data uplifts organizations’ success in creating both physical and digital business opportunities – improving accuracy, increasing efficiency, and augmenting the workforce’s ability to deliver greater value. It is therefore essential to be able to interpret, analyze, and communicate findings on data in order to uncover the secrets of business success and competitive advantage.

    Tips on how to become data literate

    In order to build a successful data literacy program, here are some tips to help your organization on your data fluency journey:

    Tip #1 – Develop a data literacy vision and associated goals

    Any organization investing in data and AI capabilities should have already undertaken the creation of a data vision and  roadmap. In the process of doing so, data and IT leaders will have identified and prioritized the areas of business where data can produce value.

    These steps are critical to creating a data-literate organization and reducing the friction around understanding and using data.

    Management and HR need to communicate across the entire enterprise that data is a strategic asset that creates value. Using the data vision and roadmap as context, they should be able to explain to all employees why data matters, how it creates value, and how it impacts the business.

    The absence of a clear vision for data, and of a plan to create value from it, will create frustration; as a consequence, employees will not understand why they are being asked to make an effort and will lack the motivation to do so.

    In addition, a data literacy vision should detail desirable skills, abilities, and the level of literacy required for different business units and roles.

    Business, IT, and HR leaders need to create a framework to achieve literacy goals, measure progress, and create a way to maintain data literacy.  This includes deciding what skills are required, how to measure & track skills development, and to what degree different parts of the organization should use data in achieving their strategic objectives.

    Tip #2 – Assess workforce skills

    Data literacy skills should ideally be assessed during the recruitment process for new hires.  In this way, HR will already know what kind of data literacy learning should be offered to the new hire over time.

    However, for already existing employees, HR can map current employee data skills based on the roles and responsibilities provided in the above steps, and determine where there are gaps. 

    Tip #3 – Create data literacy modules

    According to Qlik, only 34% of firms provide data literacy training.

    In most cases, the HR department is responsible for helping business managers identify and track areas of improvement and development opportunities for employees. They are also in charge of organizing the procedures for learning specific organizational skills as well as the time it takes. It’s no different when it comes to becoming data literate.

    Once HR and managers have a general idea of an employee’s or a business unit’s strengths and weaknesses in data skills, HR can begin to construct personalized and efficient learning programs that allow employees to upskill in data and analytics responsibilities.

     

    Tip #4 – Track, measure, and repeat

    A successful data literacy program takes time to put in place. Business leaders must allow their employees to invest the time required to become data literate and improve their skills.  Over time, data thinking will become part of the corporate culture.

    Finally, it’s important to communicate data literacy progress across the enterprise and on an individual basis. Tracking and communicating on the progress is key to continuing the evaluation of your organization’s data roadmap, vision and literacy.

    This type of long-term planning and investment in educating the entire organization about how to access, understand and analyze data on the job will accelerate the efforts and investment that data science, machine learning and AI teams are making.

    The results of data literacy efforts will allow organizations to finally be able to embrace and leverage data across the enterprise and for maximum value! 

    What is data preparation?

    What is data preparation?

    by Zeenea Software | Jul 20, 2020 | Metadata Management

    When talking about data management, we often speak of the term “data preparation”.

    According to SearchBusinessAnalytics, data preparation is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics and machine learning applications. In other words, it is the process of cleaning and transforming raw data prior to analysis.

    Data preparation is often a lengthy process for data and business users, but it is nevertheless essential in order to give context to data and turn it into valuable business insights. In 2016, Forbes reported that 76% of data scientists considered data preparation the worst part of their jobs! Yet accurate business decisions can only be made from the analysis of clean data.

     

    How data preparation works

    Data preparation is an essential part of many enterprise applications maintained by IT, such as data warehousing or business intelligence. It is also a practice conducted by the business for ad hoc reporting and analytics, with IT and tech-savvy business users, such as data scientists, routinely burdened by requests for customized data preparation. 

    These days there’s growing interest in empowering business users with self-service tools for data preparation – so they can access and manipulate data sources on their own, without technical proficiency. 

    The steps for data preparation are the following:

     

    Step 1: Access and gather data

    The first step in data preparation is being able to access data from any source, no matter its origin, nature, or format. The optimal way to give enterprise-wide access to data is to implement a data catalog solution. This essential tool is the key to starting your data preparation journey.

    >> For more information on Zeenea Data Catalog <<

     

    Step 2: Discover data

    After accessing and gathering data, the next step is to discover it. Data discovery allows enterprises to adequately assess the full data picture. It helps all employees understand their data and its context through metadata. It is also very useful for enterprises seeking better compliance management, as it lets organizations know what data is personal or sensitive and where it can be found. In addition, data discovery can bolster innovation by unlocking information essential for satisfying customers and gaining a competitive advantage.

     

    Step 3: Cleanse data

    Traditionally the most time-consuming part of data preparation, cleansing is nevertheless one of the most important tasks, as it removes bad data: outdated data, duplicates, unreliable values, etc. Cleansing data therefore includes tedious tasks such as filling in missing information, flagging data as private or sensitive, adding descriptions, and standardizing data patterns.
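As a small illustration of these cleansing tasks – with made-up column names and rules – a pandas-based cleanup could look like this:

```python
# Small sketch of typical cleansing steps: deduplication, handling missing
# values, standardizing patterns. Column names and rules are invented.
import pandas as pd

def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.drop_duplicates()
    df = df.dropna(subset=["customer_id"])                       # drop rows missing the key
    df["country"] = df["country"].fillna("unknown").str.upper()  # fill and standardize
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    return df

raw = pd.DataFrame({
    "customer_id": [1, 1, None, 3],
    "country": ["fr", "fr", "de", None],
    "signup_date": ["2020-01-15", "2020-01-15", "bad value", "2020-03-02"],
})
print(cleanse(raw))
```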

     

    Step 4: Enrich data

    After cleansing all the data, it is time to start transforming and enriching the data. This step includes connecting your data with other related data sources to provide deeper insights. A data catalog is also an important part of this step in data preparation. 

    >> More information on Zeenea’s connectors <<

     

    Step 5: Store data

    The last step in data preparation is to store data. By correctly storing your enterprise data, this enables data teams to be able to use fresh, clean data for their analysis. 

     

    The Future of Data Preparation

    Initially focused on analytics, data preparation has evolved to address a much broader set of use cases and can be used by a larger range of users.

    Although it improves the personal productivity of whoever uses it, it has evolved into an enterprise tool that fosters collaboration between IT professionals, data experts, and business users.

    Gartner’s top Data & Analytics trends in 2020

    Gartner’s top Data & Analytics trends in 2020

    by Zeenea Software | Jul 16, 2020 | Data Inspiration, Metadata Management

    The recent global pandemic has left many organizations in an uncertain and fragile state. It is therefore a fundamental requirement for enterprises to keep pace with data and analytics trends in order to bounce back from the crisis and gain competitive advantage.

    From crisis to opportunity, the role of data and analytics is expanding and becoming more strategic and critical. Society in general is becoming more digital, complex, and global, with ever-growing competition and emancipated customers. Massive disruption, crisis, and the ensuing economic downturn are forcing companies to respond to previously unimaginable demands: optimize resources, reinvent processes, and rethink products, business models, and even their very purpose.

    It is therefore obvious that Data & Analytics is central for enterprises navigating their way out of the devastating effects of this crisis; however, the lack of trust in and access to data has never been a greater challenge.

    Success at scale for maximum business impact with data & analytics depends more than ever on building a foundation of trust, security, governance, and accountability.

In this article, we share the current Data & Analytics trends to help your business thrive:

     

    #1 – The use of new AI techniques

By the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving a 5X increase in streaming data and analytics infrastructures.

Within the current context, AI techniques such as machine learning, optimization, and natural language processing are providing vital insights and predictions about the spread of the virus and the effectiveness and impact of countermeasures. With the wider commercial use of AI, organizations are discovering new and smarter techniques, including reinforcement learning and distributed learning, interpretable systems, and efficient infrastructures, to handle their own complex business situations.

     

#2 – Fewer Dashboards

    By 2025, data stories will be the most widespread way of consuming analytics, and 75% of stories will be automatically generated using augmented analytics techniques.

Today, business employees struggle to know which insights to act on because business intelligence platforms are not contextualized, easily interpretable, or actionable by the majority of users. Visual analytics and exploration will be replaced by more automated and customized experiences in the form of dynamic data stories. As a result of this shift to more dynamic, in-context data stories, the percentage of time spent on predefined dashboards will decline!

     

    #3 – Decision intelligence

    By 2023, more than 33% of large organizations will have analysts practicing decision intelligence including decision modeling.

A brief definition of decision intelligence is that it is a practical domain that frames a wide range of decision-making techniques and integrates them into all critical parts of people, processes, and technologies. It provides a framework that brings traditional and advanced disciplines together to design, model, execute, and monitor decision models and processes in the context of business outcomes.

    The use of intelligent decision making will bring together decision management and techniques such as descriptive, diagnostic, predictive and prescriptive analytics.

     

    #4 – Augmented Data Management: Metadata is the new black

    By 2023, organizations utilizing active metadata, machine learning and data fabrics to dynamically connect, optimize and automate data management processes will reduce time to integrated data delivery by 30%.

The combination of colossal data volume, data trust issues, and an ever-increasing diversity of data formats is accelerating the demand for automated data management. In response, metadata analytics offers a new way to augment data management tasks. It is no secret that organizations need to easily know what data they have, what it means, how it delivers value, and whether it can be trusted. Metadata will emerge from a passive state to a highly active utilization state. Active utilization leverages cataloging and automatic data discovery by interpreting use cases, and implies a taxonomy and ontology that are crucial to data management.

Through an augmented data catalog, users can improve data inventorying efforts by automating the otherwise cumbersome tasks of finding, tagging, annotating, and sharing metadata.

     

    #5 – Moving to the Cloud

    By 2022, public cloud services will be essential for 90% of data and analytics innovation.

As Data Management accelerates its journey to the cloud, so will data & analytics disciplines. Cloud environments enable a more agile, fluid, and diverse ecosystem that accelerates innovation in response to changing business needs in ways not readily available in on-premises solutions. They also provide opportunities for cost optimization. Expect offerings that are “cloud first” today to eventually become “cloud only”.

     

    Gartner clients can read more in the report “Top 10 Trends in Data and Analytics, 2020.”

    What is data discovery?

    by Zeenea Software | Jul 3, 2020 | Metadata Management

    In this age where data is all around us, organizations have increasingly been investing in data management strategies in order to create value and gain competitive advantage. However, according to a study conducted by Gemalto in 2018, it was found that 65% of organizations can’t analyze or categorize all the consumer data they store.

It is therefore crucial for enterprises to look for solutions that help them extract the value of their data from metrics, insights, and information by facilitating their data discovery journey.

    Data discovery definition

Data discovery problems are everywhere in the enterprise, whether in the IT, Business Intelligence, or Innovation department. By integrating data discovery solutions, enterprises provide data access to all employees, enabling data teams and business analysts to understand, and thus collaborate on, data-related topics.

    It is also very useful for enterprises seeking better compliance management. It allows organizations to know what data is personal/sensitive and where it can be found. In addition, data discovery can bolster innovation, as it unblocks essential information for satisfying customers and gaining competitive advantage.

    From Manual to Smart Data Discovery

For 20 years, before advanced machine learning techniques, data specialists mapped their data using the sole brain power of humans! They thought critically about what data they had, where it was stored, and what needed to be provided to the end customer. Data Stewards usually took care of the data asset documentation rules and standards that guided the data discovery process. In these manual approaches, usually done using Excel sheets, people conceptualized and drew out maps to comprehend their data’s organization.

    Nowadays, with the advancement of technology, the definition of data discovery includes automated ways of presenting data. Smart Data Discovery represents a new wave of data technologies that use augmented analytics, Machine Learning and Artificial Intelligence. It not only prepares, conceptualizes and integrates data, but also presents it through intelligent dashboards to reveal hidden patterns and business insights.

    The benefits of data discovery

Enterprise data moves from one location to another at the speed of light and is stored in various data sources and storage applications. Employees and partners access this data from anywhere, at any time, so identifying, locating, and classifying your data in order to protect it and gain insights from it should be the priority!

    The benefits of data discovery include:

    • A better understanding of enterprise data, where it is, who can access it and where, and how it will be transmitted,
    • Automatic data classification based on context,
    • Risk management and regulatory compliance,
    • Complete data visibility,
    • Identification, classification, and tracking of sensitive data,
    • The ability to apply protective controls to data in real time based on predefined policies and contextual factors

    Data discovery enables enterprises to adequately assess the full data picture.

On one hand, it helps implement the appropriate security measures to prevent the loss of sensitive data and avoid devastating financial and reputational consequences for the enterprise. On the other, it enables teams to dig deeper into the data to identify the specific items that reveal answers and find ways to present them. It’s a win-win situation!

    Learn more about data discovery in our white paper: “Data Discovery through the eyes of Tech Giants”

    Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

    download our white paper

    Data science: accelerate your data lake initiatives with metadata

    by Zeenea Software | Jun 15, 2020 | Data Catalog, Metadata Management

Data lakes offer unlimited storage for data and present lots of potential benefits for data scientists in the exploration and creation of new analytical models. However, structured, unstructured, and semi-structured data are mashed together in them, and the business insights they contain are often overlooked or misunderstood by data users.

    The reason for this is that many technologies used to implement data lakes lack the necessary information capabilities that organizations usually take for granted. It is therefore necessary for these enterprises to manage their data lakes by putting in place effective metadata management which considers metadata discovery, data cataloguing, and overall enterprise metadata management applied to the company’s data lake.

    2020 is the year that most data and analytics use cases will require connecting to distributed data sources, leading enterprises to double their investments in metadata management. – Gartner 2019.

    How to leverage your data lake with metadata management

    To get value from your data lake, it is essential for companies to have both skilled users (such as data scientists or citizen data scientists) and effective metadata management for their data science initiatives. To begin with, an organization could focus on a specific dataset and its related metadata. Then, leverage this metadata as more data is added into the data lake. Setting up metadata management can make it easier for data lake users to initiate this task.

    Here are the areas of focus for successful metadata management in your data lake:

     

    Creating a metadata repository

Semantic tagging is essential for discovering enterprise metadata. Metadata discovery is defined as the process of using solutions to discover the semantics of data elements in datasets. This process usually results in a set of mappings between different data elements in a centralized metadata repository. This allows data science users to understand their data and have visibility on whether or not it is clean, up-to-date, trustworthy, etc.
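As a simple illustration of the idea (not Zeenea’s data model), the sketch below shows how discovered semantic mappings could be recorded in a centralized, in-memory repository; all field and dataset names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    dataset: str         # physical dataset the element belongs to
    column: str          # technical column name
    semantic_tag: str    # business meaning discovered for the element
    last_refreshed: str  # freshness indicator surfaced to data science users
    trusted: bool = True

# Centralized metadata repository: semantic tag -> data elements mapped to it.
repository: dict[str, list[DataElement]] = {}

def register(element: DataElement) -> None:
    """Record a discovered mapping in the central repository."""
    repository.setdefault(element.semantic_tag, []).append(element)

# Two elements from different datasets mapped to the same business concept.
register(DataElement("crm.contacts", "mail_addr", "customer_email", "2020-06-01"))
register(DataElement("web.events", "user_email", "customer_email", "2020-05-28"))

# A data scientist can now see every element carrying a given meaning.
for element in repository["customer_email"]:
    print(element.dataset, element.column, element.last_refreshed,
          "trusted" if element.trusted else "unverified")
```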

     

    Automating metadata discovery

As numerous and diverse data gets added to a data lake on a daily basis, maintaining ingestion can be quite a challenge! Automated solutions not only make it easier for data scientists or citizen data scientists to find their information, but also support metadata discovery.

     

    Data cataloguing

A data catalog consists of metadata in which various data objects, categories, properties, and fields are stored. Data cataloguing is used for both internal and external data (from partners or suppliers, for example). In a data lake, it is used to capture a robust set of attributes for every piece of content within the lake and enriches the metadata catalog by leveraging these information assets. This enables data science users to have a view into the flow of the data, perform impact analysis, share a common business vocabulary, and maintain accountability and an audit trail for compliance.

     

    Data & Analytics Governance

    Data & analytics governance is an important use case when it comes to metadata management. Applied to data lakes, the question “could it be exposed?” must become an essential part of the organization’s governance model. Enterprises must therefore extend their existing information governance models to specifically address business analytics and data science use cases that are built on the data lakes. Enterprise metadata management helps in providing the means to better understand the current governance rules that relate to strategic types of information assets.

    Contrary to traditional approaches, the key objective of metadata management is to drive a consistent approach to the management of information assets. The more metadata semantics are consistent across all assets, the greater the consistency and understanding, allowing the leveraging of information knowledge across the company. When investing in data lakes, organizations need to consider an effective metadata strategy for those information assets to be leveraged from the data lake.

     

    Start metadata management with Zeenea

    As mentioned above, implementing metadata management into your organization’s data strategy is not only beneficial, but essential for enterprises looking to create business value with their data. Data science teams working with various amounts of data in a data lake need the right solutions to be able to trust and understand their information assets. To support this emerging discipline, Zeenea gives you everything you need to collect, update and leverage your metadata through its next generation platform!

    Check out our metadata management platform

    Build your citizen data scientist team

    by Zeenea Software | Jun 8, 2020 | Metadata Management

    ”There aren’t enough expert data scientists to meet data science and machine learning demands, hence the emergence of citizen data scientists. Data and analytics leaders must empower “citizens” to scale efforts, or risk failure to secure data science as a core competency”. – Gartner 2019

As data science provides competitive advantages for organizations, the demand for expert data scientists is at an all-time high. However, supply remains scarce relative to that demand! This limitation is a threat to enterprises’ competitiveness and, in some cases, their survival in the market.

    In response to this challenge, an important analytical role providing a bridge between data scientists and business functions was born: the citizen data scientist.

     

    What is a citizen data scientist?

    Gartner defines the citizen data scientist as “an emerging set of capabilities and practices that allows users to extract predictive and prescriptive insights from data while not requiring them to be as skilled and technically sophisticated as expert data scientists”. A “Citizen Data Scientist” is not a job title. They are “power users” who can perform both simple and sophisticated analytical tasks.

    Typically, citizen data scientists don’t have coding expertise but can nevertheless build models using drag-and-drop tools and run prebuilt data pipelines and models using tools such as Dataiku. Be aware: citizen data scientists do NOT replace expert data scientists! They bring their own expertise but do not have the specialized expertise for advanced data science.

    The citizen data scientist is a role that has evolved as an “extension” from other roles within the organization! This means that organizations must develop a citizen data scientist persona. Potential citizen data scientists will vary based on their skills and interest in data science and machine learning. Roles that filter into the citizen data scientist category include:

    • Business analysts
    • BI Analysts / Developers
    • Data Analysts
    • Data Engineers
    • Application Developers
• Business line managers

     

    How to empower citizen data scientists?

    As expert skills for data science initiatives tend to be quite expensive and difficult to come by, utilizing a citizen data scientist can be an effective way to close the current gap.

    Here are ways you can empower your data science teams:

     

    Break enterprise silos

As I’m sure you’ve heard many times before, many organizations tend to operate independently, in silos. As mentioned above, all of these roles are important in an organization’s data management strategy, and they have all expressed interest in learning data science and machine learning skills. However, most data science and machine learning knowledge is siloed in the data science department or in specific roles. As a result, data science efforts often go unvalidated and under-leveraged. This lack of collaboration between data roles makes it difficult for citizen data scientists to access and understand enterprise data!

Establishing a community of both business and IT roles that provides detailed guidelines and/or resources allows enterprises to empower citizen data scientists. It is important for organizations to encourage the sharing of data science efforts throughout the organization and thus break silos!

     

    Provide augmented data analytics technology

Technology is fueling the rise of the citizen data scientist. Traditional BI vendors such as SAP, Microsoft, and Tableau Software provide advanced statistical and predictive analytics as part of their offerings. Meanwhile, data science and machine learning platforms such as SAS, H2O.ai, and TIBCO Software provide users who lack advanced analytics skills with “augmented analytics”. Augmented analytics leverages automated machine learning to transform how analytics content is developed, consumed, and shared. It includes:

    Augmented data preparation: machine learning automation to augment data profiling and quality, modeling, enrichment and data cataloguing.

Augmented data discovery: enables business and IT users to automatically find, visualize, and analyze relevant information, such as correlations, clusters, segments, and predictions, without having to build models or write algorithms.

Augmented data science and machine learning: automates key aspects of advanced analytics modeling, such as feature selection, algorithm selection, and other time-consuming steps of the process.

    By incorporating the necessary tools and solutions and extending resources and efforts, enterprises can empower citizen data scientists!

     

    Empower citizen data scientists with a metadata management platform

    Metadata management is an essential discipline for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. By implementing a metadata management strategy, where metadata is well-managed and correctly documented, citizen data scientists are able to easily find and retrieve relevant information from an intuitive platform.

    Discover our tips for starting metadata management in only 6 weeks by downloading our new white paper “The effective guide to start metadata management”!

    How a business glossary empowers your data scientists

    by Zeenea Software | May 26, 2020 | Metadata Management

In the data world, a business glossary is a sacred text that represents long hours of hard work and collaboration between the IT and business departments. In metadata management, it is a crucial part of delivering business value from data. According to Gartner, it is one of the most important solutions to put in place in an enterprise to support business objectives.

    To help your data scientists with their machine learning algorithms and their data initiatives, a business glossary provides clear meanings and context to any data or business term in the company.

    Back to basics: what is a business glossary?

    A business glossary is a place where business and/or data terms are defined and accessible within the entire organization. As simple as it may sound, it is actually a common problem; not all employees agree or share a common understanding of even basic terms such as “contact” or “customer”.

    Its main objectives, among others, are to:

    • Use the same definitions and create a common language between all employees,
    • Have a better understanding and collaboration between business and IT teams,
    • Associate business terms to other assets in the enterprise and offer an overview of their different connections,
    • Elaborate and share a set of rules regarding data governance.

    Organizations are therefore able to have information as a second language.

    How does a business glossary benefit your data scientists?

Centralized business information allows enterprises to share what is essentially tribal knowledge around their data. In fact, it allows Data Scientists to make better decisions when choosing which datasets to use. It also enables:

    A data literate organization

    Gartner predicts that by 2023, data literacy will become an explicit and necessary driver of business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs. Increasingly, organizations are realizing this and beginning to look at data and analytics in a new way.

As part of the Chief Data Officer’s job description, it is essential that all parts of the organization can understand data and business jargon. A business glossary helps all parts of the organization better understand data’s meaning, context, and usage. By putting a business glossary in place, data scientists are able to collaborate efficiently with all departments in the company, whether IT or business. There are fewer communication errors, and they thus contribute to building and improving knowledge of the enterprise’s data assets.

    The implementation of a data culture

Closely related to data literacy, data culture refers to a workplace environment where decisions are backed by empirical data evidence. In other words, executives make decisions based on data evidence, and not just on instinct.

    A business glossary promotes data quality awareness and overall understanding of data in the first place. As a result, the environment becomes more data-driven. Furthermore, business glossaries can help data scientists gain better visibility into their data.

    An increase in trusting data

    A business glossary ensures that the right definitions are used effectively for the right data. It will assist with general problem solving when data misunderstandings are identified. When all datasets are correctly documented with the correct terminology that is understood by all, it increases overall trust in enterprise data, allowing data scientists to efficiently work on their data projects.

They spend less time cleaning and organizing data, and more time bringing valuable insights to maximize business value!

     

    Implement a Business Glossary with Zeenea

Zeenea provides a business glossary within our data catalog. It automatically connects to and imports your existing glossaries and dictionaries through our APIs. You can also manually create a glossary within Zeenea’s interface!

Check out the benefits of our business glossary for your data scientists!

    Contact us

    Data Culture: 5 steps for your enterprise to acculturate to data

    by Zeenea Software | May 18, 2020 | Metadata Management

    Exploding quantities of data have the potential to fuel innovation and produce more value for organizations. Stimulated by the hopes of satisfying customers, enterprises have, for the past decade or so, invested in technologies and paid handsomely for analytical talent. Yet, for many, data-driven culture remains elusive, and data is rarely used as the basis for decision making.

The reason is that the challenges of becoming data-driven aren’t technical, but rather cultural. Describing how to inject data into decision-making processes is far easier than shifting an entire organization’s mindset! In this article, we describe five ways to help enterprises create and sustain a data culture at their core.

    By 2023, data literacy will become an explicit and necessary driver of business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs.

     

    What is data culture?

“Data culture” is a relatively new concept that is becoming increasingly important to put in place, especially for organizations developing their digital and data management strategies. Just like organizational or corporate culture, data culture refers to a workplace environment where decisions are backed by empirical data evidence. In other words, executives make decisions based on data evidence, and not just on instinct.

    Data culture gives organizations more power to organize, operate, predict, and create value with their data.

    >> Check out our webinar: Why does data culture matter <<

    Here are our five tips for creating and sustaining data culture:

    Step 1: Align with business objectives

“The fundamental objective of collecting, analyzing, and deploying data is to make better decisions.” (McKinsey)

    Trusting your data is one of the most important tips for building data culture, as distrust in data leads to disastrous organizational culture. And to trust in data, it must align with business objectives. To drive strategic and cultural changes, it is important for the enterprise to agree on common business goals, as well as the relevant metrics to measure achievements or failures across the entire organization.

    Ask yourself the right questions: How can we not only get ahead of our competitors, but also maintain the lead? What data would we need to decide what our next product offering should be? How is our product performing in the market? By introducing data into your business decision-making process, your enterprise will have already made the first step to building a data culture.

     

    Step 2: Destroy data silos

In this case, data silos refer to departments, groups, or individuals who are the guardians of data and who don’t share, or don’t know how to share, data knowledge with other parts of the enterprise. When crucial information is locked away and available to only a few connoisseurs, it prevents your company from developing a cross-departmental data culture. It is also problematic from a technical standpoint: multiple data pipelines are harder to monitor and maintain, which leads to data being stale and obsolete by the time anyone uses it for decision making.

    To break data silos, enterprises must put in place a single source of truth. Empower employees to make data-driven decisions by relying on a centralized solution. A data catalog enables both technical and non-technical users to understand and trust in the enterprise’s data assets.

    >> Check out our blog post: What is a data catalog? <<

    Step 3: Hire data-driven people

    When building a data culture, it’s important to hire data-driven people. Enterprises are reorganizing themselves, forcing the creation of new roles to support this organizational change:

     

    Data Stewards

    Data Stewards are here to orchestrate an enterprise’s data systems. Often called the “masters of data”, they have the technical and business knowledge of data. Their main mission is to ensure the proper documentation of data and facilitate their availability to their users, such as data scientists or project managers for example.

This profession is on the rise! Their social role allows data stewards to work with both technical and business departments. They are the first point of reference for data in the enterprise and serve as the entry point for accessing data.

     

    Chief Data Officers

Chief Data Officers, or CDOs for short, play a key role in the enterprise’s data strategy. They are in charge of improving the organization’s overall efficiency and its capacity to create value around data. At first, CDOs had to lead a mission to convince organizations of the value of exploiting their data. The first few years of this mission were often supported by the construction of a data universe adapted to new uses, often in the form of a Data Lake or Data Mart. But with the exponential development of data, the role of the CDO has taken on a new scope. From now on, CDOs must reconsider the organization in a cross-functional, global way. They must become the new leaders of Data Democracy!

In order to obtain support for data initiatives from all employees, CDOs must not only help them understand data (its original context, how it is produced, etc.) but also help them invest in the data production strategy and in the exploitation of data.

     

    Step 4: Don’t neglect your metadata

    When data is created, so is metadata (its origin, format, type, etc.). However, this type of information is not enough to properly manage data in this expanding digital era; data managers must invest time in making sure this business asset is properly named, tagged, stored, and archived in a taxonomy that is consistent with all of the other assets in the enterprise.

This metadata allows enterprises to ensure greater data quality and discovery, allowing data teams to better understand their data. Without metadata, enterprises find themselves with datasets without context, and data without context has little value.

     

    Step 5: Respect the various data regulations

If you’re in Europe, this is old news by now. With the GDPR put into place in May 2018, as well as the various other regulations slowly coming into force in the United States, the UK, or even Japan, it is important for enterprises to respect and follow these guidelines in order to comply.

>> If you aren’t sure whether you comply, check out our articles on GDPR compliance <<

Implementing data governance is a way to ensure personal data privacy, data security, and risk management. It is a set of practices, policies, standards, and guides that supplies a solid foundation to ensure that data is properly managed, thus creating value within an organization.

     

    Step 6 BONUS TIP: Choose the right solutions

    Metadata management is the new black: it is an emerging discipline, necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. A metadata management solution offers enterprises a centralized platform to empower all data users in their data culture implementation.

    For more information on metadata management, contact us!

    Contact us

    Empower your data users with the right metadata management solution

    by Zeenea Software | May 5, 2020 | Metadata Management

In order to optimize their business strategies and improve productivity, data-driven organizations are seeing a shift in their data management practices, going from managing and maintaining data to also managing and maintaining metadata. However, Gartner states that only 5% to 20% of enterprises are equipped with metadata management solutions! This enterprise-level discipline is therefore an essential practice that will continue to develop over the next few years:

    ”By 2023, 80% of organizations will require solutions that respond to the needs and use cases of their business personas.” – Gartner.

    The challenges & concerns of metadata management solutions for enterprises

Current metadata management solutions in the market fall short of responding to new organizational concerns. Enterprises typically face:

    • A lack of adoption of current metadata management solutions,
    • A lack of confidence in the data being analyzed,
    • Inability to find data in an organization’s data ecosystem,
    • Solutions that are designed for a technical user and not a business user.

These findings, made by various enterprises using current metadata management solutions, were pretty harsh, and for good reason! In the search for a data democracy culture, enterprises find themselves stuck with very technical tools aimed at the IT department, usually abandoned by business users or misunderstood by the enterprise in general. This technology-driven approach to metadata management leads to a data literacy gap!

Yet, if enterprises seek to implement metadata management, they have already realized the value of metadata in the first place. So the problem here isn’t the discipline itself, but rather the dissatisfaction with, and irrelevance of, the chosen tool and its characteristics.

     

    The six elements to look out for when choosing a metadata management solution

    The original target markets of any metadata management solution were the IT departments searching for a better way to understand data. However, as time passes, other business functions have become more involved with owning or working with data and metadata. Among these personas are data stewards, data analysts, business analysts, data scientists, data architects and data engineers.

Here are six elements that are essential for adopting a successful long-term metadata management strategy:

     

    A personalized user experience

As mentioned above, the data literacy gap is still too wide, and in response, metadata management solutions must have features and functionalities that support data literacy goals in a business-friendly way. Search capabilities are very effective in a metadata management solution: through a Googlesque search engine, data users are able to find relevant information via simple keywords or phrases. Help menus, drag-and-drop features, and wizards are other common examples.

    Metadata management solutions that include artificial intelligence and machine learning features allow for more customizable, user-friendly experiences. As the information requested and viewed by data users varies from person to person, it is important that a metadata management solution offers adaptive & personalized interfaces based on their usages. This information should be displayed with easy and eye-catching visual representations in order to avoid spending too much time trying to understand the data.

     

    Role & Access support

In today’s world, data users often change roles, and enterprises find themselves spending much of their time re-configuring who has access to what information. Adopting a successful metadata management solution means being able to easily configure and define roles and editing permissions within the platform.

    This allows for an overview of all users: who are my data stewards? Who are my data users on X project? Who is the owner of this dataset so I can request permission to access this information? Who last updated this dataset? Role and access management is essential for long term metadata management.

     

    Reporting features

    One of the driving factors for organizations seeking metadata management solutions is the need to gain trusted results from the data flow & analytics. These solutions should have dashboards that are relevant and understandable to business users, associated with their business use case.

Supported by a data catalog, these reporting features indicate whether it is a valuable solution for your enterprise. A data catalog reports the volume of data collected, the number of users it has, the frequency with which users connect to the catalog, how many times a dataset has been viewed, or even its frequently asked questions. It can also provide information regarding data documentation, for example its completion level, whether it contains personal information, etc.

     

    A business glossary

As mentioned above, metadata management solutions should enable business users to navigate through content aligned with their use cases. This includes fields and labels that these users can create themselves, rather than having to adhere to the tool’s taxonomy and semantics.

    Business glossary functionalities must provide personalized and modular templates when creating data terms and taxonomies. Wikis and articles within the tool are not enough! A successful business glossary enables data leaders to build and manage a common business vocabulary and make it available across the entire organization.

     

    Compliance with data regulations

    Governance and compliance in general are major drivers for acquiring metadata management solutions.

    When choosing a metadata management platform, if your main use case is data governance, it is important to seek solutions that offer automated capabilities regarding your enterprise’s personal information. Automated notifications and data fingerprinting technologies provide augmented cataloging and stewardship capabilities for better data governance.
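For illustration only, one simple way such automated detection can work is to scan sampled column values against known patterns and flag likely personal data; the patterns, threshold, and column names below are assumptions, not how any specific catalog (Zeenea included) implements fingerprinting:

```python
import re

# Illustrative patterns for a couple of kinds of personal data.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def flag_personal_columns(samples: dict, threshold: float = 0.8) -> dict:
    """Flag a column as personal data when most sampled values match a known pattern."""
    flags = {}
    for column, values in samples.items():
        for label, pattern in PATTERNS.items():
            hits = sum(1 for value in values if pattern.match(value or ""))
            if values and hits / len(values) >= threshold:
                flags[column] = label
    return flags

# Hypothetical sampled values for two columns of a dataset.
samples = {
    "contact": ["ana@example.com", "bob@example.org", "carl@example.net"],
    "comment": ["great product", "call me back", "n/a"],
}
print(flag_personal_columns(samples))  # {'contact': 'email'}
```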

    Collaborative capabilities

    Social features are a must for companies seeking metadata management. Discussions, ratings, notes, popularity, notifications and messaging capabilities are all important elements to have. For example, social capabilities allow users to easily communicate with data stewards or specific experts linked to a dataset or project.

Moreover, collective intelligence allows enterprises to leverage crowdsourced information and knowledge. With collaborative features, enterprises are able to “archive” previous knowledge by storing it. Companies thus create data communities and a more data-literate organization!

     

    Start metadata management in just 6 weeks!

    When it comes to metadata management, Zeenea’s got your back. In this white paper, we share our advice and expertise on implementing iterative metadata management optimized for your context.

    Download our white paper

    Webinar: 6 weeks to start metadata management

    by Zeenea Software | Apr 23, 2020 | News & events

As the Data Innovation Summit was postponed to August 20th-21st, we prepared a 40-minute webinar about our new White Paper: 6 weeks to start metadata management.

    Join Luc Legardeur – Co-founder of Zeenea – on May 14th at 11AM!

    Register now

    Metadata management: an essential discipline

    Metadata management is still an emerging discipline, but is especially necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets.

    Many of them are trying to establish their convictions on the subject and brainstorm solutions to meet this new challenge.

    Throughout this webinar, we would like to share our advice and expertise on implementing iterative metadata management in only 6 weeks.

    Key takeaways

    Thanks to this webinar, you will learn about:

    • The benefits of Metadata management.
    • The prerequisites for a successful launch of a metadata management system.
    • How to manage metadata?
    • How to start metadata management?
    Register now

    Download our White Paper: “The effective guide to start metadata management”

    Can’t attend our webinar? Download our free guide to start metadata management. 

Many enterprises are trying to establish their convictions on the subject and brainstorm solutions. As a result, metadata is increasingly being managed, alongside data, in a partitioned and siloed way that does not allow the full, enterprise-wide potential of this discipline to be realized.

      Download

      WhereHows: A data discovery and lineage portal for LinkedIn

      by Zeenea Software | Apr 20, 2020 | Data Inspiration, Metadata Management

      Metadata is becoming increasingly important for modern data-driven enterprises. In a world where the data landscape is increasing at a rapid pace, and information systems are more and more complex, organizations in all sectors have understood the importance of being able to discover, understand and trust in their data assets.

Whether your business is in the streaming industry like Spotify or Netflix, the ride-sharing industry like Uber or Lyft, or even the rental business like Airbnb, it is essential for data teams to be equipped with the right tools and solutions that allow them to innovate and produce value with their data.

      In this article, we will focus on WhereHows, an open source project led by the LinkedIn data team, that works by creating a central repository and portal for people, processes, and knowledge around data. With more than 50 thousand datasets, 14 thousand comments, and 35 million job executions and related lineage information, it is clear that LinkedIn’s data discovery portal is a success.

       

      First, LinkedIn key statistics

      Founded by Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly, and Jean-Luc Vaillant in 2003 in California, the firm started out very slowly. In 2007, they finally became profitable, and in 2011 had more than 100 million members worldwide.

As of 2020, LinkedIn had grown significantly:

      • More than 660 million LinkedIn members worldwide, with 206 million active users in Europe,
      • More than 80 million users on LinkedIn Slideshare,
      • More than 9 billion content impressions,
• 30 million companies registered worldwide.

      LinkedIn is definitely a must-have professional social networking application for recruiters, marketers, and even sales professionals. So, how does the Web Giant keep up with all of this data?

       

      How it all started

Like most companies with a mature BI ecosystem, LinkedIn started out with a data warehouse team responsible for integrating various information sources into consolidated golden datasets. As the number of datasets, producers, and consumers grew, the team felt increasingly overwhelmed by the colossal amount of data being generated each day. Some of their questions were:

      • Who is the owner of this data flow?
      • How did this data get here?
• Where is the data?
• What data is being used?

In response, LinkedIn decided to build a central metadata repository to capture metadata across all systems and surface it through a single platform to simplify data discovery: WhereHows!

      What is WhereHows exactly?


      WhereHows integrates with all data processing environments and extracts metadata from them.

      Then, it surfaces this information via two different interfaces:

      1. A web application that enables navigation, searching, lineage visualization, discussions, and collaboration,
2. An API endpoint that enables the automation of other data processes and applications.

      This repository enables LinkedIn to solve problems around data lineage, data ownership, schema discovery, operational metadata mashup, data profiling, and cross-cluster comparison. In addition, they implemented machine-based pattern detection and association between the business glossary and their datasets, and created a community based on participation and collaboration that enables them to maintain metadata documentation by encouraging conversations and pride in ownership.

      There are three major components of WhereHows:

       

      1. A data repository that stores all metadata
      2. A web server that surfaces data through API and UI
      3. A backend server that fetches metadata from other information sources

      How does WhereHows work?

The power of WhereHows comes from the metadata it collects from LinkedIn’s data ecosystem. It collects the following metadata:

• Operational metadata, such as jobs, flows, etc.,
• Lineage information, which is what connects jobs and datasets together,
• Catalogued information, such as a dataset’s location, schema structure, ownership, creation date, and so on.

      How they use metadata

      WhereHows uses a universal model that enables data teams to better leverage the value from the metadata; for example, by conducting a search across the different platforms based on different aspects of datasets.

Also, the metadata in a dataset and a job’s operational metadata are two endpoints. The lineage information connects them together and enables data teams to trace from a dataset or job to its upstream and downstream jobs and datasets. If the entire data ecosystem is collected into WhereHows, they can trace the data flow from start to finish!
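The following sketch is not WhereHows code; it simply illustrates, under assumed dataset and job names, how lineage edges between jobs and datasets make this upstream/downstream tracing possible:

```python
from collections import defaultdict

# Lineage edges: each job reads some datasets and writes others (names are invented).
edges = [
    ("ds.raw_events", "job.sessionize"),
    ("job.sessionize", "ds.sessions"),
    ("ds.sessions", "job.aggregate"),
    ("ds.profiles", "job.aggregate"),
    ("job.aggregate", "ds.daily_metrics"),
]

downstream, upstream = defaultdict(set), defaultdict(set)
for source, target in edges:
    downstream[source].add(target)
    upstream[target].add(source)

def trace(node, graph):
    """Walk the lineage graph from a dataset or job to everything it reaches."""
    seen, stack = set(), [node]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(trace("ds.raw_events", downstream))   # everything derived from the raw events
print(trace("ds.daily_metrics", upstream))  # everything the daily metrics depend on
```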

      How they collect metadata

The method used to collect metadata depends on the source. For example, Hadoop datasets have scraper jobs that scan through HDFS folders and files, read the metadata, then store it back.

      For schedulers such as Azkaban, they connect their backend repository to get the metadata, aggregate it and transform it to the format they need, then load it into WhereHows. For the lineage information, they parse the log of a MapReduce job and a scheduler’s execution log, then combine that information together to get the lineage.
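As a rough illustration of the shape of such a collection job (the real scrapers run against HDFS and scheduler back-ends, not a local disk), here is a minimal sketch that walks a folder tree as a stand-in and emits file-level metadata; the path and fields are hypothetical:

```python
import json
import os

def scrape_folder(root: str) -> list:
    """Walk a folder tree (standing in for an HDFS scan) and collect file-level metadata."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "location": path,
                "size_bytes": info.st_size,
                "modified": info.st_mtime,
                "format": os.path.splitext(name)[1].lstrip("."),
            })
    return records

if __name__ == "__main__":
    # Hypothetical landing zone; a real scraper would load these records into
    # the central metadata repository rather than print them.
    print(json.dumps(scrape_folder("/tmp/datalake"), indent=2))
```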

       

      What’s next for WhereHows?

Today, WhereHows is actively used at LinkedIn not only as a metadata repository, but also to automate other data projects such as automated data purging for compliance. By 2016, it had already been integrated with a range of systems across LinkedIn’s stack.

In the future, LinkedIn’s data teams hope to broaden their metadata coverage by integrating more systems such as Kafka or Samza. They also plan on integrating with data lifecycle management and provisioning systems like Nuage or Gobblin to enrich the metadata. WhereHows has not had its final word!

      Sources:

       

      • 50 of the Most Important LinkedIn Stats for 2020: https://influencermarketinghub.com/linkedin-stats/
      • Open Sourcing WhereHows: A Data Discovery and Lineage Portal:
        https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows–a-data-discovery-and-lineage-portal

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      Data management: don’t neglect your metadata!

      by Zeenea Software | Apr 15, 2020 | Metadata Management

      Data management can be defined as the process of ingesting, storing, organizing and maintaining all data created and collected by an organization in order to help drive operational decision-making and strategic planning.

      It won’t be a surprise if we tell you that data topics are constantly evolving and becoming more complex within organizations! As a result, any organization considering these large-scale data and analytics initiatives is increasingly faced with high-volume data of various types, formats, and distributed environments.

In an attempt to maximize data’s value, metadata provides the answer: knowledge about where the data is located, what attributes it has, and how it is linked to other assets (also called a knowledge graph). Yet most organizations do not yet have a formal approach to metadata management.

      Let us convince you of its necessity in this article…

       

      The challenges of metadata in a next-gen data management

In an increasingly dispersed and complex technology environment, Data Managers, or Chief Data Officers, are tasked with providing a consistent, simplified data environment that their teams can activate.

       Among our clients who have taken the gamble of initiating metadata management, we see a common objective: to ensure the visibility of different data sources and initiatives and to involve new players who do not necessarily have technical profiles.

      In short, the need to align semantics across multiple data silos is driving an increased demand for metadata governance capabilities. 

See this new data management discipline as a lever to better describe your data, including information on its location, in order to facilitate its use and/or protection across diverse environments and sources.

Here is an excerpt of the questions your metadata will be able to answer (a minimal illustrative sketch follows the list):

      • Who created this data?
      • Who is responsible for this data?
      • In what applications is it used?
      • What is the level of reliability (quality, speed, etc.) of this data?
      • What are the permitted contexts of use (e.g. confidentiality)?
      • Where is the data located?
      • Where does this data come from (a partner, open data, internally, etc.)?
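To make this more concrete, here is a minimal sketch of a metamodel template that could hold the answers to these questions; every field name is an illustrative assumption, not a prescribed Zeenea model:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetMetadata:
    """Illustrative metamodel: one record per dataset, answering the questions above."""
    name: str
    created_by: str                  # Who created this data?
    owner: str                       # Who is responsible for this data?
    used_in_applications: List[str]  # In what applications is it used?
    reliability: str                 # Level of reliability (quality, freshness, ...)
    usage_constraints: str           # Permitted contexts of use (e.g. confidentiality)
    location: str                    # Where is the data located?
    source: Optional[str] = None     # Where does it come from (partner, open data, internal)?

example = DatasetMetadata(
    name="customer_orders",
    created_by="erp_ingestion_job",
    owner="jane.doe@acme.example",
    used_in_applications=["billing", "churn_model"],
    reliability="refreshed daily, 99% complete",
    usage_constraints="contains personal data - GDPR restricted",
    location="s3://acme-datalake/sales/customer_orders/",
    source="internal ERP",
)
print(example)
```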

      Create a metamodel template!
      In this toolkit, we highlight a set of questions that you will be able to answer using the metadata collected from your systems and your own knowledge.

      Download

      Our recommendations to data management stakeholders

For those who are approaching metadata management today as part of their data management strategies, we advise you to:

• Progressively deploy an enterprise data catalog by adopting metadata management practices. The use of data catalogs will make it possible, among other things, to inventory all forms of metadata – technical, but also increasingly business, operational, and social – in order to improve the visibility of data management activities.
      • Work with suppliers who are able to accept this diversity in their systems and operate in distributed, independent and increasingly cloud-connected data management infrastructures. 
• Identify metadata management use cases that can be easily activated in order to quickly prove its value. The solution providers selected should be those that automate the discovery, profiling, and inventorying of metadata, or at least the most tedious tasks.

      Go further by downloading our metadata management guide!

      It will guide you in implementing a metadata management strategy in just 6 weeks.

      Download

      How do I start metadata management?

      by Zeenea Software | Apr 10, 2020 | Metadata Management

Metadata management is an emerging discipline, and it is necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets.

Many of them are trying to establish their convictions on the subject and brainstorm solutions to meet this new challenge. As a result, metadata is increasingly being managed, alongside data, in a partitioned and siloed way that does not allow the full, enterprise-wide potential of this discipline to be realized.

       

      1. How to successfully launch a metadata management platform
2. How to manage metadata
      3. How to start metadata management

To establish the value of this “new” discipline in your organization, you need to demonstrate its ability to deliver value from the outset! At Zeenea, we offer strong data catalog support to produce value in a very short time frame, in most cases a matter of a few weeks.

In this article, we describe an approach facilitated by a solution like Zeenea: connected, agile, and agnostic about the technologies used in enterprises.

      Set a milestone

      For each milestone, you will be asked to identify several elements:

        • What are the problems: Increasing and sharing knowledge must solve a problem. It can be of various types: compliance for a scheduled audit, centralization and uniformity of a particular piece of information to satisfy a group of collaborators in a difficult situation, for example.
        • What data: It is important to focus efforts on data directly related to the identified problem. Trying to deal with too large a data set will lengthen the time frame and ultimately extend the time at which the achievement of the objective, and therefore the production of value, could be measured.
• Who are my data users: On the first iteration, which may be longer than any other, the users you mobilize will shape the effort. They must be able to free up enough time to invest themselves in achieving the objective, but they must also have the motivation to put metadata management in place. These users will be your first ambassadors moving forward.
• How much time do I have: This iteration must be completed within a reasonably short period of time. On an enterprise-wide basis, we recommend a timeframe of between 4 weeks and 3 months maximum, depending on the bandwidth of the people in charge of the subject. This duration should also help to qualify whether a particular problem is appropriate, should be subdivided, or simply discarded.

      Sometimes, even before the first milestone is identified, a preliminary introspective exercise on the enterprise’s data governance maturity is carried out.

      We suggest this through workshops during which the company, with the help of our maturity matrix, will be able to define its positioning. This type of exercise is of particular interest when it is carried out regularly (e.g. every year). It allows a global assessment of the benefits of deploying your governance program.

      Launch the first milestone

      Typically, the chronological sequence for the onboarding phase supported by our metadata management tool is the following:

       

      Our desire is to anchor the launch in a value-producing reflex. Each iteration must bring the company tangible benefits that address your issues. This first iteration includes elements that will not appear anymore, or at least much less, in the following iterations, in particular the technical aspects related to the implementation of the solution.

We suggest 6-week iterations by default. This duration, while fairly arbitrary, corresponds relatively well to the time generally required to produce significant value without overly disrupting the activity of the people involved. Indeed, keep in mind that the mobilized collaborators rarely have full-time availability to deal with the subject.

      Get started with metadata management with our guide

      Do you want more details on how to launch your metadata management project? Download our guide to starting effective metadata management!

      DOWNLOAD

      How do I manage my metadata?

      by Zeenea Software | Apr 10, 2020 | Metadata Management

Metadata management is an emerging discipline, and it is necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets.

Many of them are trying to establish their convictions on the subject and brainstorm solutions to meet this new challenge. As a result, metadata is increasingly being managed, alongside data, in a partitioned and siloed way that does not allow the full, enterprise-wide potential of this discipline to be realized.

       

      1. How to successfully launch a metadata management platform
2. How to manage metadata
      3. How to start metadata management

Is your organization in the process of developing, or considering the development of, a metadata management solution? Metadata management is essential to meet the growing demands of data governance, data risk and compliance, as well as data analysis and value generation.

      In order to support this discipline, you will have to choose a metadata management platform. Simply put, these solutions must enable data managers to capture, store and aggregate metadata from the enterprise IS on a single platform.

      You will soon realize that the market is complex: the solutions are diverse and their scope or capacity more or less limited. Take some time to assess the functional capabilities of your metadata management solution to help you:

      Centralize your efforts

      Ensure that metadata efforts are not isolated but centralized and unified.

      In this way you will avoid reproducing silos of information in the same way that data has been siloed in the past. To do this, check the platform’s operational ability to integrate and manage assets coming from different sources of information.

      After all, metadata is everywhere in your IS: applications, relational or non relational databases, the cloud, business glossaries, data dictionaries or even Excel files!

      Match your context

      We are firmly convinced that it is not up to the company to comply with the solution’s documentation model, but rather the other way around!

By adopting a modular and customizable solution, you will be able to adjust, prioritize, and add the missing elements necessary for your data consumers. Building on this, take a step-by-step, incremental, and iterative approach to metadata governance that fits your context and priorities. So take some time to collect the needs and difficulties of your data users and consumers. Then, create a template of relevant documentation for your consumers that will ultimately guide your metadata collection efforts.

Quicken the pace (and do it well!)

We consider ingestion automation and the intelligence of the solution to be key success factors. These capabilities let you automate the most tedious tasks and surface synergies across your data assets, so that your data is documented and contextualized ever more thoroughly. Under the regulatory prism, for example, an intelligent platform can identify, from its existing documentation, which datasets are considered "sensitive" and contain personal data.
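To make this concrete, here is a minimal, hypothetical sketch of how such a detection could work, flagging datasets whose existing documentation mentions personal data. The keyword list, field names and catalog structure are illustrative assumptions, not Zeenea's actual implementation.

import re

# Illustrative (and deliberately incomplete) indicators of personal data.
PII_HINTS = re.compile(r"\b(email|phone|address|birth date|ssn|iban|surname)\b", re.IGNORECASE)

def flag_sensitive(catalog):
    """Return the names of datasets whose documentation suggests personal data."""
    sensitive = []
    for dataset in catalog:
        # Scan the dataset description together with its field descriptions.
        text = " ".join([dataset.get("description", "")]
                        + [field.get("description", "") for field in dataset.get("fields", [])])
        if PII_HINTS.search(text):
            sensitive.append(dataset["name"])
    return sensitive

# Hypothetical catalog entries.
catalog = [
    {"name": "orders", "description": "Customer orders",
     "fields": [{"name": "contact", "description": "Customer email and phone number"}]},
    {"name": "daily_sales", "description": "Aggregated revenue per store", "fields": []},
]

print(flag_sensitive(catalog))  # ['orders']

A real platform would go well beyond keyword matching, but the principle is the same: reuse documentation that already exists to infer sensitivity rather than asking teams to re-tag everything by hand.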

      Explore your metadata

      Aggregating IS metadata into a platform only makes sense if it is shared across the enterprise and easily accessible by your data consumers. Data catalogs respond to this first use case:

The easier it is for users to access the information, the better the data catalog! Its purpose is to enable analysts, data scientists and other data consumers to find and understand an enterprise's data assets in order to extract value from them.

      Start metadata management in just 6 weeks.

In many organizations, metadata management is still a manual, extremely time-consuming task undertaken by technical profiles, for technical profiles.

As a result, metadata management as a discipline has gone largely unnoticed by data and analytics stakeholders, and teams still struggle to explain its benefits or demonstrate its value.

      Download our guide to start your enterprise metadata management journey! In this white paper, we share our advice and expertise on implementing iterative metadata management optimized for your context.

      DOWNLOAD

      How to successfully launch a metadata management platform

      How to successfully launch a metadata management platform

      by Zeenea Software | Apr 10, 2020 | Metadata Management

Metadata management is an emerging discipline, necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets.

Many of them are trying to form their convictions on the subject and brainstorm solutions to meet this new challenge. As a result, metadata is increasingly being managed, alongside data, in a partitioned and siloed way that prevents the full, enterprise-wide potential of this discipline from being realized.

       

      1. How to successfully launch a metadata management platform
      2. How to manage metadata
      3. How to start metadata management

      Before starting a metadata management project, here are some elements to consider:

      Accepting failure

As strong as this title may sound, fearing failure won't prevent it. Being aware of the risk and knowing how to integrate it into the approach is a crucial part of launching a metadata management platform.

To accept failure is to admit that the road will not be paved with simple and obvious steps.

      Implementing data governance around relevant metadata management is a complex subject, exacerbated by many factors: size and complexity of the organization, culture or sensitivity concerning data as a subject, awareness of the associated strategic issues, etc. Naturally, a complex subject entails a certain risk during operational execution…

      Experimenting with your data environment

Metadata management is built gradually, and no revelation will strike the team at the outset. Only experimentation makes it possible to validate decisions.

In order to control the costs of these experiments, the most appropriate approach is to progress step by step. Accepting failure is not resignation; on the contrary, it puts the team in a position where effort goes not only into anticipation but also into remediation and adaptation. Hypotheses are validated one after another, changing as few parameters as possible each time, and the measured conclusions allow progress to be made.

This way of working is fully iterative and incremental; it echoes the fundamentals promoted, in a general way, by agile methods.

      Aligning with enterprise objectives

      The objectives of your metadata governance can be local or global.

The approach may concern only a limited perimeter of the enterprise and reflect a very local initiative, or conversely be intended to apply to the enterprise as a whole. Especially when objectives are global to the enterprise, and therefore often expressed in fairly general terms, it is important to ensure that their implementation remains in line with the original intent.

An important element in this equation is the human dimension: responsibilities will have to be identified, certain processes will have to evolve or be defined, and the culture and preconceptions surrounding data will have to change, which takes a great deal of communication.

Nevertheless, adapting is, in general, a healthy enterprise practice.

      Prioritizing

Among the benefits of such an approach is, as mentioned above, better risk control.

But there is another obvious benefit: the possibility of a faster return on investment. The first effects should be noticeable as soon as the first iteration is complete.

Each objective must be set so as to produce concrete value for the enterprise.

      Selecting the useful information

Being selective about the nature of the information characterizing the data helps identify what is truly useful. An overly ambitious metamodel could actually be detrimental to the qualitative effort required of Data Stewards.

We therefore recommend a very precise selection of the metadata that meets the objectives of each iteration. The organization of knowledge in a metadata catalog will then be optimized, both for the contributors and Stewards and for the users searching for information. Quality must take precedence over quantity, and the iterative approach will meet enrichment expectations as deployment progresses.


      Capitalize on your metadata!

Last but not least, your experimentation will result in local initiatives that may prompt reflection on generalizing all or part of what has been achieved.

      To capitalize is to know how to identify what is in the common interest.

      Start metadata management in just 6 weeks.

In many organizations, metadata management is still a manual, extremely time-consuming task undertaken by technical profiles, for technical profiles.

As a result, metadata management as a discipline has gone largely unnoticed by data and analytics stakeholders, and teams still struggle to explain its benefits or demonstrate its value.

      Download our guide to start your enterprise metadata management journey! In this white paper, we share our advice and expertise on implementing iterative metadata management optimized for your context.

      DOWNLOAD

      How Spotify improved their Data Discovery for their Data Scientists

      How Spotify improved their Data Discovery for their Data Scientists

      by Zeenea Software | Mar 19, 2020 | Data Inspiration, Metadata Management

As the world leader in the music streaming market, Spotify is unquestionably driven by data.

      Spotify has access to the biggest collections of music in the world, along with podcasts and other audio content.

      Whether they’re considering a shift in product strategy or deciding which tracks they should add, Spotify says that “data provides a foundation for sound decision making”.

      Spotify in numbers

Founded in 2006 in Stockholm, Sweden, by Daniel Ek and Martin Lorentzon, the leading music app set out to create a legal music platform to fight online music piracy in the early 2000s.

      Here are some statistics & facts about Spotify in 2020:

      • 248 million active users worldwide,
      • 20,000 songs are added per day on their platform,
      • Spotify has 40% share of the global music streaming market,
      • 20 billion hours of music were streamed in 2015

These numbers not only represent Spotify's success, but also the colossal amount of data generated each year, let alone each day! To enable their employees, or as they call them, Spotifiers, to make faster and smarter decisions, Spotify developed Lexikon.

      Lexikon is a library of data and insights that helps employees find and understand their data and knowledge generated by their expert community.

       

      What were the data issues at Spotify?

In their article How We Improved Data Discovery for Data Scientists at Spotify, the company explains that they started their data strategy by migrating to the Google Cloud Platform and saw an explosion in their datasets. They were also hiring many data specialists such as data scientists, analysts, etc. However, datasets lacked clear ownership and had little-to-no documentation, making it difficult for these experts to find them.

      The next year, they released Lexikon, as a solution for this problem.

Their first release allowed Spotifiers to search and browse available BigQuery tables and discover past research and analyses. However, months after the launch, their data scientists were still reporting data discovery as a major pain point, spending most of their time trying to find datasets and thereby delaying informed decision-making.

Spotify then decided to focus on this specific issue by iterating on Lexikon, with the sole goal of improving the data discovery experience for data scientists.

      How does Lexikon data discovery work?

In order for Lexikon to work, Spotify started by researching their users, their needs, and their pain points. In doing so, the firm gained a better understanding of user intent and used it to drive product development.

       

      Low intent data discovery

For example, say you've been in a foul mood and would like to listen to music to lift your spirits. You open Spotify, browse through different mood playlists and put on the "Mood Booster" playlist.

Ta-da! This is an example of low-intent data discovery: you reached your goal without a very precise requirement in mind.

To put this into the context of Spotify's data scientists, especially new ones, low-intent data discovery would be to:

      • find popular datasets used widely across the company,
      • find datasets that are relevant to the work my team is doing, and/or
      • find datasets that I might not be using, but I should know about.

      So in order to satisfy these needs, Lexikon has a customizable homepage to serve personalized recommendations to users. The homepage recommends potentially relevant, automatically generated suggestions for datasets such as:

       

      • popular datasets used within the company,
      • datasets recently used by the user,
      • datasets widely used by the team the user belongs to.

      High intent data discovery

      To explain this in simple terms, Spotify uses the example of hearing a song, and researching it over and over in the app until you finally find it, and listen to it on repeat. This is high intent data discovery!

A data scientist at Spotify with high intent has specific goals and is likely to know exactly what they are looking for. For example, they might want to:

      • find a dataset by its name,
      • find a dataset that contains a specific schema field,
      • find a dataset related to a particular topic,
      • find a dataset that a colleague used but whose name they can't remember,
      • find the top datasets that a team has used for collaborative purposes.

To fulfill their data scientists' needs, Spotify focused first on the search experience.

They built a search ranking algorithm based on popularity. As a result, data scientists reported that their search results were more relevant and that they had more confidence in the datasets they discovered, because they could see which ones were the most widely used across the company.
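Spotify does not publish the algorithm itself, so the following is only a rough sketch of the idea, with hypothetical table names and usage counts: results that already match the query are reordered by how heavily each dataset is used.

import math

# Hypothetical usage counts (e.g., queries over the last 30 days).
popularity = {
    "track_plays_daily": 12500,
    "track_plays_daily_backup": 40,
}

def rank(matching_tables):
    """Order tables that already match the query by log-scaled popularity."""
    return sorted(matching_tables,
                  key=lambda name: math.log1p(popularity.get(name, 0)),
                  reverse=True)

print(rank(["track_plays_daily_backup", "track_plays_daily"]))
# ['track_plays_daily', 'track_plays_daily_backup'] -> the widely used table comes first

The log scaling simply keeps one extremely popular table from drowning out everything else; a production ranker would combine this signal with text relevance and other factors.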

      In addition to improving their search rank, they introduced new types of properties (schemas, fields, contact, team, etc.) to Lexikon to better represent their data landscape.

These properties open up new pathways for data discovery. In the example below, a data scientist searches for "track_uri"; they can navigate to the "track_uri" schema field page and see the top tables containing this information. Since its addition, this feature has proven to be a critical pathway for data discovery, with 44% of Lexikon users visiting these types of pages.


      Final thoughts on Lexikon

      Since making these improvements, the use of Lexikon amongst data scientists has increased from 75% to 95%, putting it in the top 5 tools used by data scientists!

Data discovery is thus no longer a major pain point for Spotifiers.

      Sources:

      Spotify Usage and Revenue Statistics (2019): https://www.businessofapps.com/data/spotify-statistics/
      How We Improved Data Discovery for Data Scientists at Spotify: https://labs.spotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/
      75 amazing Spotify Statistics and Facts (2020): https://expandedramblings.com/index.php/spotify-statistics/

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      Metadata through the eyes of Web Giants

      Metadata through the eyes of Web Giants

      by Zeenea Software | Mar 17, 2020 | Data Inspiration, Metadata Management

      Data life cycle analysis is an element in data management that enterprises are still struggling to implement.

Organizations at the forefront of data innovation such as Uber, LinkedIn, Netflix, Airbnb and Lyft have also recognized the value of metadata in tackling a challenge of this magnitude.

They have therefore developed metadata management strategies using dedicated platforms. Frequently custom-built, these platforms facilitate data ingestion, indexing, search, annotation and discovery in order to maintain high-quality datasets.

      The following examples highlight a shared constant: the difficulty, increased by volume and variety, of transforming business data into exploitable knowledge.

      Let’s take a look at the analysis and context of these Web giants:

      Uber

      Every interaction on Uber’s platform, from their ride sharing services to their food deliveries, is data-driven. Through analysis, their data enables more reliable and relevant user experiences.

      Uber’s key stats

      • trillions of Kafka messages a day,
      • hundreds of petabytes of data in HDFS in data centers,
      • millions of analytical queries weekly.

      However, the volume of data generated alone is not sufficient to leverage the information it represents; to be used effectively and efficiently, data requires more context to make optimal business decisions.

      To provide additional information, Uber therefore developed “Databook”, the company’s internal platform that collects and manages metadata on internal datasets in order to transform data into knowledge.

Databook is designed to enable Uber employees to effectively explore, discover and use Uber's data. Databook gives context to their data (its meaning, quality, etc.) and ensures that it is maintained in the platform for the thousands of employees who want to analyze it. In short, Databook's metadata enables data leaders to move from viewing raw data to actionable knowledge.

The article Databook: Turning Big Data into Knowledge with Metadata at Uber concludes that one of Databook's biggest challenges was moving from manual metadata repository updates to automation.

      Airbnb

At a conference in May 2017, John Bodley, Data Engineer at Airbnb, outlined new issues arising from the company's growth: a confusing, non-unified landscape that was blocking access to increasingly important information. What can be done with all this data collected on a daily basis? How can it be turned into assets for all Airbnb employees?

A dedicated team set out to develop a tool that would democratize access to data within the company. Their work drew both on the knowledge of the analysts, able to pinpoint the critical issues, and on that of the engineers, who brought a more technical vision. At the heart of the project, employees were interviewed about their difficulties.

      What emerged from this survey was a difficulty in finding the information employees needed to work, and a still too tribal approach to sharing and holding information.

To meet these challenges, Airbnb created Data Portal, a metadata management platform that centralizes this information and shares it through a self-service interface.

      Lyft

      Lyft is a ride-sharing service and is Uber’s main competitor in the North American market.

The company found it was providing data access to its analytical profiles inefficiently, and focused on making data knowledge available in order to optimize its processes. In just a few months, their goal of creating an interface for exploring data surfaced two major challenges:

      • Productivity – Whether it’s to create a new model, instrument a new metric, or perform an ad hoc analysis, how can Lyft use this data in the most productive and efficient way possible?
      • Compliance – When collecting data about an organization’s users, how can Lyft comply with increasing regulatory requirements and maintain the trust of its users?

      In their article Amundsen – Lyft’s data discovery & metadata engine, Lyft states that the key does not lie in the data, but in the metadata!

      Netflix

      As the world leader in video streaming, data exploitation at Netflix is, of course, a major strategic focus.

      Given the diversity of their data sources, the video platform wanted to offer a way to federate and interact with these assets from a single tool. This search for a solution led to Metacat.

      This tool acts as a layer of access to data and metadata from Netflix data sources. It allows its users to access data from any storage system through three different features:

      1. Adding business metadata: user-defined, free-form business metadata can be added by hand via Metacat.
      2. Data discovery: The tool publishes schema and business metadata defined by its users in Elasticsearch, facilitating full-text search of information in data sources.
      3. Data Change Notification and Auditing: Metacat records and notifies all changes to metadata from storage systems.

In their blog article "Metacat: Making Big Data Discoverable and Meaningful," Netflix confirms that they are far from finished working on their solution!

      There are a few more features they have yet to work on to improve the data warehousing experience:

      • Schema and metadata versioning to provide table history.
      • Contextual information on tables for better data lineage.
      • Support for datastores like Elasticsearch and Kafka.

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      Amundsen: How Lyft is able to easily discover their data

      Amundsen: How Lyft is able to easily discover their data

      by Zeenea Software | Feb 27, 2020 | Data Inspiration, Metadata Management

In our last article, we spoke of Uber's Databook, an in-house platform designed by their own engineers to turn data into contextualized assets. In this article, we focus on Lyft's own data discovery and metadata platform: Amundsen.

In response to Uber's success, the ride-sharing market saw a major wave of competitors arrive, and among them is Lyft.

      Lyft key figures & statistics

Founded in 2012 in San Francisco, Lyft operates in more than 300 cities across the United States and Canada. With over 29% of the US ride-sharing market*, Lyft has secured second position, running neck and neck with Uber. Some key statistics on Lyft include:

      • 23 million Lyft users as of January 2018,
      • More than a billion Lyft rides,
      • 1.4 million drivers (Dec. 2017).

      And of course, those numbers have transformed into colossal amounts of data to manage! In a modern data-driven company such as Lyft, it is evident that their platform is powered by their data. With the rapid increase of the data landscape, it becomes increasingly difficult to know what data exists, how to access them and what information is available.

      This problem led to the creation of Amundsen, Lyft’s open source data discovery solution and metadata platform.

      Let’s get to know Amundsen

Named after the Norwegian explorer Roald Amundsen, the platform improves Lyft's data users' productivity by providing an intuitive search interface for data, which looks like this:

      While Lyft’s data scientists wanted to spend the majority of the time on model development and production, they realized that most of their time was being spent on data discovery. They would find themselves asking questions such as:

      • Does this data exist? If it does, where can I find it? Can I access it?
      • Who / which team is the owner? Who are the common users?
      • Can I trust this data?

      To answer these questions, Lyft was inspired by search engines like Google.

As shown above, the entry point is a simple search box where users can type any keyword such as "customers", "employees" or "price". However, if data users do not know what they are looking for, the platform presents them with a list of the most popular tables so they can browse freely.

       

      Some key features:

Search results are shown as a list with each table's description and the date it was last updated. The ranking is similar to Google's PageRank: the most popular and relevant tables show up first.

When a data user at Lyft finds what they're looking for and selects it, they are directed to a detail page showing the name of the table and its manually curated description. Users can also manually add tags, owners, and other descriptions. However, much of the metadata, such as a table's popularity or its frequent users, is curated automatically.

      When in a table, users are able to explore the associated columns to further discover the table’s metadata.

For example, selecting the column "distance_travelled" as shown below reveals a short definition of the field and related statistics, such as the record count and the maximum, minimum and average values, helping data scientists better understand the shape of their data.

Lastly, users can preview a dataset's records by pressing the preview button on the page, provided they have access to the underlying data in the first place.

      How Amundsen democratizes data discovery

      Showing the relevant data

      Amundsen now empowers all employees at Lyft, from new employees to the most experienced, to become autonomous in their data discovery for their daily tasks.

Now let's talk technical. Lyft's data warehouse is on Hive and all physical partitions are stored in S3. Their data users rely on Presto, a live query engine, for table discovery. In order for the search engine to surface the most important or relevant tables, Lyft uses the DataBuilder framework to build a query usage extractor that parses query logs to get table usage data. This table usage is then persisted as an Elasticsearch document. And that is how, in short, they retrieve the most relevant datasets for their data users.
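Lyft's DataBuilder extractor is open source, but as a rough, simplified illustration of the idea (not Lyft's actual code), one could count table references in query logs and push each count to Elasticsearch through its standard REST document API. The log lines and the local endpoint below are assumptions.

import re
from collections import Counter

import requests  # only Elasticsearch's plain REST API is used here

# Hypothetical query log lines; in practice these come from Presto/Hive logs.
query_log = [
    "SELECT * FROM rides.trips WHERE city = 'SF'",
    "SELECT driver_id FROM rides.trips JOIN rides.drivers ON trips.driver_id = drivers.id",
    "SELECT count(*) FROM rides.drivers",
]

usage = Counter()
for query in query_log:
    # Naive table extraction after FROM/JOIN; a real extractor parses the SQL properly.
    usage.update(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", query, re.IGNORECASE))

for table, count in usage.items():
    # One usage document per table, assuming Elasticsearch runs locally.
    requests.put(f"http://localhost:9200/table_usage/_doc/{table}",
                 json={"table": table, "usage_count": count})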

      Connecting data with people

As much as we like to claim how technical and digital we all are, the process of finding data consists mainly of interactions with people. And the notion of data ownership is often unclear, which makes the search very time consuming unless you know exactly who to ask.

Amundsen addresses this issue by creating relationships between users and data; tribal knowledge is shared by exposing these relationships.

Lyft currently has three types of relationships between users and data: followed, owned and used. This information helps experienced employees become helpful resources for other employees with a similar job role. Amundsen also makes tribal knowledge easier to find thanks to a link from each user profile to the internal employee directory.
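As an illustration only (this is not Amundsen's data model), those three relationship types can be pictured as simple user-to-table edges, which is already enough to answer "who can I ask about this table?". The names below are made up; in practice such relationships live in a graph store.

from collections import defaultdict

# Hypothetical (user, relationship, table) edges.
edges = [
    ("alice", "owned", "rides.trips"),
    ("bob", "used", "rides.trips"),
    ("carol", "followed", "rides.trips"),
    ("bob", "owned", "rides.drivers"),
]

people_for_table = defaultdict(list)
for user, relation, table in edges:
    people_for_table[table].append((user, relation))

# Expose the "tribal knowledge": who to contact about a given table.
print(people_for_table["rides.trips"])
# [('alice', 'owned'), ('bob', 'used'), ('carol', 'followed')]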

They have also been working on a notifications feature that would allow users to request more information from data owners, such as a missing table description.

      If you’d like more information on Amundsen, please visit their website here.

      What’s next for Lyft

Lyft hopes to keep working with a growing community to enhance the data discovery experience and boost user productivity. Their roadmap currently includes an email notification system, data lineage, a UI/UX redesign, and more!

      The ride sharing company has not had its final word yet!

      Sources:

      Lyft – Statistics & Facts: https://www.statista.com/topics/4919/lyft/
      Lyft And Its Drive Through To Success: https://www.startupstories.in/stories/lyft-and-its-drive-through-to-success
      Lyft Revenue and Usage Statistics (2019): https://www.businessofapps.com/data/lyft-statistics/
      Presto Infrastructure at Lyft: https://eng.lyft.com/presto-infrastructure-at-lyft-b10adb9db01?gi=f100fa852946
      Open Sourcing Amundsen: A Data Discovery And Metadata Platform: https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234
      Amundsen — Lyft’s data discovery & metadata engine: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      Databook: How Uber turns data into exploitable knowledge with metadata

      Databook: How Uber turns data into exploitable knowledge with metadata

      by Zeenea Software | Feb 17, 2020 | Data Inspiration, Metadata Management


      Uber is one of the most fascinating companies to emerge over the past decade. Founded in 2009, Uber grew to become one of the highest valued startup companies in the world! In fact, there is even a term for their success: “uberization” which refers to changing the market for a service by introducing a different way of buying or using it, especially using mobile technology.

From peer-to-peer ride services to restaurant orders, it is clear Uber's platform is data-driven. Data is at the center of Uber's global marketplace, creating better user experiences across their services for customers and empowering employees to be more efficient at their jobs.

However, Big Data by itself wasn't enough; the amount of data generated at Uber requires context to make business decisions. So, as many other unicorn companies did, such as Airbnb with Data Portal, Uber's engineering team built Databook. This internal platform scans, collects and aggregates metadata in order to clarify where data is located in Uber's IS and who is responsible for it. In short, it is a platform designed to transform raw data into contextualized data.

       

      How Uber’s business (and data) grew

      Since 2016, Uber has added new lines of businesses to its platform including Uber Eats and Jump Bikes. Some statistics on Uber include: 

      • 15 million trips a day
      • Over 75 million active riders
      • 18,000 employees since its creation in 2009

      As the firm grew, so did its data and metadata. To ensure that their data & analytics could keep up with their rapid pace of growth, they needed a more powerful system for discovering their relevant datasets. This led to the creation of Databook and its metadata curation.

       

      The coming of Databook

The Databook platform manages rich metadata about Uber's datasets and enables employees across the company to explore, discover, and efficiently use their data. The platform also ensures their data's context isn't lost among the hundreds of thousands of people trying to analyze it. All in all, Databook's metadata empowers engineers, data scientists and IT teams to go from merely visualizing their data to turning it into exploitable knowledge.

       

Databook relies on automated collection to provide a wide variety of frequently refreshed metadata from Hive, MySQL, Cassandra and other internal storage systems. To make this metadata accessible and searchable, Databook offers its consumers a user interface with a Google-like search engine, as well as a RESTful API.
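Uber has not published the Databook API, so the endpoint and fields below are purely hypothetical; the sketch only shows what programmatic access to a metadata service of this kind typically looks like.

import requests

# Hypothetical internal endpoint; Databook's real API is not public.
BASE_URL = "https://databook.example.internal/api/v1"

def get_table_metadata(datastore, table, token):
    """Fetch the metadata document describing one table."""
    response = requests.get(
        f"{BASE_URL}/datastores/{datastore}/tables/{table}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example usage (illustrative): list the owners and columns of a Hive table.
# metadata = get_table_metadata("hive", "trips.rides_daily", token="...")
# print(metadata["owners"], [column["name"] for column in metadata["columns"]])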

       

      Databook’s architecture

Databook's architecture is broken down into three parts: how metadata is collected, how it is stored, and how it is surfaced.

      Conceptually, the Databook architecture was designed to enable four key capabilities:

      • Extensibility: New metadata, storage, and entities are easy to add.
      • Accessibility: Services can access all metadata programmatically.
      • Scalability: Support business user needs and technology novelty
      • Power & speed of execution

      To go further on Databook’s architecture, please read their article https://eng.uber.com/databook/

      What’s next for Databook?

      With Databook, metadata at Uber is now more useful than ever!

But they still hope to develop other capabilities, such as the ability to generate data insights with machine learning models and to create advanced issue detection, prevention, and mitigation mechanisms.

      Sources

      • Databook: Turning Big Data into Knowledge with Metadata at Uber: https://eng.uber.com/databook/
      • How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions: https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb
      • The Story of Uber https://www.investopedia.com/articles/personal-finance/111015/story-uber.asp
      • The definition of uberization, Cambridge dictionary: https://dictionary.cambridge.org/dictionary/english/uberization

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      What is metadata management?

      What is metadata management?

      by Zeenea Software | Jan 27, 2020 | Metadata Management

      “By 2021, organizations will spend twice as much effort in managing metadata compared with 2018 in order to assess the value and risks associated with the data and its use.”

— Gartner, The State of Metadata Management

      The definition of metadata management

      As mentioned in our previous article “The difference between data and metadata”, metadata provides context to your data. And to trust your data’s context, you must understand it. Knowing the who, what, when, where, and why of your data means knowing your metadata, otherwise known as metadata management.

With the arrival of Big Data and the various regulations, data leaders must look further into their data through metadata. Metadata is created whenever data is created, added, deleted, updated or acquired. For example, metadata in an Excel spreadsheet includes the date of creation, the file name, the associated authors, the file size, etc. Metadata can also include titles and comments made in the document.
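As a small illustration, a few lines of Python are enough to read this kind of technical metadata from the file system without opening the document itself (the file name is illustrative):

import datetime
from pathlib import Path

path = Path("quarterly_report.xlsx")  # illustrative file name

if path.exists():
    stats = path.stat()
    # Technical metadata recorded by the file system, independent of the content.
    print("size in bytes:", stats.st_size)
    print("last modified:", datetime.datetime.fromtimestamp(stats.st_mtime))
    print("owner id:", getattr(stats, "st_uid", "n/a"))  # not meaningful on every platform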

In the past, a form of metadata management was looking up a book's call number in a catalog to find its location in a library. Today, metadata management is used in software solutions to comply with data regulations, set up data governance and understand the data's value. This discipline has thus become essential for enterprises!

      Why should you implement a metadata management strategy?

The first use case of metadata management is to facilitate the discovery and understanding of a specific data asset by a person or a program.

This requires setting up a metadata repository, populating it, and generating easy-to-use information from it.

      Here are, among others, benefits of metadata management:

        • A better understanding of the meaning of enterprise’s data assets,
        • More communication on a data’s semantics via a data catalog,
        • Data leaders are more efficient, leading to faster project delivery,
        • The use of data dictionaries and business glossaries allow the identification of synergies and the verification of coherent information,
        • Reinforcement of data documentation (deletions, archives, quality, etc.),
        • Generate audit and information tracks (risk and security for compliance).

      Manage your metadata with Zeenea’s metadata management platform

      With Zeenea, transform your metadata into exploitable knowledge! Our metadata management platform automatically curates and updates your information from your storage systems. It becomes a unique, up-to-date source of knowledge for any data explorer in the enterprise.

      Discover our platform
      Contact us

      What is the difference between Metadata and Data?

      What is the difference between Metadata and Data?

      by Zeenea Software | Sep 3, 2019 | Metadata Management

      “Data is content, and metadata is context. Metadata can be much more revealing than data, especially when collected in the aggregate.” 

      — Bruce Schneier, Data and Goliath

      Definitions of Data and Metadata

      For the majority of people, the concepts of Metadata and Data are unclear. Even though both are a form of data, their uses and specifications are completely different.

      Data is a collection of information such as observations, measurements, facts, and descriptions of certain things. It gives you the ability to discover patterns and trends in all of an enterprise’s data assets.

On the other hand, Metadata, often defined as "data about data", refers to specific details about this data. It provides granular information on a specific piece of data, such as file type, format, origin, date, etc.

      Key differences between data and metadata

The main difference between data and metadata is that data is content: it provides a description, measurement, or report on anything relative to an enterprise's data assets. Metadata, on the other hand, describes relevant information about that data, giving it more context for data users.

Data can be processed or unprocessed, as with raw data (numbers or non-informative characters). Metadata, by contrast, is always considered processed information.

      Finally, some data is informative and some may not be. However, metadata is always informative as it references other data.

       

      Why is metadata important for Data Management?

When data is created, so is metadata (its origin, format, type, etc.). However, this information alone is not enough to properly manage data in this expanding digital era; data managers must invest time in making sure this business asset is properly named, tagged, stored, and archived in a taxonomy that is consistent with all of the other assets in the enterprise. This is what we call "metadata management."

With better metadata management comes better data value. Metadata allows enterprises to ensure greater data quality and discovery, letting data teams better understand their data. Without metadata, enterprises find themselves with datasets without context, and data without context has little value.

      This is why having a proper metadata management solution is critical for enterprises dealing with data. By implementing a metadata management platform, data users are able to discover, understand, and trust in their enterprise’s data assets.

      Are you looking for a metadata management solution?

      Contact us

      The role of metadata in a data-driven strategy

      The role of metadata in a data-driven strategy

      by Zeenea Software | May 29, 2019 | Metadata Management

We are convinced that a company must strike a balance between control and flexibility in the use of data. In short, companies must be able to adopt a data strategy that is both enabling and easy to use, all while minimizing risks.

      We are convinced that such governance is achievable if your collaborators are likewise able to answer these few questions:

      • What data are present in our organization?
      • Are these data sufficiently documented to be understood and mastered by the collaborators in my organization?
      • Where do they come from?
      • Are they secure?
      • What rules or restrictions apply to my data?
      • Who are the people in charge? Who are the “knowers”?
      • Who uses these data? How?
      • How can your collaborators access it?

      These metadata (information about data) become strategic information within enterprises. They describe various technical, operational or business aspects of the data you have.

      By constituting a unified metadata repository, both centralized and accessible, you are guaranteed precise data, which are consistent and understood by the entire enterprise.

      The benefits of a metadata repository

      We bring our experiences to enhance a well-founded governance on metadata management. We are firmly convinced that we cannot govern what we do not know! Thus, to build a metadata repository constitutes a solid working base to start a governance of your data.

It will allow you, among other things, to:

      • Curate your assets;
      • Assign roles and responsibilities on your referenced data;
      • Be completed by your employees in a collaborative manner;
      • Strengthen your regulatory compliance.

Concentrating efforts on metadata and building such a repository is a key characteristic of an agile approach to data governance.

      Download our white paper
      Why start an agile data governance?

      How to become Data Driven according to Charlotte Tilbury

      How to become Data Driven according to Charlotte Tilbury

      by Zeenea Software | May 6, 2019 | Data Inspiration, Metadata Management

Zeenea's participation in the AI & Big Data Global Expo in London on the 25th and 26th of April has officially opened the door to becoming the leading data catalog solution for data-driven enterprises. Zeenea is confident that the core of every company's success is the ability to leverage its data assets, which can be achieved by being a truly data-driven enterprise.

      During this expo, we attended some Big Data Business Solutions conferences that aimed to inform and educate on how data assets are the make-or-break of successful business decisions. A common theme across the board was how Data Science and Business Analytics are an integral component of adding value within enterprises. But how exactly can this be built into an existing company?

      Dr. Andreas Gertsch Grover, the director of Data Science at Charlotte Tilbury shed light on this hot topic in his conference, How small steps get you to the promised land of a data-driven company, by showing us examples of what actually doesn’t work.

      A make-up brand’s own sensational makeover

A UK beauty and makeup brand, Charlotte Tilbury is growing at a rapid rate, with a pre-money valuation of $561.22m. With revenues doubling every year, Charlotte Tilbury is headed towards becoming a unicorn company by the end of 2019 [1]. Aiming to be the best-selling celebrity make-up brand, the company invested in building a Data Science team in an effort to use prediction models to boost their marketing and customer personalization.

Dr. Andreas Gertsch Grover, who is leading the way, explains how Charlotte Tilbury has managed to build a data-driven culture to deliver successful data science projects.

The discrepancy between a company's expectations and a data scientist's role

      “Know the roles you need in the company and not just hire a data scientist,” says Grover. Data Science projects are very complicated and need to involve all employees in the enterprise. To list a few issues data scientists can face when they join a company:

      • There is no Data Science infrastructure.
      • There are loads of data with only some identified areas in need of improvement.
      • Access to data is difficult with no documentation on these data. 

Thus, data scientists are forced to build their own environments and laboriously work on large Data Science projects virtually on their own. And when prediction models are created, they ultimately aren't used, as the company doesn't even know how to apply them to its particular systems!

      So what are the steps that need to be taken to close the gap between a company’s expectations and a data scientist’s role?

      The must dos

      Grover explains that due to the complex nature of Data Science projects, they must start small and be treated iteratively. By doing this, everyone in the company will be able to be involved in the learning process together. Within this collaborative framework, both employees and business stakeholders will be able to understand the business and ask the right questions, which will lead to the next small, successful project.

      The must-haves

Grover stresses the necessity of using tools when researching and developing projects. As data acquisition and exploration can take up an enormous amount of time, investing in tools that expedite the process saves precious time and improves efficiency. Every person should be able to work independently and find the data they need. Meeting this particular need is Zeenea's main goal with its data catalog.

      The promised land of a data-driven company

Understanding and managing a company's expectations is never easy, but if everybody in an enterprise works together, the Promised Land of becoming a data-driven company is attainable. By working in small, iterative steps, employees can learn, collaborate, and deliver major business outcomes that are tried and true.

      Sources

[1] Armstrong, P. (2018, August 13). Here Are the U.K. Companies That Will Be Unicorns in 2019. Retrieved from https://www.forbes.com/

      Google Goods: The management and data democratization tool of Google

      Google Goods: The management and data democratization tool of Google

      by Zeenea Software | Apr 10, 2019 | Data Inspiration

      When you’re called Google, the data issue is more than just central. A colossal amount of information is generated every day throughout the world, by all teams in this American empire. Google Goods, a centralized data catalog, was implemented to cross-reference, prioritize, and unify data.

This article is part of a series dedicated to data-driven enterprises. We highlight successful examples of data democratization and mastery within inspiring companies. You can find the Airbnb example here. These trailblazing enterprises embody the ambition of Zeenea and its data catalog: to help organizations better understand and use their data assets.

      Google in a few figures

      The most-used search engine on the planet doesn’t need any introduction. But what is behind this familiar interface? What does Google represent in terms of market share, infrastructure, employees, and global presence?

      In 2018, Google had [1]:

      • 90.6% market share worldwide
      • 30 million indexed sites
      • 500 million new requests every day

      In terms of infrastructure and employment, Google represented in 2017 [2]:

      • 70,053 employees
      • 21 offices in 11 countries
      • 2 million computers in 60 datacenters
      • 850 terabytes to cache all indexed pages

      Given such a large scale, the amount of data generated is inevitably huge. Faced with the constant redundancy of data and the need for precision for its usage, Google implemented Google Goods, a data catalog working behind the scenes to organize and facilitate data comprehension.

      The insights that led to Google Goods

Google possesses more than 26 billion internal datasets [3]. And this includes only the data accessible to all company employees.

      Taking into account sensitive data that uses secure access, the number could double. This amount of data was bound to generate problems and questions, which Google listed as a reason for designing its tool:

      An enormous data scale

Considering the figure previously mentioned, Google was faced with a problem that couldn't be ignored: the sheer quantity and size of data made it impossible to process all of it. It was therefore essential to determine which datasets are useful and which aren't.

The system already excludes certain information deemed unnecessary and successfully identifies some redundancies. It is therefore possible to create a single access path to data without storing it in different places within the catalog.

      Data variety

      Data sets are stocked in a number of formats and in very different storage systems. This makes it difficult to unify data. For Goods, it is a real challenge with a crucial objective: to provide a consistent way to query and access information without revealing the infrastructure’s complexity.

      Data relevance

Google estimates that 1 million datasets are created and deleted every day. This emphasizes the need to prioritize data and establish its relevance. Some datasets are crucial in processing chains but only have value for a few days; others have a scheduled end of life ranging from several weeks down to a few hours.

      The uncertain nature of metadata

Much of the cataloged data comes from different protocols, making metadata certification complex. Goods therefore proceeds by trial and error, forming hypotheses. This is because it operates on a post hoc basis: collaborators don't have to change the way they work, and are not asked to attach metadata to datasets when they create them. It is up to Goods to collect and analyze the data in order to bring it together and clarify it for future use.

      A priority scale

After working on discovery and cataloging, the question of prioritization arises. The challenge is being able to answer the question: "What makes data important?" Answering it is much less simple for an enterprise's data than for ranking web search results, for example. In an attempt to establish a relevant ranking, Goods relies on the interactions between data, metadata, and other criteria. For instance, the tool considers data more important if its author has associated a description with it, or if several teams consult, use or annotate it.

      Semantic data analysis

This analysis makes it possible, in particular, to better classify and describe the data in the search tool, so it can respond with the right information from the catalog. An example is given in the Google Goods reference article [3]: suppose the schema of a dataset is known and certain fields take integer values. Through inference on the dataset's content, the system can identify that these integers are IDs of known geographical landmarks, and then use this content semantics to improve geographical searches in the tool.
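A toy version of this kind of content-based inference might look like the following; the reference set of landmark IDs, the matching threshold and the column values are all made-up assumptions.

# Hypothetical reference set of known geographical landmark IDs.
LANDMARK_IDS = {48001, 48002, 48003, 51010}

def looks_like_landmark_ids(column_values, threshold=0.8):
    """Guess whether an integer column actually contains landmark IDs."""
    integers = [value for value in column_values if isinstance(value, int)]
    if not integers:
        return False
    matches = sum(1 for value in integers if value in LANDMARK_IDS)
    return matches / len(integers) >= threshold

sample_column = [48001, 48002, 48003, 48001, 51010]
print(looks_like_landmark_ids(sample_column))  # True -> tag the field as geographical in search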

      Google Goods features

      Google Goods catalogs and analyzes the data to present it in a unified manner. The tool collects the basic metadata and tries to enrich them by analyzing a number of parameters. By repeatedly revisiting data and metadata, Goods is able to enrich itself and evolve.

      The main functions offered to users are:

      A search engine

Like the Google we know, Goods offers a keyword search engine for querying datasets. This is where the challenge of data prioritization comes into play: the search engine ranks data according to different criteria, such as the number of processing chains involved or the presence or absence of a description.

      Data presentation page

Each dataset has a page containing as much information as possible. Since some datasets can be linked to thousands of others, Google compresses this information upstream, keeping the most crucial elements so the presentation page remains comprehensible. If the compressed version is still too large, only the most recent entries are kept.

      Team boards

Goods creates boards that gather all the data generated by a team. This makes it possible, for example, to obtain different metrics and to connect with other boards. A board is updated each time Goods adds metadata, and it can easily be embedded in documents so teams can share it.

      In addition, it is also possible to implement monitoring actions and alerts on certain data. Goods is in charge of the verifications and can notify the teams in case of an alert.

      Goods usage by Google employees

Over time, Google's teams came to realize that the tool's usage, as well as its scope, was not necessarily what the company expected.

      Google was thus able to determine that employees’ principal uses and favorite features of Goods were:

      Audit protocol buffers

Protocol Buffers are serialization formats with an interface description language developed by Google. They are widely used at Google for storing and exchanging all kinds of structured information.

      Certain processes contain personal information and are a part of specific privacy policies. The audit of these protocols makes it possible to alert the owners of these data in the event of a breach of confidentiality.

Data retrieval

Engineers generate a lot of data in the course of their tests and often forget where it is stored when they need to access it again. Thanks to the search engine, they can easily find it.

      Understanding legacy code

It isn't easy to find up-to-date information on code or datasets. Goods maintains graphs that engineers can use to track previous code executions, as well as the input and output datasets, and find the logic that links them.

      Utilization of the annotation system

The bookmark system for data pages is fully integrated, making it possible to find important information quickly and share it easily.

      Use of page markers

It's possible to annotate data and attribute different degrees of confidentiality to them, so that others at Google can better understand the data in front of them.

With Goods, Google manages to prioritize and unify data access for all of its teams. The system is meant to be non-intrusive, operating continuously and invisibly for users in order to provide them with organized and explicit data. Thanks to this, the company improves team performance and avoids redundancy; it saves resources and accelerates access to the data essential to the company's growth and development.

      [1] Moderator’s blog: https://www.blogdumoderateur.com/chiffres-google/
      [2] Web Rank Info: https://www.webrankinfo.com/dossiers/google/chiffres-cles
      [3] https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/45390.pdf

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

      Metacat: Netflix makes their Big Data accessible and useful

      Metacat: Netflix makes their Big Data accessible and useful

      by Zeenea Software | Mar 29, 2019 | Data Inspiration

Like many companies, Netflix has a colossal amount of data coming from many different sources in various formats. As the leading streaming video on demand (SVOD) company, data exploitation is, of course, a major strategic asset. Given the diversity of its data sources, the streaming platform wanted a way to federate and interact with these assets using a single tool. This led to the creation of Metacat.

      This article explains the motivations behind the creation of Metacat, a metadata solution intended to facilitate the discovery, treatment, and management of Netflix’s data.

      Read our previous articles on Google and AirBnB.

       

      Netflix’s key figures

Netflix has come a long way since its days as a DVD rental company in the 1990s. Video consumption on Netflix accounts for 15% of global internet traffic. But Netflix today is also:

       

      • 130 million paying subscribers worldwide (400% increase since 2011)
      • $10 billion turnover, including $403 million in profits
      • $100 billion market capitalization, or the sum of all the leading television groups in Europe
      • $6 billion investment in original creations (TV shows and movies).

Netflix also runs a data warehouse of 60 petabytes (60 million billion bytes), which makes exploiting and federating this data a real challenge for the firm.

       

      Netflix’s Big Data platform architecture

       


      Its basic architecture includes three key services. These are the Execution Service (Genie), the Metadata Service (Metacat), and the Event Service (Microbot).

       


Metacat was born of the need to operate across different languages and data sources that are not very compatible with each other. The tool acts as an access layer for data and metadata from Netflix's data sources: a centralized service accessible by any data user, facilitating discovery, processing, and management.

       

      Metacat & its features

Netflix uses query engines such as Hive, Pig, and Spark, which are not interoperable. By introducing a common abstraction layer, Netflix can provide data access to its users regardless of the underlying storage system.

In addition, Metacat even simplifies transferring a dataset from one datastore to another.

       

      Business metadata

Hand-written, user-defined, business-oriented metadata in free format can be added via Metacat. The main information includes the connections, configurations, metrics, and life cycle of each dataset.

      Data discovery

By creating Metacat, Netflix makes it easy for consumers to find business datasets. The tool publishes schemas and user-defined business metadata to Elasticsearch, enabling full-text search across its data sources.
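As an illustration of this pattern, and not Netflix’s actual code, the sketch below indexes a dataset’s schema and business metadata into Elasticsearch and runs a full-text search over it. It assumes a locally running Elasticsearch 8.x cluster and the official Python client; the index name and document fields are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Publish schema and business metadata for one dataset.
es.index(
    index="dataset-metadata",  # hypothetical index name
    id="prodhive/analytics/playback_sessions",
    document={
        "name": "playback_sessions",
        "database": "analytics",
        "schema": ["user_id", "title_id", "watch_time_sec", "device"],
        "description": "One row per playback session, used for engagement reporting.",
        "owner": "data-engineering",
        "tags": ["streaming", "engagement"],
    },
)

# Full-text search across names, descriptions, and tags.
hits = es.search(
    index="dataset-metadata",
    query={"multi_match": {"query": "engagement",
                           "fields": ["name", "description", "tags"]}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```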

      Data modification and audit

As a cross-functional tool spanning all data stores, Metacat records and broadcasts notifications for every change made to the metadata, and to the data itself, in its storage systems.

       

      Metacat and the future of Netflix

According to Netflix, the current version of Metacat is a step towards the new features they are working on. They still want to improve the visualization of their metadata, which would be very useful for data restoration purposes.

Netflix also wants Metacat to evolve towards a plug-in architecture so that the tool can validate and maintain all of its metadata. Because users define business metadata in free form, Netflix needs a validation process that runs before the metadata is stored.

As a centralizing tool for multi-source and multi-format data, Netflix’s Metacat has clearly made progress.

The development of this in-house service, adapted to all the tools used by the company, has allowed Netflix to become data-driven.

       

      Sources

      • Metacat: Making Big Data Discoverable and Meaningful at Netflix https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
      • La folie Netflix en cinq chiffres https://www.lesechos.fr/tech-medias/medias/la-folie-netflix-en-cinq-chiffres-1132022

       

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

Metadata management: a trending topic in the data community

      by Zeenea Software | Mar 28, 2019 | Metadata Management

      On the 4th, 5th and 6th of March, Zeenea had the opportunity to attend the famous Data & Analytics Summit in London organized by Gartner. This is an indispensable and inspiring event for Chief Data Officers and their teams in the implementation of their data strategy.

      This article outlines many concepts from the conference: “Metadata Management is a Must-Have Discipline” by Alan Dayley, Gartner Analyst. This subject has attracted the attention of many C-Levels, confirming that metadata management is a top priority for the years, even months, to come.

       

      The concept of metadata applied to our daily lives

To introduce the concept of metadata, the speaker made an analogy with a situation known to all of us, one that is becoming more and more important in our daily lives: identifying and selecting what we eat.

Take the example of a meal composed of many different, heavily processed ingredients. It’s thanks to the labels, prices, and descriptions on a product’s packaging that consumers are able to identify what they have on their plates.

This information is what we call metadata!

      How do metadata bring value to an enterprise?

Applying metadata to data allows the enterprise to contextualize its data assets. Metadata addresses different subjects, gathered into four categories: Data Trust, Regulations & Privacy, Data Security, and Data Quality.

      The implementation of a metadata management strategy depends on finding the balance between the identified business needs within the company and the regulations associated with data risks.

In other words, where should you invest your time and money? Should you democratize data access for your data teams (data scientists, data engineers, data analysts, or data experts) to increase productivity, or concentrate on the demands of regulations such as the GDPR to avoid a hefty fine?

      The answer to these questions is specific to each enterprise. Nevertheless, Alan Dayley highlights four use cases, identified as top priority cases by CDOs, where metadata management should be the key:

       

      1. Data governance

In this particular use case, the speaker confirms that data governance can no longer be thought of in a “top-down” manner. Data cuts across different teams and profiles with distinct roles and responsibilities. In light of this, everyone must work together to inform and complete their data’s information (its uses, its origin, its processing, etc.). Contextualizing data is fundamental to establishing effective and easy data governance!

       

      2. Risk management and compliance

The requirements below have been enforced since the arrival of the GDPR. Enterprises and their CDOs must:

      • Define the responsibilities linked to their data sets.
      • Map their data sets.
      • Understand and identify the processing operations on the data and associated risks.
      • Have a processing and/or a data lineage register.

      3. Data analysis

By addressing data governance in a more collaborative way and by favoring interactions between data users, the enterprise benefits from collective intelligence and continuous improvement in the understanding and analysis of a dataset. In other words, it means extracting pertinent information from previous discoveries and experiments for the benefit of the next data users.

       

      4. Data value

      In the quest for data monetization, data will have no value, so to speak, unless the information around it is:

      • measured: by its quality, its economic characteristics, etc.
      • managed: the persons in charge, documentation provided, its updates, etc.

       

      How to establish metadata management?

No matter your enterprise’s objectives, you cannot reach them without metadata management. The answer to these questions is indeed metadata!

      Our recommendations to be able to undertake this exercise would be to:

• Find the right sponsor, one who values a metadata-centric approach in the enterprise.
      • Identify the main use case that you want to treat first (as defined above).
      • Check that the efforts made in terms of metadata are not isolated but are centralized and unified.
      • Select a key metadata management solution on the market, such as a data catalog.
      • Define where, who, and how you will start.

      To conclude this article, not having metadata management is like driving on a road with no signs. Be careful not to get lost!

      Start metadata management

What are the different types of metadata?

      by Zeenea Software | Feb 19, 2019 | Data Catalog, Metadata Management

      Dealing with large volumes of data is essential to any organization’s success. But knowing what kind of data it is, where it comes from, and how it can be used is just as important. This is the role of metadata. So how can companies optimize and enhance it? Follow this guide.

      Data is essential to have in-depth knowledge of an organization’s market, industry, customers, or products. But to exploit the full potential of this data, it is essential to focus on its metadata. This data on data is a prerequisite for knowing how to best use it. By having a precise vision of what made it possible to generate the data, at what time, via which source, it is possible to contextualize this information. Metadata is, in a way, structured information that describes, explains, locates, or facilitates access, use, or management of an information source. 

      However, the role of metadata is not limited to understanding the origin of data.

      Properly managed and structured, metadata will also allow organizations to know how to get the most out of the information they have according to the objectives they’ve set.

       

      How is metadata useful?

      Metadata is everywhere. Not just in client files or in website archives. When taking a picture with a smartphone, metadata is instantly attached to the image: date, time, location… All this information can be valuable when wanting to create a virtual photo album for example.
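As a small illustration of this everyday metadata, the snippet below reads the EXIF tags a smartphone embeds in a photo. It assumes the Pillow library is installed, and the file path is just an example.

```python
from PIL import Image, ExifTags

# Read the metadata a smartphone attaches to a photo (date, device, orientation, ...).
image = Image.open("holiday_photo.jpg")  # example path
exif = image.getexif()

for tag_id, value in exif.items():
    tag_name = ExifTags.TAGS.get(tag_id, tag_id)  # map numeric tag IDs to readable names
    print(f"{tag_name}: {value}")
```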

      It’s the same in the context of a company’s data project!

      While metadata is necessary to truly understand where data comes from and how it can be used, it is not the only thing it is used for. In fact, when properly managed, metadata is a major lever for organizations seeking to structure and enhance their information on a daily basis. Optimal metadata management is therefore the foundation of a data-driven transformation project.

      The different types of metadata

While the generic term metadata designates the information describing data, it is important to know that metadata can be classified into different types.

      Thus, it is important to distinguish between descriptive metadata, which presents a resource in a way that facilitates the identification of the available data, and structural metadata. The latter provides information on the composition or organization of a data resource. To describe a data portfolio, there is also administrative metadata, which provides information on the date of creation or acquisition of the data, but also on its associated permissions, lifespan, and use. 

Alongside this generic metadata is a wide range of other types. They provide context on the application and business uses of the information, cover its technical aspects, or reinforce its descriptive dimension.

The larger the volume of data and the more varied its acquisition and collection sources, the more a company will benefit from fine-tuned metadata management.

      What tools manage metadata?

      To organize and optimize metadata use for all employees, it is essential to use a Data Catalog. Through this metadata management solution, organizations are able to index their data and metadata as well as quickly identify the sources of information that are available to data teams. But a Data Catalog’s mission goes even further: it will enable companies to reference all their data assets, facilitate data access when needed, and perform thematic searches.

      Indeed, the quality of this metadata conditions the quality of a data description, with a direct impact on its visibility and ease of use. 

      At Zeenea, we’ve identified three types of metadata within our data catalog:

• Technical metadata: describes the structure of a dataset and the information related to storage systems.

• Business metadata: applies business context to datasets, including descriptions (context and usage), owners and referents, tags, and properties, in order to create a taxonomy over the datasets that will be indexed by our search engine. Business metadata is also present at the schema level of a dataset: descriptions, tags, or the data confidentiality level per column.

• Operational metadata: explains when and how the data was created or transformed, including statistical analysis of the data, date of update, origin (lineage), volume, cardinality, the identifier of the processes that created or transformed the data, the status of those processes, etc.
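To make these three categories concrete, here is a minimal Python sketch of how a catalog entry might be modeled. The class and field names are illustrative assumptions and do not reflect Zeenea’s internal data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class TechnicalMetadata:
    """Structure of the dataset and storage-related information."""
    columns: Dict[str, str]      # column name -> type
    storage_system: str          # e.g. "PostgreSQL", "S3"
    physical_location: str


@dataclass
class BusinessMetadata:
    """Business context: descriptions, owners, tags, and properties."""
    description: str
    owners: List[str]
    tags: List[str] = field(default_factory=list)
    confidentiality_by_column: Dict[str, str] = field(default_factory=dict)


@dataclass
class OperationalMetadata:
    """When and how the data was created or transformed."""
    last_updated: datetime
    row_count: int
    lineage_upstream: List[str] = field(default_factory=list)
    producing_process: Optional[str] = None


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata
```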

Data Portal, the data-centric tool of AirBnB

      by Zeenea Software | Feb 18, 2019 | Data Inspiration, Metadata Management

AirBnB is a burgeoning enterprise. To keep pace with their rapid expansion, AirBnB needed to think seriously about data and about scaling its operations. The Data Portal, a fully data-centric tool at the disposal of employees, was born from this growing momentum.

This article is the first of a series dedicated to data-centric enterprises. We will shed light on successful examples of the democratization and mastery of data within inspiring organizations. These pioneering enterprises demonstrate the ambition of Zeenea’s data catalog: to help each organization better understand and use its data assets.

      Airbnb today:

      In a few years, AirBnB has secured their position as a leader of the collaborative economy around the world. Today they are among the top hoteliers on the planet. In numbers [1], they represent:

      • 3 million recorded homes,
      • 65,000 registered cities,
      • 190 countries with AirBnB offers,
      • 150 million users.

France is its second-largest market behind the United States; it alone accounts for more than 300,000 homes.

      The reflections that led to the Data Portal

During a conference held in May 2017, John Bodley, a data engineer at AirBnB, outlined the new issues arising from the rapid growth in employees (more than 3,500) and the massive increase in the amount of data coming from both users and employees (more than 200,000 tables in their data warehouse). The result is a confusing and fragmented landscape that doesn’t always allow access to increasingly important information.

How can success be reconciled with a very real data management problem? What should be done with all the information collected daily, and with the knowledge held at both the user and employee level? How can they be turned into a strength for all AirBnB employees?

These are the questions that led to the creation of the Data Portal.

Beyond these challenges, the company also faced a problem of overall vision.

Since its creation in 2008, AirBnB has always paid great attention to its data and its operations. This is why a dedicated team took up the challenge of developing a tool that democratizes data access within the enterprise. Their work drew both on analysts’ knowledge and their ability to identify critical points, and on engineers who offered a more concrete vision of the whole. At the heart of the project, an in-depth survey of employees and their problems was conducted.

From this survey, one constant emerged: the difficulty of finding the information employees need in order to work. The presence of tribal knowledge, held by a small group of people, is both counter-productive and unreliable.

The result: employees have to ask colleagues questions, trust in the information is low (data validity is uncertain, and it is impossible to know whether the data is up to date), and consequently new but duplicate data is created, astronomically increasing the already existing volume.

      To respond to these challenges, AirBnB created the Data Portal and released it to the public in 2017.

      Data Portal, Airbnb’s data catalog

      To give you a clear picture, the Data Portal could be defined as a cross between a search engine and a social network.

It was designed to centralize absolutely all of the data coming into the enterprise, whether it comes from employees or users. The goal of the Data Portal is to return this information, in graphic form, to whichever employee needs it.

This self-service system allows employees to access, on their own, the information needed for their projects. Beyond the data itself, the Data Portal provides contextualized metadata. The information comes with a background that makes it easier to derive value from the data and to understand it as a whole.

The Data Portal was designed with a collaborative approach in mind.

It helps you visualize, within the data, all the interactions between the different employees of the enterprise. Thus, it is possible to know who is connected to which data.

      The Data Portal and a few of its features

The Data Portal offers different features for accessing data in a simple and fun way, giving the user an optimal experience. There are pages dedicated to each dataset, along with a significant amount of metadata linked to it.

       

• Search: Chris Williams, an engineer and a member of the team in charge of developing the tool, speaks of a “Google-esque” feature. The search page gives quick access to data, to charts, and to the people, groups, or teams relevant to the data.

• Collaboration: In a sharing-oriented, collaborative approach, data can be added to a user’s favorites, pinned to a team’s board, or shared via an external link. Just like on a social network, each employee also has a profile page. Because the tool is accessible to all employees and intended to be completely transparent, it includes every member of the hierarchy, and former employees keep a profile listing all the data they created and used, always in a logic of breaking down information silos and doing away with tribal knowledge.

• Lineage: It is also possible to explore a data asset’s hierarchy by viewing both its parent and child data (a minimal sketch of this idea follows this list).

• Groups: Teams spend a lot of time exchanging around the same data. To enable members to share information more quickly and easily, the ability to create working groups was added to the Data Portal. Thanks to these pages, a team’s members can organize their data, access it easily, and encourage sharing.
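The lineage feature in particular lends itself to a small illustration. The sketch below is not AirBnB’s implementation (the conference material describes a graph database behind the Data Portal); it is a minimal, dict-based Python sketch of parent/child exploration, with invented dataset names.

```python
# Minimal, illustrative lineage view: each dataset maps to its direct parents (upstream sources).
parents = {
    "bookings_daily": ["bookings_raw", "listings"],
    "revenue_report": ["bookings_daily", "payments"],
}

def children_of(dataset: str) -> list[str]:
    """Child datasets are those that list `dataset` among their parents."""
    return [child for child, ups in parents.items() if dataset in ups]

print("Parents of revenue_report:", parents.get("revenue_report", []))
print("Children of bookings_daily:", children_of("bookings_daily"))
```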

      Within the tool

       

Democratizing data has several virtues. First, it avoids creating dependence on information gatekeepers: a system in which information and the understanding of data are held by only one group of people weakens the enterprise’s equilibrium, because the dependency on that group becomes too high.

In addition, it is important to simplify the understanding of data so that employees can make better use of it.

Overall, the challenge for AirBnB is also to improve trust in data for all of their employees, so that everyone can be sure they are working with correct, up-to-date information.

      AirBnB is no fool and the team behind the Data Portal knows that the handling of this tool and its wise utilization will take time. Chris Williams put it this way: “Even if asking a colleague for information is easy, it is totally counterproductive on a larger scale.”

Changing these habits, and taking the first step of consulting the portal rather than asking a colleague directly, will require a little effort from employees.

      The vision of the Data Portal over time

      To promote trust in the supplied data, the team wants to create a system of data certification. It would make it possible to certify both the data and the person who initiated the certification. Certified content will be highlighted in the search results.

      Over time, AirBnB hopes to develop this tool at different levels:

• Analyze the network in order to identify obsolete data.

• Create alerts and recommendations. Keeping an exploratory approach, the tool could become more intuitive, suggesting new content or updates on the data a user has accessed.

• Make data enjoyable. Create an appealing setting for employees by presenting, for example, the most viewed chart of the month.

      With the Data Portal, AirBnB pushes the use of data to the highest level.

Democratizing data for all employees makes them more autonomous and efficient in their work, and also reshapes the enterprise’s hierarchy. With more transparency, the company also becomes less dependent on any single group: collaboration takes precedence over the notion of dedicated services. The use of data reinforces the enterprise’s strategy for its future development, a logical approach that AirBnB both applies internally and promotes among its customers.

      Sources

[1] https://www.usine-digitale.fr/article/le-succes-insolent-d-airbnb-en-5-chiffres-cles.N512814
[2] Slides from the “Democratizing Data at AirBnB” conference of May 11, 2017: https://www.slideshare.net/neo4j/graphconnect-europe-2017-democratizing-data-at-airbnb
Democratizing Data at Airbnb (Airbnb Engineering): https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
Airbnb capitalizes on nearly decade-long push to democratize data: https://searchcio.techtarget.com/feature/Airbnb-capitalizes-on-nearly-decade-long-push-to-democratize-data

      Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

      Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

      download our white paper

Data mapping: The challenges in an organization

      by Zeenea Software | Jul 3, 2018 | Data Catalog, Data governance, Metadata Management

The arrival of Big Data did not simplify how enterprises work with data. The volume and variety of data, and the number of data storage systems, are exploding.

To illustrate this, Matt Turck publishes what is known as the Big Data Landscape. Updated every year, this infographic shows the key players in the various sub-domains of the Big Data ecosystem.

      Thus, with the Big Data revolution, it is even more difficult to answer “primary” questions related to data mapping:

      • What are the most pertinent datasets and tables for my use cases and my organization?

      • Do I have sensitive data? How are they used?

      • Where does my data come from? How have they been transformed?

      • What will be the impacts on my datasets if they are transformed?

      >> Download our toolkit: Metamodel template <<

These are the questions that information systems managers, Data Lab managers, Data Analysts, and even Data Scientists ask themselves in order to deliver efficient and pertinent data analyses.

Among other things, answering these questions allows enterprises to:

      • Improve data quality: Providing as much information as possible allows users to know if the data is suitable for use.

• Comply with European regulations (GDPR): tag personal data and the processing operations carried out on it.

      • Make employees more efficient and autonomous in understanding data through graphical and ergonomic data mapping.

      To put these into action, companies must build what is called data lineage.
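As a hedged illustration of what such a lineage record can look like in practice, the minimal Python sketch below registers transformation steps and answers two of the questions above: where a dataset comes from, and which datasets are impacted if a source is transformed. The dataset and process names are invented.

```python
from collections import defaultdict

# Each record: a process reads source datasets and produces a target dataset.
lineage = [
    {"process": "clean_customers", "sources": ["crm_export"], "target": "customers_clean"},
    {"process": "build_orders_mart", "sources": ["customers_clean", "orders_raw"], "target": "orders_mart"},
]

downstream = defaultdict(list)
upstream = defaultdict(list)
for record in lineage:
    for source in record["sources"]:
        downstream[source].append(record["target"])
    upstream[record["target"]].extend(record["sources"])

def impacted_by(dataset: str) -> set[str]:
    """All datasets that would be affected if `dataset` is transformed."""
    impacted, to_visit = set(), [dataset]
    while to_visit:
        for child in downstream[to_visit.pop()]:
            if child not in impacted:
                impacted.add(child)
                to_visit.append(child)
    return impacted

print("orders_mart comes from:", upstream["orders_mart"])
print("Impacted if crm_export changes:", impacted_by("crm_export"))
```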
