Machine Learning Data Catalogs: good but not good enough!


How can you benefit from a Machine Learning Data Catalog?

You can use Machine Learning Data Catalogs (MLDCs) to interpret data, accelerate the use of data in your organization, and link data to business results. 

We have provided real-world examples of the smart features of a data catalog in our previous articles.

It is clear that this intelligence is a cornerstone of choosing the right data cataloguing solution. In fact, Forrester highlights exactly that in their latest report, “Now Tech: Machine Learning Data Catalogs, Q4 2020.”

In this document, they cite Zeenea Data Catalog as one of the key Machine Learning Data Catalog vendors on the market! However, as data professionals, you are aware that the “intelligent” aspect of a data catalog is a good start, but not enough for you to achieve your data democratization mission.

 

Machine Learning Data Catalog vs Smart Data Catalogs: what’s the difference?

The term “smart data catalog” has become a buzzword over the past few months. However, when something is described as “smart,” most people automatically think, understandably, of a data catalog with Machine Learning capabilities alone.

We at Zeenea do not believe that a smart data catalog is reduced to having ML features alone! In fact, there are different ways to be “smart.” We like to refer to machine learning as one aspect, among others, of a Smart Data Catalog.

The 5 pillars of a smart data catalog can be found in its:

  • Design: the way users explore the catalog and consume information,
  • User experience: how it adapts to different user profiles,
  • Inventory: provides an intelligent and automatic way to inventory assets,
  • Search engine: meets different expectations and gives intelligent suggestions, 
  • Metadata management: a catalog that marks up and links data together using ML features.

This conviction is detailed in our article “A smart data catalog, a must-have for data leaders,” which Guillaume Bodet, CEO of Zeenea, also presented last September at the Data Innovation Summit 2020.

What is a knowledge graph and how can it empower data catalog capabilities?


We have been interacting with knowledge graphs for quite some time, whether through personalized shopping recommendations on websites such as Amazon and Zalando, or through our favorite search engine, Google.

However, this concept is still a challenge for most data and analytics managers, who struggle to aggregate and link their business assets in order to take advantage of them as these web giants do.

In fact, to support this claim, Gartner stated in their article “How to Build Knowledge Graphs That Enable AI-Driven Enterprise Applications” that:

“Data and analytics leaders are encountering increased hype around knowledge graphs, but struggle to find meaningful use cases that can secure business buy-in.”

In this article, we will define the concept of a knowledge graph, illustrate it with the example of Google, and highlight how it can empower a data catalog.

 

What is a knowledge graph exactly?

According to GitHub, a knowledge graph is a type of ontology that depicts knowledge in terms of entities and their relationships in a dynamic, data-driven way, contrary to static ontologies, which are very hard to maintain.

 Here are other definitions of a knowledge graph by various experts: 

  • A “means of storing and using data, which allows people and machines to better tap into the connections in their datasets.” (Datanami)
  • A “database which stores information in a graphical format – and, importantly, can be used to generate a graphical representation of the relationships between any of its data points.” (Forbes)
  • “Encyclopedias of the Semantic World.” (Forbes)

Through machine learning algorithms, a knowledge graph provides structure for all your data and enables the creation of multilateral relationships across your data sources. This structure grows more fluid as new data is introduced, allowing more relationships to be created and more context to be added, which helps your data teams make informed decisions through connections you may never have found otherwise.

The idea of a knowledge graph is to build a network of objects, and more importantly, create semantic or functional relationships between the different assets. 

Within a data catalog, a knowledge graph is therefore what represents different concepts and what links objects together through semantic or static links.
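
To make this concrete, here is a minimal Python sketch of such a graph: assets and business concepts are nodes, and typed edges carry the semantic or functional relationships described above. All identifiers and relation names are invented for illustration; this is not Zeenea’s actual model.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy knowledge graph: nodes linked by typed, navigable relations."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> list of (relation, node)

    def link(self, source, relation, target):
        # Store both directions so the graph can be navigated either way.
        self.edges[source].append((relation, target))
        self.edges[target].append((f"inverse:{relation}", source))

    def neighbors(self, node, relation=None):
        return [t for r, t in self.edges[node] if relation in (None, r)]

kg = KnowledgeGraph()
kg.link("dataset:customers", "described_by", "term:customer")
kg.link("dataset:customers", "feeds", "dashboard:churn_report")
kg.link("dataset:customers", "owned_by", "person:jane.doe")

# Every object one hop away from the dataset, then inverse navigation.
print(kg.neighbors("dataset:customers"))
print(kg.neighbors("term:customer"))
```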

Google example 

Google’s algorithm uses this system to gather and provide end users with information relevant to their queries.

Google’s knowledge graph contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects.

Their knowledge graph enhances Google Search in three main ways: 

  • Find the right thing: Search not only based on keywords but on their meanings.
  • Get the best summary: Collect the most relevant information from various sources based on the intent.
  • Go deeper and broader: Discover more than you expected thanks to relevant suggestions. 

How do knowledge graphs empower data catalog usages?

Embedded in a data catalog, knowledge graphs can benefit your enterprise’s data strategy through:

Rich and in-depth search results

Today, many search engines use multiple knowledge graphs in order to go beyond basic keyword-based searching. Knowledge graphs allow these search engines to understand concepts, entities and the relationships between them. Benefits include:

  • The ability to provide deeper and more relevant results, including facts and relationships, rather than just documents,

  • The ability to form searches as questions or sentences — rather than a list of words,

  • The ability to understand complex searches that refer to knowledge found in multiple items using the relationships defined in the graph.

Optimized data discovery

Enterprise data moves from one location to another at the speed of light and is stored in various data sources and storage applications. Employees and partners access this data from anywhere, at any time, so identifying, locating, and classifying your data in order to protect it and gain insights from it should be a priority!

The benefits of knowledge graphs for data discovery include:

  • A better understanding of enterprise data, where it is, who can access it and where, and how it will be transmitted,
  • Automatic data classification based on context,
  • Risk management and regulatory compliance,
  • Complete data visibility,
  • Identification, classification, and tracking of sensitive data,
  • The ability to apply protective controls to data in real time based on predefined policies and contextual factors,
  • The ability to adequately assess the full data picture.

On one hand, this helps implement the appropriate security measures to prevent the loss of sensitive data and avoid devastating financial and reputational consequences for the enterprise. On the other, it enables teams to dig deeper into the data’s context and identify the specific items that answer their questions.

Smart recommendations

As mentioned in the introduction, recommendation services are now a familiar component of many online stores, personal assistants and digital platforms.

These recommendations need to take a content-based approach. Within a data catalog, machine learning capabilities combined with a knowledge graph can detect certain types of data, apply tags, or apply statistical rules to data in order to run effective, smart asset suggestions.

This capacity is also known as data pattern recognition. It refers to being able to identify similar assets and rely on statistical algorithms and ML capabilities that are derived from other pattern recognition systems.

This data pattern recognition system helps data stewards maintain their metadata:

  • Identify duplicates and copy metadata,
  • Detect logical data types (emails, cities, addresses, and so on; see the sketch after this list),
  • Suggest attribute values (recognize documentation patterns to apply to a similar object or a new one),
  • Suggest links (semantic or lineage),
  • Detect potential errors to help improve the catalog’s quality and relevance.
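
As a toy illustration of logical-type detection, here is a hedged sketch based on simple regular expressions. Real catalogs combine such rules with statistical profiling and ML models; every detector below is an assumption made for the example.

```python
import re

# Hypothetical detectors mapping a logical type to a pattern over values.
DETECTORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def detect_logical_type(sample, threshold=0.9):
    """Return the logical type matching at least `threshold` of the sample."""
    for type_name, pattern in DETECTORS.items():
        if sample:
            hits = sum(1 for value in sample if pattern.match(str(value)))
            if hits / len(sample) >= threshold:
                return type_name
    return None

print(detect_logical_type(["jane@corp.com", "bob@example.org"]))  # email
```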

The idea is to use techniques derived from the content-based recommendations found in general-purpose catalogs. When the user has found something, the catalog suggests alternatives based both on their profile and on pattern recognition.

Some data catalog use cases empowered by knowledge graphs

  • Gathering assets that have been used or related to causes of failure in digital projects.
  • Finding assets with particular interests aligned with new products for the marketing department.
  • Generating complete 360° views of people and companies in the sales department.
  • Matching enterprise needs to people and projects for HRs.
  • Finding regulations relating to specific contracts and investments assets in the finance department.

Conclusion

With the never-ending increase of data in enterprises, organizing your information without a strategy means being unable to stay competitive and relevant in the digital age. Ensuring that your data catalog has an enterprise knowledge graph is essential for avoiding the dreaded “black box” effect.

Through a knowledge graph in combination with AI and machine learning algorithms, your data will have more context and will enable you to not only discover deeper and more subtle patterns but also to make smarter decisions. 

For more insights on what a knowledge graph is, here is a great article by Gartner analyst Timm Grosser: “Linked Data for Analytics?”

Start your data catalog journey with Zeenea

Zeenea is a 100% cloud-based solution, available anywhere in the world in just a few clicks. By choosing Zeenea Data Catalog, you control the costs associated with implementing and maintaining a data catalog while simplifying access for your teams.

Its automatic feeding mechanisms, as well as its suggestion and correction algorithms, reduce the overall costs of a catalog and guarantee your data teams quality information in record time.

A smart data catalog, a must-have for data leaders


The term “smart data catalog” has become a buzzword over the past few months. However, when something is described as “smart,” most people automatically think, understandably, of a data catalog with Machine Learning capabilities alone.

We at Zeenea, do not believe that a smart data catalog is reduced to only having ML features!

In fact, there are many different ways to be “smart”. 

This article focuses on the conference that Guillaume Bodet, co-founder and CEO of Zeenea, gave at the Data Innovation Summit 2020: “Smart Data Catalogs, A must-have for leaders”.

A quick definition of data catalog

We define a data catalog as being:

A detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose.

A data catalog is meant to serve different people, or end-users. All of these end-users have different expectations, needs, profiles, and ways to understand data. They include data analysts, data stewards, data scientists, business analysts, and many more. As more and more people use and work with data, a data catalog must be smart for all of its end-users.

Click here for a more in-depth article on what a data catalog is.

What does a “data asset” refer to?

An asset, financially speaking, typically appears on the balance sheet with an estimate of its value. Data assets are just as important as, and in some cases more important than, other enterprise assets. The issue is that the value of data assets isn’t always known.

However, there are many ways to tap into the value of your data. Enterprises can use their data’s value directly, for example by selling or trading it. Many organizations do this: they clean the data, structure it, and then sell it.

Enterprises can also derive value from their data indirectly. Data assets enable organizations to:

  • Innovate for new products/services
  • Improve overall performance
  • Improve product positioning
  • Better understand markets/customers
  • Increase operational efficiency

High-performing enterprises are those that master their data landscape and exploit their data assets in every aspect of their activity.

The hard things about data catalogs…

When your enterprise deals with data at scale, you are usually dealing with:

  • 100s of systems that store internal data (data warehouses, applications, data lakes, datastores, APIs, etc) as well as external data from partners.
  • 1,000s of datasets, models, and visualizations (data assets) that are composed of thousands of fields.
  • And these fields contain millions of attributes (or metadata)!

Not to mention the hundreds of users using them…

This raises two different questions.  

How can I build, maintain, and enforce the quality of my information for my end-users to trust in my catalog?

How can I quickly find data assets for specific use cases?

The answer is in smart data catalogs!

At Zeenea, we believe there are five core areas of “smartness” for a data catalog. It must be smart in its:

  • Design: the way users explore the catalog and consume information,
  • User experience: how it adapts to different profiles,
  • Inventories: provides a smart and automatic way of inventorying,
  • Search engine: supports the different expectations and gives smart suggestions,
  • Metadata management: a catalog that tags and links data together through ML features. 

Let’s go into detail for each of these areas.

A smart design

Knowledge graph

A data catalog with smart design uses knowledge graphs rather than static ontologies (a way to classify information, most of the time built as a hierarchy).  The problem with ontologies is that they are very hard to build and maintain, and usually only certain types of profiles truly understand the various classifications.

A knowledge graph on the other hand, is what represents different concepts in a data catalog and what links objects together through semantic or static links. The idea of a knowledge graph is to build a network of objects, and more importantly, create semantic or functional relationships between the different assets in your catalog.

Basically, a smart data catalog provides users with a way to find and understand related objects.

Adaptive metamodels

In a data catalog, users will find hundreds of different properties, many of which aren’t relevant to every user. Typically, two types of information are managed:

  1. Entities: plain objects, glossary entries, definitions, models, policies, descriptions, etc.
  2. Properties: the attributes that you put on the entities (any additional information such as create date, last updated date, etc.)

The design of the metamodel must serve the data consumer. It needs to be adapted to new business cases and must be simple enough to manage for users to maintain and understand it. Bonus points if it is easy to create new types of objects and sets of attributes!
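
As an illustration, here is a minimal Python sketch of such an adaptive metamodel, following the entity/property split described above; the class and field names are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyDef:
    """An attribute attached to an entity type, e.g. a creation date."""
    name: str
    value_type: str  # "text", "date", "person", ...

@dataclass
class EntityType:
    """A catalog object type: dataset, glossary entry, policy, ..."""
    name: str
    properties: list = field(default_factory=list)

    def add_property(self, prop: PropertyDef):
        # New business cases extend the metamodel without breaking it.
        self.properties.append(prop)

dataset = EntityType("dataset", [PropertyDef("created", "date")])
dataset.add_property(PropertyDef("owner", "person"))
print([p.name for p in dataset.properties])  # ['created', 'owner']
```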

 

Semantic attributes

Most of the time, in a data catalog, the metamodel’s attributes are technical properties. Attributes on an object include generic types such as text, number, date, list of values, and so on. While this information is necessary, it is not sufficient, because it carries nothing about the semantics, or meaning, of the attribute. Semantics matter because, with that information, the catalog can adapt the visualization of the attribute and improve suggestions to users.
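
A hedged sketch of the point above: once an attribute carries a semantic type rather than just a technical one, the catalog can choose how to display it. The widget names here are purely illustrative.

```python
# Hypothetical mapping from semantic type to display widget.
WIDGETS = {
    "email": "mailto-link",
    "date": "calendar-badge",
    "person": "profile-card",
}

def render_attribute(semantic_type, value):
    # Fall back to plain text when the semantics are unknown.
    widget = WIDGETS.get(semantic_type, "plain-text")
    return f"<{widget}>{value}</{widget}>"

print(render_attribute("email", "jane.doe@example.com"))
print(render_attribute("text", "A free-form description"))
```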

In conclusion, there is no one-size-fits-all design for a data catalog, and its design must evolve over time to support new data areas and use cases.


A smart user experience

As stated above, a data catalog holds a lot of information, and end-users often struggle to find the information of interest to them. Expectations differ between profiles! A data scientist will expect statistical information, whereas a compliance officer expects information on various regulatory policies.

With a smart and adaptive user experience, a data catalog presents the most relevant information to each end-user. Information hierarchy and adjusted search results in a smart data catalog are based on:

  • Static preferences: already known in the data catalog if the profile is more focused on data science, IT, etc.
  • Dynamic profiling: to learn what the end-user usually searches, their interests, and how they’ve used the catalog in the past.

A smart inventory system

A data catalog’s adoption is built on trust, and trust can only come if its content is accurate. As the data landscape moves at a fast pace, the catalog must be connected to operational systems to maintain a first level of metadata on your data assets.

The catalog must synchronize its content with the actual content of the operational systems.

A catalog’s typical architecture includes scanners that scan your operational systems and collect and synchronize information from various sources (Big Data, NoSQL, Cloud, Data Warehouse, etc.). The idea is to have universal connectivity, so enterprises can scan any type of system automatically and set the results in the knowledge graph.

In Zeenea, an automation layer brings information back from the systems to the catalog. It can (see the sketch after this list):

  • Update assets to reflect physical changes
  • Detect deleted or moved assets
  • Resolve links between objects
  • Apply rules to select the appropriate set of attributes and define attribute values 
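
Here is a hedged sketch of what such a reconciliation step might look like; the function and key names are invented for illustration and are not Zeenea’s actual API.

```python
def synchronize(catalog_assets, scanned_assets):
    """Compare the catalog inventory with the latest scan of a source.

    Both arguments map a stable asset key to its current schema.
    """
    catalog_keys, scanned_keys = set(catalog_assets), set(scanned_assets)

    for key in scanned_keys - catalog_keys:
        print(f"NEW asset to inventory: {key}")
    for key in catalog_keys - scanned_keys:
        print(f"DELETED or MOVED asset: {key}")
    for key in catalog_keys & scanned_keys:
        if catalog_assets[key] != scanned_assets[key]:
            print(f"CHANGED asset, refreshing metadata: {key}")

synchronize(
    {"warehouse.sales": ["id", "amount"]},
    {"warehouse.sales": ["id", "amount", "currency"], "lake.events": ["ts"]},
)
```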

A smart search engine

In a data catalog, the search engine is one of the most important features. We distinguish between two kinds of searches:

  • High intent search: the end-user already knows what they are looking for and has precise information for their query. They either already have the name of the dataset or already know where it is found. High intent searches are commonly used by more data-savvy people.
  • Low intent search: the end-user isn’t exactly sure what they are looking for, but want to discover what they could use for their context. Searches are made through keywords and users expect the most relevant results to appear. 

 A smart data catalog must support both types of searches!

It must also provide smart filtering, a necessary complement to the user’s search experience (especially for low intent searches), allowing them to narrow their search results by excluding attributes that aren’t relevant. Just like Google, Booking.com, and Amazon, the filtering options must be adapted to the content of the search and to the user’s profile in order for the most pertinent results to appear.
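
As a toy example, the sketch below ranks low intent keyword searches using the query, the user’s interests, and popularity; the asset fields and the ranking order are assumptions made for illustration.

```python
ASSETS = [
    {"name": "sales_2020", "tags": {"sales", "finance"}, "popularity": 42},
    {"name": "crm_contacts", "tags": {"marketing"}, "popularity": 17},
]

def search(keyword, user_interests, assets=ASSETS):
    """Rank by keyword match first, then profile affinity, then popularity."""
    def score(asset):
        keyword_match = keyword.lower() in asset["name"].lower()
        affinity = len(asset["tags"] & user_interests)
        return (keyword_match, affinity, asset["popularity"])
    return sorted(assets, key=score, reverse=True)

for asset in search("sales", user_interests={"finance"}):
    print(asset["name"])
```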

Smart metadata management

Smart metadata management is usually what we call the “augmented data catalog”: a catalog with machine learning capabilities that enable it to detect certain types of data, apply tags, or apply statistical rules to data.

A way to make metadata management smart is to apply data pattern recognition. Data pattern recognition refers to being able to identify similar assets and rely on statistical algorithms and ML capabilities that are derived from other pattern recognition systems.

This data pattern recognition system helps data stewards maintain their metadata:

  • Identify duplicates and copy metadata (see the sketch after this list),
  • Detect logical data types (emails, cities, addresses, and so on),
  • Suggest attribute values (recognize documentation patterns to apply to a similar object or a new one),
  • Suggest links (semantic or lineage),
  • Detect potential errors to help improve the catalog’s quality and relevance.
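
For the duplicate-detection item, here is a much-simplified, hedged sketch of a “fingerprint”: a signature derived from a dataset’s sorted column names and sample size, so that two physically distinct copies collide. A real system would also profile the values statistically; everything below is illustrative.

```python
import hashlib

def fingerprint(columns, sample_rows):
    """Hash a crude signature of a dataset's shape."""
    material = "|".join(sorted(columns)) + f"|{len(sample_rows)}"
    return hashlib.sha256(material.encode()).hexdigest()[:12]

a = fingerprint(["id", "email"], [(1, "a@x.io"), (2, "b@y.io")])
b = fingerprint(["email", "id"], [(1, "a@x.io"), (2, "b@y.io")])
print(a == b)  # True: likely duplicates, suggest copying their metadata
```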

It also helps data consumers find their assets. The idea is to use techniques derived from the content-based recommendations found in general-purpose catalogs. When the user has found something, the catalog suggests alternatives based both on their profile and on pattern recognition.

Start your data catalog journey with Zeenea

Zeenea is a 100% cloud-based solution, available anywhere in the world in just a few clicks. By choosing Zeenea Data Catalog, you control the costs associated with implementing and maintaining a data catalog while simplifying access for your teams.

Its automatic feeding mechanisms, as well as its suggestion and correction algorithms, reduce the overall costs of a catalog and guarantee your data teams quality information in record time.

Data science: accelerate your data lake initiatives with metadata


Data lakes offer unlimited storage for data and present many potential benefits for data scientists in the exploration and creation of new analytical models. However, structured, semi-structured, and unstructured data are mashed together there, and the business insights they contain are often overlooked or misunderstood by data users.

The reason for this is that many technologies used to implement data lakes lack the necessary information capabilities that organizations usually take for granted. It is therefore necessary for these enterprises to manage their data lakes by putting in place effective metadata management which considers metadata discovery, data cataloguing, and overall enterprise metadata management applied to the company’s data lake.

2020 is the year that most data and analytics use cases will require connecting to distributed data sources, leading enterprises to double their investments in metadata management. – Gartner 2019.

How to leverage your data lake with metadata management

To get value from your data lake, it is essential for companies to have both skilled users (such as data scientists or citizen data scientists) and effective metadata management for their data science initiatives. To begin with, an organization could focus on a specific dataset and its related metadata. Then, leverage this metadata as more data is added into the data lake. Setting up metadata management can make it easier for data lake users to initiate this task.

Here are the areas of focus for successful metadata management in your data lake:

 

Creating a metadata repository

Semantic tagging is essential for discovering enterprise metadata. Metadata discovery is defined as the process of using solutions to discover the semantics of data elements in datasets. This process usually results in a set of mappings between different data elements in a centralized metadata repository. This allows data science users to understand their data and have visibility on whether or not they are clean, up-to-date, trustworthy, etc.

 

Automating metadata discovery

As numerous and diverse data get added to a data lake on a daily basis, maintaining ingestion can be quite a challenge! Automated solutions not only make it easier for data scientists or citizen data scientists to find their information, they also support metadata discovery.

 

Data cataloguing

A data catalog consists of metadata in which various data objects, categories, properties, and fields are stored. Data cataloguing is used for both internal and external data (from partners or suppliers, for example). In a data lake, it captures a robust set of attributes for every piece of content within the lake and enriches the metadata catalog by leveraging these information assets. This enables data science users to see the flow of the data, perform impact analysis, share a common business vocabulary, and maintain accountability and an audit trail for compliance.

 

Data & Analytics Governance

Data & analytics governance is an important use case when it comes to metadata management. Applied to data lakes, the question “could it be exposed?” must become an essential part of the organization’s governance model. Enterprises must therefore extend their existing information governance models to specifically address business analytics and data science use cases that are built on the data lakes. Enterprise metadata management helps in providing the means to better understand the current governance rules that relate to strategic types of information assets.

Contrary to traditional approaches, the key objective of metadata management is to drive a consistent approach to the management of information assets. The more metadata semantics are consistent across all assets, the greater the consistency and understanding, allowing the leveraging of information knowledge across the company. When investing in data lakes, organizations need to consider an effective metadata strategy for those information assets to be leveraged from the data lake.

 

Start metadata management with Zeenea

As mentioned above, implementing metadata management into your organization’s data strategy is not only beneficial, but essential for enterprises looking to create business value with their data. Data science teams working with various amounts of data in a data lake need the right solutions to be able to trust and understand their information assets. To support this emerging discipline, Zeenea gives you everything you need to collect, update and leverage your metadata through its next generation platform!

DataOps: How data catalogs enable better data discovery in a Big Data project


In today’s world, Big Data environments are more and more complex and difficult to manage. We believe that Big Data architectures should, among other things:

  • Retrieve information on a wide spectrum of data,
  • Use advanced analytics techniques such as statistical algorithms, machine learning and artificial intelligence,
  • Enable the development of data-oriented applications, such as a recommendation system on a website.

To put a successful Big Data architecture in place, enterprise data is stored in a centralized data lake, destined to serve various purposes. However, a massive and continuous influx of diverse and varied data from different sources can transform a data lake into a data swamp. So as business functions increasingly work with data, how can we help them find their way?

For your Big Data to be exploited to its full potential, your data must be well documented.

Data documentation is key here. However, documenting data, such as business names, descriptions, owners, tags, confidentiality levels, etc., can be an extremely time-consuming task, especially with millions of data assets in your lake!

With a DataOps approach, an agile framework focused on improving communication, integration, and automation of data flows between data managers and data consumers across an organization, enterprises can carry out their projects incrementally. Supported by a data catalog solution, they can easily map and leverage their data assets in an agile, collaborative, and intelligent manner.

 

How does a data catalog support a DataOps approach in your Big Data project?

Let’s go back to the basics… what is a data catalog?

A data catalog automatically captures and updates technical and operational metadata from an enterprise’s data sources and stores it in a unique source of truth. Its purpose is to democratize data understanding: to allow your collaborators to find the data they need via one easy-to-use platform that sits above the data systems. Data catalogs don’t require technical expertise to discover what is new and seize opportunities!

 

Effective data lake documentation for your Big Data

Think of Legos. Legos can be built into anything you want, but at their core, Legos are still just a set of bricks. These blocks can be shaped to fit any need, desire, or resource!

In your quest to facilitate your data lake journey, it is important to create effective documentation through the following:

  • Customizable layouts,
  • Interactive components,
  • A set of pre-created templates.

By offering modular templates, Data Stewards can simply and efficiently configure documentation templates according to their business users’ data lake search queries.

Monitor Big Data with automated capabilities

Through an innovative architecture and connectors, data catalogs connect to your Big Data sources, where the IT department can monitor the data lake. They can map new incoming datasets, be notified of any deleted or modified datasets, or even report errors to referring contacts, for example.

Users are able to access up-to-date information in real time!

These automated capabilities allow users to be notified of when new datasets appear, when they are deleted, when there are errors, when they were last updated, etc.

 

Support Big Data documentation with augmented capabilities

Intelligent data catalogs are essential for data documentation. They rest on artificial intelligence and machine learning techniques, one of which is “fingerprinting” technology. This feature offers the data users responsible for a particular data set suggestions as to its documentation. These recommendations can, for example, associate tags, contacts, or business terms from other data sets based on:

  • The analysis of the data itself (statistical analysis),
  • The schema resembling that of other data sets,
  • The links to other data sets’ fields.

An intelligent data catalog also detects personal/private data in any given data set and reports it in its interface. This feature helps enterprises respond to the various GDPR requirements in place since May 2018, and alerts potential users to a data set’s sensitivity level.

 

Enrich your Big Data documentation with Zeenea Data Catalog

Enrich your data’s documentation with Zeenea! Our metadata management platform was designed for Data Stewards, and centralizes all data knowledge in a single and easy-to-use interface.

Automatically imported, generated, or added by the administrator, data stewards are able to efficiently document their data directly within our data catalog.

Give meaning to your data with metadata!

How you’re going to fail your data catalog project (or not…)


There are many solutions on the data catalog market that offer an overview of all enterprise data, thanks to the efforts of data teams.

However, after a short period of use, due to the approaches undertaken by enterprises and the solutions that were chosen, data catalog projects often fall into disuse.

Here are some of the things that can make a data catalog project fail… or not!

Your objectives were not defined

Many data catalog projects are launched under a Big Bang approach, with the aim of documenting assets, but without truly knowing what their objectives are.

Fear not! To avoid bad project implementation, we advocate a model based on iteration and value generation. This approach allows for better risk control and the possibility of a faster return on investment.

The first effects should be observable at the end of each iteration. In other words, the objective must be set to produce concrete value for the company, especially for your data users.

For example, if your goal is data compliance, start documentation focused on these properties and target a particular domain, geographic area, business unit, or business process.

Your troops’ motivation will wear off over time…

While it is possible to gain buy-in and support for your company’s data inventory efforts in the early stages, it is impossible to maintain this commitment over time without automation capabilities.

We believe that descriptive documentation work should be kept to a minimum to keep your teams motivated. The implementation of a data catalog must be a progressive project, and it will only last if the effort required from each individual is smaller than the value they will get in the near future.

You won’t have the critical mass of information needed

For a data catalog to bring value to your organization, it must be richly populated.

In other words, when a user searches for information in a data catalog, they must, for the most part, be able to find it.

At the start of your data catalog implementation project, the chances that the information requested by a user is not available are quite high.

However, this transition period should be as short as possible so that your users can quickly see the value generated by the data catalog. By choosing a solution whose technology and connectivity to information sources allow it, a pre-filled data catalog will be available as soon as it is implemented.

Your catalog does not reflect your operational reality

In addition to these challenges, data catalogs must have a set of automated features that remain useful and effective over time. Surprisingly, many solutions do not offer these minimum requirements for a viable project and are unfortunately destined for a slow and painful death.

Connecting the data catalog to your sources will ensure that your data consumers have:

  • Reliable information, made available in the data catalog for analysis and use in their projects,
  • Fresh information: is it up to date, in real time?

How does Zeenea Data Catalog empower your data teams?


Data has become one of the main drivers for innovation for many sectors.

And as data continues to rapidly multiply, companies need to evolve and grasp new technologies to succeed in their data & analytics strategy. And this is where Zeenea Data Catalog comes in!

First of all, what are the problems of companies regarding their data?

As a leading data catalog solution for data-driven companies such as LCL, Société Générale, and Renault, we always come across, among others, three main issues:

  • Lack of visibility: With many different data sources (data warehouses, data lakes, cloud data, etc.), it becomes complicated for employees to find the relevant data, transforming Big Data into Big Chaos! It becomes confusing and demotivating for data teams to work with their data, as they end up spending most of their time wondering where their data actually is and whether it is still reliable.
  • Lack of knowledge: Most enterprises today have specific people or teams that handle data, since data is usually a very technical, complicated subject for employees. This lack of data sharing and communication reduces the enterprise’s potential to produce value at a local level, where any employee could become a valuable asset and data silos could disappear.
  • Lack of culture: As many companies have understood, it is essential to implement a data culture within the organization to truly become data-driven. Good change management requires the right people, processes, and solutions that create data literacy and facilitate the company’s data journey.

However, do not forget: with great data comes great responsibility! This refers to what we call a data democracy culture.

And there is an answer for all of these issues: a modern and smart data catalog software.

The choice of a data catalog in your company

As mentioned above, many leading companies have trusted Zeenea in their quest for implementing a data catalog solution. Choosing Zeenea is choosing:

  • An overview of all of an enterprise’s data assets through our connectors,
  • A Google-esque search engine that enables employees to intuitively search for a dataset, business term, or even a field from just a single keyword. Narrow your search with various personalized filters (reliability score, popularity, type of document, etc.).
  • A collaborative application that allows enterprises to become acculturated to data thanks to collective knowledge, discussions, feeds, etc,
  • A machine learning technology that notifies you and gives suggestions as to your catalogued data’s documentation,
  • A dedicated user experience that allows data leaders to empower their data explorers to become autonomous in their data journeys.

Learn more about our data catalog

Contact us for more information on our data catalog for teams

If you are interested in getting more information, getting a free personalized demo, or just want to say hi, do not hesitate to contact our team who will get back to you as soon as we’ve received your request 🙂

What is metadata management?


“By 2021, organizations will spend twice as much effort in managing metadata compared with 2018 in order to assess the value and risks associated with the data and its use.”

– Gartner, The State of Metadata Management

The definition of metadata management

As mentioned in our previous article “The difference between data and metadata”, metadata provides context to your data. And to trust your data’s context, you must understand it. Knowing the who, what, when, where, and why of your data means knowing your metadata, otherwise known as metadata management.

With the arrival of Big Data and various regulations, data leaders must look further into their data through metadata. Metadata is created whenever data is created, added, deleted, updated, or acquired. For example, metadata in an Excel spreadsheet includes the date of creation, the name, the associated authors, the file size, etc. Metadata can also include titles and comments made in the document.

In the past, a form of metadata management was looking up a book’s call number in a library catalog to find its location. Today, metadata management is used in software solutions to comply with data regulations, set up data governance, and understand the data’s value. This discipline has thus become essential for enterprises!

Why should you implement a metadata management strategy?

The first use case for metadata management is facilitating a person’s or program’s discovery and understanding of a specific data asset.

This requires setting up a metadata repository, then populating it and generating easy-to-use information from it.

Here are some of the benefits of metadata management:

    • A better understanding of the meaning of the enterprise’s data assets,
    • Better communication of a data set’s semantics via a data catalog,
    • More efficient data leaders, leading to faster project delivery,
    • Data dictionaries and business glossaries that allow the identification of synergies and the verification of coherent information,
    • Reinforced data documentation (deletions, archives, quality, etc.),
    • Audit and information trails (risk and security for compliance).

Manage your metadata with Zeenea’s metadata management platform

With Zeenea, transform your metadata into exploitable knowledge! Our metadata management platform automatically curates and updates your information from your storage systems. It becomes a unique, up-to-date source of knowledge for any data explorer in the enterprise.

How to evaluate your future Data Catalog?


The explosion of data sources in organizations, the heterogeneity of data, and new data-related demands make it essential to maintain the documentation of your information! However, enterprises continue to use older, more “traditional” methods to inventory and understand these new assets.

It is for this reason that Data Catalog solutions appeared in the market.

As you’ve probably noticed… there are many Data Catalog solutions available in the market! This profusion of offers leaves companies uncertain about which Data Catalog will best meet their expectations.

That said, on which criteria should you evaluate your future Data Catalog? We believe that your shortlist of solutions should, at a minimum, be assessed against these five founding principles:

1. One Data Catalog for all and all for one Data Catalog

Implementing a data catalog means having a metadata management strategy at the enterprise level.

In other words, acquiring a Data Catalog at the level of a single data store would recreate the well-known “data silos,” this time for metadata, making them difficult and complicated to manage and integrate into other systems.

A Data Catalog must become a reference point within your enterprise. The solution must connect to all your data storage or information systems from the most cutting-edge to those more traditional.

2. From declaration to automation

 

When evaluating solutions, think of choosing an automated data catalog. An essential building block of metadata management, automation will simplify the inventory of your information in your future data catalog and keep it up to date from your different databases.

This is a simple way to make accurate information available to your data users.

3. Simple!

Simple does not mean simplistic!

In this case, we think of the word “simple” for future data catalog users. A well-made interface designed for non-technical users will allow the organization to better adopt and retain the solution.

4. Progressively deploy a solution with the right support

 

To convince your users to adopt such a tool, evaluate not only its capabilities but also the support offered by the software vendor (or its partners) to help you put a metadata management strategy in place within your Data Catalog.

For example, at Zeenea, we work with our clients each step of the way, and with each metadata source to maximize the value of our solution (automation, search engine, collaboration, etc.) alongside a pilot population, growing over time.

5. From passive to active metadata

 

Your future Data Catalog should not be a simple inventory of information. Think about the range of possibilities this raw material offers today! A solution offering machine learning (through search, and through the profiling of metadata and/or data in the tool) enriches day-to-day documentation as well as the meaning and uses of your data assets.

Thus, transform your metadata into enterprise assets.

What is the difference between a data dictionary and a business glossary?


 

In metadata management, we often talk about data dictionaries and business glossaries. Although they might sound similar, they’re actually quite different! Let’s take a look at their differences and relations.

What is a data dictionary?

A data dictionary is a collection of descriptions of data objects or items in a data model.

These descriptions can include attributes, fields, or even properties on their data such as their types, transformations, relations, etc.

Data dictionaries help data explorers better understand their data and metadata. Usually in the form of tables or spreadsheets, data dictionaries are must-have IT knowledge for technical users such as developers, data analysts, data scientists, etc.

 

What is a business glossary?

While data dictionaries are useful to technical users, a business glossary is meant to bring meaning and context to data in all departments of the enterprise.

A business glossary is therefore a place where business and/or data terms are defined. It may sound simple, however, it is rare that all employees in an organization share a common understanding of even basic terms such as “contact” and “customer”.

Example of a business glossary:

The main differences between data dictionaries and business glossaries are:

  • Data dictionaries deal with database and system specifications and are mostly used by IT teams, whereas business glossaries are more accessible and standardize definitions for everyone in the organization.
  • Data dictionaries usually come in the form of schemas, tables, columns, etc., whereas a business glossary provides a unique definition for business terms in textual form.
  • A business glossary cross-references terms and their relationships, whereas a data dictionary does not.

What is the relation between a data dictionary and a business glossary?

The answer is simple: a business glossary provides meaning to the data dictionary.

For example, a US social security number (SSN) will be defined as “a unique number assigned by the US government for the purpose of identifying individuals within the US Social Security System” in the business glossary. In the data dictionary, the term SSN is defined as “a nine character string typically displayed with hyphens”.

If a data citizen ever has a doubt about what the term “SSN” means in the context of their data dictionary, they can always look up the associated business term in the business glossary.
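
As a minimal sketch, here is one way this cross-reference could be represented in code; the structures are assumptions for illustration, and the SSN wording follows the example above.

```python
business_glossary = {
    "SSN": "A unique number assigned by the US government for the purpose of "
           "identifying individuals within the US Social Security System.",
}

data_dictionary = {
    "ssn": {
        "type": "string",
        "format": "a nine character string typically displayed with hyphens",
        "glossary_term": "SSN",  # the link that gives the field its meaning
    },
}

field = data_dictionary["ssn"]
print(field["format"])                            # technical definition
print(business_glossary[field["glossary_term"]])  # business meaning
```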

 

Interested in automating a data dictionary and building a business glossary for your enterprise?

Create a central metadata repository for all corporate data sources with our data catalog thanks to our connectors and APIs.

Our tool also provides a user-friendly and intuitive way to build and import your business glossaries in order to link these definitions to any of Zeenea’s concepts.

Create a single source of truth in your enterprise!

Data Revolutions: Towards a Business Vision of Data


The use of massive data by the internet giants in the 2000s was a wake-up call for enterprises: Big Data is a lever for growth and competitiveness that encourages innovation. Today, enterprises are reorganizing themselves around their data in order to adopt a “data-driven” approach. It is a story with several twists and turns that is finally finding a resolution.

This article discusses the different enterprise data revolutions undertaken in recent years up to now, in an attempt to maximize the business value of data.

Siloed architectures

In the 80s, Information Systems developed immensely. Business applications were created, advanced programming languages emerged, and relational databases appeared. All these applications stayed on their owners’ platforms, isolated from the rest of the IT ecosystem.

For these historical and technological reasons, an enterprise’s internal data was distributed across various technologies and heterogeneous formats. On top of the organizational problems, we speak of a tribal effect: each IT department has its own tools and, implicitly, manages its own data for its own uses. We are witnessing a kind of data hoarding within organizations. To back this up, we frequently recall Conway’s law: “All architecture reflects the organization that created it.” This siloed organization makes cross-referencing data originating from two different systems very complex and costly.

The search for a centralized and comprehensive vision of an enterprise’s data will lead Information Systems to a new revolution. 

The concept of a Data Warehouse

By the end of the 90s, Business Intelligence was in full swing. For analytical purposes and with the goal of responding to all strategic questions, the concept of a data warehouse appeared. 

To build one, data is recovered from mainframes or relational databases and transferred using an ETL (Extract, Transform, Load) tool. Projected into a so-called pivot format, the data can then be accessed by analysts and decision-makers, collected and formatted to answer pre-established questions and specific cases of reflection. From the question, we get a data model!

This revolution comes with its own problems… Using ETL tools has a certain cost, not to mention the hardware that comes with them. And the time elapsed between the formalization of a need and the receipt of the report is long. It is a costly revolution for perfectible efficiency.

The new revolution of a data lake…

The arrival of data lakes reverses the previous reasoning. A data lake enables organizations to centralize all useful data, regardless of source or format, at a very low cost. We store an enterprise’s data without presuming its usage in the treatment of a future use case. Only for a specific use do we select this raw data and transform it into strategic information.

We are moving from an “a priori” to an “a posteriori” logic. This data lake revolution calls on new skills and knowledge: data scientists and data engineers are capable of launching data treatments and producing results much faster than in the days of data warehouses.

Another advantage of this promised land is its price. Often available as open source, data lakes are cheap, as is the commodity hardware that comes with them.

… or rather a data swamp

The data lake revolution brings certain advantages, but it comes along with new challenges. The expertise needed to instantiate and maintain these data lakes is rare and thus costly for enterprises. Additionally, pouring data into a data lake day after day without efficient management or organization brings the serious risk of rendering the infrastructure unusable, with data inevitably lost in the mass.

This data management is accompanied by new issues related to data regulation (GDPR, CNIL, etc.) and data security: topics that already existed in the data warehouse world. Finding the right data for the right use is still not an easy thing to do.

The settlement: constructing Data Governance

The internet giants understood that centralizing data is the first step, but an insufficient one. The last brick needed to move towards a “data-driven” approach is to construct data governance. Innovating through data requires greater knowledge of that data. Where is my data stored? Who uses it? With which goal in mind? How is it being used?

To help data professionals chart and visualize the data life cycle, new tools have appeared: we call them “Data Catalogs.” Located above data infrastructures, they allow you to create a searchable metadata directory. They make it possible to acquire both a business and a technical vision of data by centralizing all collected information. In the same way that Google doesn’t store web pages but rather their metadata in order to reference them, companies must store their data’s metadata in order to facilitate its exploitation and discovery. Gartner confirms this in their study “Data Catalogs Are the New Black”: a data lake without metadata management and governance will be considered inefficient.

Thanks to these new tools, data becomes an asset for all employees. An easy-to-use interface that requires no technical skills becomes a simple way to know, organize, and manage data. The data catalog becomes the enterprise’s collaborative tool of reference.

Acquiring an all-round view of your data and starting a data governance initiative to drive ideation thus becomes possible.

How Artificial Intelligence enhances data catalogs


Can machines think? We are talking about artificial intelligence, “the biggest myth of our time”!

A simple definition for AI could be: “a set of applied theories and techniques to create machines capable of simulating intelligence.” Among those AI functions, there is deep learning, an automated learning method used to process data.

Data must be understood and accessible. It is with the help of an intelligent data catalog that data users, such as data scientists, can easily search for and efficiently choose the right datasets for their machine learning algorithms.

Let’s see how.

Search engine: facilitating dataset search

By connecting to all of an enterprise’s data sources, a data catalog can efficiently pull up a maximum amount of documentation (otherwise known as metadata) from its storage systems.

This information, indexed and filterable in Zeenea’s search engine, allows data users to quickly reach the data sets they need.

Recommendation system

Guiding Data Scientists in their choices

An intelligent data catalog is a tool that rests on “fingerprinting” technology. This feature gives data users recommendations as to which data sets are the most relevant for their projects based on, among other criteria (a short scoring sketch follows this list):

  • How the data is used,
  • The quality and scoring of the documentation,
  • The user’s previous searches,
  • What other users search for.
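
A hedged sketch of how these criteria might combine into a single relevance score; the weights and field names are invented for the example.

```python
def recommendation_score(asset, user):
    """Weighted blend of the criteria above (weights are illustrative)."""
    tag_overlap = len(set(asset["tags"]) & set(user["past_search_tags"]))
    return (
        2.0 * asset["usage_count"]            # how the data is used
        + 1.5 * asset["documentation_score"]  # documentation quality/scoring
        + 1.0 * tag_overlap                   # the user's previous searches
    )

asset = {"usage_count": 12, "documentation_score": 0.8, "tags": ["sales"]}
user = {"past_search_tags": ["sales", "crm"]}
print(recommendation_score(asset, user))
```
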
Giving more meaning to their datasets

This feature offers the data users responsible for a particular data set suggestions as to its documentation. These recommendations can, for example, associate tags, contacts, or business terms from other data sets based on:

  • The analysis of the data itself (statistical analysis),
  • The schema resembling that of other data sets,
  • The links to other data sets’ fields.

Automatically contextualizing data sets in a data catalog allows any data user to work with data that is understood and appropriate for their use cases.

Automatic dataset linking: visualizing your data’s life cycle

As mentioned above, with fingerprinting technology, a data catalog can recognize data sets and connect them to one another. This is called data lineage: a visual representation of data life cycles.

Automatic error detection: be aware of errors in datasets

To overcome potential data interpretation problems, an intelligent data catalog must be able to automatically detect errors or misunderstandings in the quality and documentation of any data.

This key feature, based on the analysis of the data or of its documentation, must alert data users to integrity issues.

GDPR notification: be notified of sensitive information

An intelligent data catalog must be able to detect personal/private data in any given data set and report it in its interface. This feature helps enterprises respond to the various GDPR requirements in place since May 2018, and also alerts potential users to the sensitivity level and permitted uses of their data.
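
As a toy illustration, the sketch below flags sampled values that match simple patterns for personal data; a production catalog would combine such rules with ML classifiers, and the patterns here are assumptions made for the example.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_fr": re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"),
}

def flag_sensitive(sample):
    """Return the set of personal-data types spotted in the sample."""
    return {
        name
        for name, pattern in PII_PATTERNS.items()
        if any(pattern.search(str(value)) for value in sample)
    }

print(flag_sensitive(["jane@corp.com", "06 12 34 56 78"]))
```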

The role of metadata in a data-driven strategy


Our conviction is that a company must strike a balance between control and flexibility in its use of data. In short, companies must be able to adopt a data strategy that is both encouraging and easy to use, all while minimizing risks.

We are convinced that such governance is achievable if your collaborators are able to answer these few questions:

  • What data are present in our organization?
  • Are these data sufficiently documented to be understood and mastered by the collaborators in my organization?
  • Where do they come from?
  • Are they secure?
  • What rules or restrictions apply to my data?
  • Who are the people in charge? Who are the “knowers”?
  • Who uses these data? How?
  • How can your collaborators access them?

These metadata (information about data) become strategic information within enterprises. They describe various technical, operational or business aspects of the data you have.

By constituting a unified metadata repository, both centralized and accessible, you are guaranteed precise data, which are consistent and understood by the entire enterprise.

The benefits of a metadata repository

We bring our experience to bear in building well-founded governance based on metadata management. We are firmly convinced that we cannot govern what we do not know! Building a metadata repository thus constitutes a solid working base for starting to govern your data.

Among other things, it will allow you to:

  • Curate your assets;
  • Assign roles and responsibilities on your referenced data;
  • Be completed by your employees in a collaborative manner;
  • Strengthen your regulatory compliance.

Concentrating efforts on metadata and building such a frame of reference is one key characteristic of an agile approach to data governance.

Download our white paper
Why start an agile data governance?

Data catalog: a self-service data platform


A data catalog is a portal that brings together metadata on the data sets collected by the enterprise. This classified and organized information lets data users (re)discover relevant data sets for their work.

A new wave of data catalogs has appeared on the market. Their purpose is to enroll the enterprise in a data-driven approach. Any authorized person in the enterprise must be able to access, understand, and contribute to data documentation, without needing technical skills. This is what we call self-service data.

Zeenea identified the 4 characteristics that the new generation of data catalogs must respect. A data catalog must be:

  • An enterprise’s data catalog. A data catalog must be connected to all of the enterprise’s data sources to collect and regroup all metadata in a single centralized location to avoid the multiplication of tools.

  • A catalog of connected data. We believe that a data catalog must always be up to date and accurate in the information it provides in order to be useful for its users. By being connected to data sources, the data catalog can import the documentation from storage systems and ensure an automatic update of metadata in both structures (storage systems and data catalog); a minimal harvesting sketch follows this list.

  • A collaborative data catalog. In a user-centric approach, a data catalog must be the enterprise’s reference data tool. By involving employees through collaborative features, the enterprise benefits from collective intelligence. Sharing, assigning, commenting, and qualifying within the same data catalog increases productivity and knowledge among all of your collaborators.

  • An intelligent data catalog. Choosing a data catalog equipped with artificial intelligence, for example for the auto-population of metadata, allows your data managers to become more efficient.
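To illustrate the “connected” characteristic, here is a minimal harvesting-and-synchronization sketch, using SQLite as a stand-in for an enterprise data source; the function names and catalog structure are our own assumptions:

```python
import sqlite3

def harvest_metadata(conn):
    """Extract table and column metadata from a SQLite source."""
    tables = {}
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in rows.fetchall():
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        tables[table] = [{"name": c[1], "type": c[2]} for c in cols]
    return tables

def sync_catalog(catalog, harvested):
    """Upsert harvested entries so the catalog stays up to date."""
    for table, columns in harvested.items():
        entry = catalog.setdefault(table, {"documentation": ""})
        # Technical metadata is refreshed; hand-written docs are preserved.
        entry["columns"] = columns
    return catalog

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
catalog = sync_catalog({}, harvest_metadata(conn))
print(catalog["customers"]["columns"])
# [{'name': 'id', 'type': 'INTEGER'}, {'name': 'email', 'type': 'TEXT'}]
```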

These characteristics will be the subject of more in-depth articles.

The 3 types of metadata to master to be a data-centric enterprise


Metadata is structured information that describes, explains, tracks, and facilitates the access, use, and management of an information resource. The most frequently cited definition is, “data on the data.” In a data-centric approach, what types of metadata does an enterprise have to make available to render data consumers more autonomous and productive?

Our definition of metadata

Metadata is contextualized data. In other words, they answer the questions of “who, what, where, when, why, and how,” of a data set. They must enable both IT and business teams to understand and work on relevant and quality data.

What are the three types of metadata?

At Zeenea, we speak of three types of metadata within our data catalog. Here are some examples of each (a structural sketch follows the list):

  • Technical metadata: they describe the structure of a data set and storage information.
  • Business metadata: they apply a business context to data sets: descriptions (context and use), owners and referents, tags and properties, with the goal of creating a taxonomy above the data sets that will be indexed by our search engine. Business metadata are also present at the schema level of a data set: descriptions, tags, and even the confidentiality level of the data by column.
  • Operational metadata: they make it possible to understand when and how the data was created or transformed: statistical analysis of the data, date of update, provenance (lineage), volume, cardinality, the processing operations that created or transformed the data, the status of those processing operations, etc.
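As a rough structural sketch, the three types can be grouped into a single record per dataset. The field names below are our own choices for illustration, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Technical metadata: structure and storage
    name: str
    storage: str
    schema: dict = field(default_factory=dict)            # column -> type
    # Business metadata: context, ownership, taxonomy
    description: str = ""
    owner: str = ""
    tags: list = field(default_factory=list)
    confidentiality: dict = field(default_factory=dict)   # column -> level
    # Operational metadata: freshness, provenance, statistics
    last_updated: str = ""
    row_count: int = 0
    upstream: list = field(default_factory=list)          # lineage sources

entry = DatasetMetadata(
    name="orders",
    storage="s3://datalake/sales/orders",
    schema={"order_id": "string", "amount": "decimal"},
    description="One row per customer order",
    owner="sales-data-team",
    tags=["sales", "finance"],
    confidentiality={"amount": "internal"},
    last_updated="2021-03-01",
    row_count=1_250_000,
    upstream=["crm.opportunities"],
)
```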

Conclusion

Metadata management is an integral part of an enterprise’s agile data governance strategy. Maintaining an up-to-date metadata directory ensures that data consumers can use reliable and relevant data for their use cases.

Who are Data Stewards?


Digital transformations bring about new challenges in the data industry. We hear more and more about data stewardship: an activity focused on the management and documentation of an organization’s data. In this article, we would like to present data stewards, the enterprise’s true guardians of data, and take a closer look at their role, their missions, and their tools.

This article is a summary of interviews conducted with more than 25 data stewards in medium-sized and large French enterprises. The goal was to understand their tasks and their difficulties in metadata management, in order to provide solutions within our data catalog.

The Data Steward’s role in the enterprise

Enterprises are reorganizing themselves around their data to produce value and, finally, to innovate from this raw material. Data stewards are there to orchestrate the data of the enterprise’s systems. They must ensure the proper documentation of data and facilitate its availability to its users, such as data scientists or project managers. Their communication skills enable them to identify the data managers and knowers, and to collect the associated information in order to centralize it and perpetuate this knowledge within the enterprise. In short, data stewards provide metadata: a structured set of information describing datasets. They transform this abstract data into concrete assets for the business.

The profession is on the rise! It deals with trending topics, and its social role allows data stewards to work with both technical and business people. Data stewards are the first point of reference for data in the enterprise and serve as the entry point for accessing data.

They have the technical and business knowledge of data, which is why they are called “masters of data” within an organization!

Data Steward’s missions

Their objective is quite clear: data stewards must take part in the data governance of their enterprise. They must find and understand data, impose a certain discipline in metadata management, and facilitate data availability for users.

These are just some of the subjects that data stewards must address. To achieve this, data stewards must ensure that the data documentation they manage is well maintained. They are free to suggest the method and format of technical and business documentation of their choice. Their days are punctuated by the search for data managers and knowers, whose knowledge they gather into a tool that both technical and business users can exploit. They want the actors of data projects to be able to connect and collaborate in order to improve information sharing and productivity for all.

 

Equipping Data Stewards

Data steward is, therefore, a new profession whose missions still need clarification, whose tools have yet to be identified, and whose necessity within the enterprise still has to be evangelized. As a result, enterprises still have difficulty allotting it a clear budget, and it is therefore difficult for data stewards to be properly equipped to ensure the proper control and management of their data.

Yet, when well equipped, data stewards are able to:

 

  • become autonomous in data management activities,
  • centralize the information collected on the data,
  • manage the obsolescence of documentation,
  • report errors and/or changes to data,
  • identify relevant data to send to their users,
  • expose data to their users from a collaborative tool.

Such an approach can be successful where many larger “data governance” initiatives have failed.

 

In conclusion

To this day, we are convinced that the data steward role is indispensable for constructing and orchestrating efficient data governance in the enterprise. This is the direction Zeenea is taking by offering dynamic and connected documentation of the enterprise’s data. Otherwise known as a data catalog, its ambition is to become the reference tool for data stewards: managing data in a user-friendly way, centralizing all collected metadata, opening data to its users according to sensitivity level, and managing data quality, all in one place.

In a virtuous circle, the data catalog will bring increased value to data users once the data steward industrializes the addition of metadata and the contribution of collaborators in the tool.

What is a Data Catalog?


In 2017, Gartner declared data catalogs as “the new black in data management and analytics”. Now, they have become a MUST-HAVE solution for data leaders! In “Augmented Data Catalogs: Now an Enterprise Must-Have for Data and Analytics Leaders” they state:

“The demand for data catalogs is soaring as organizations continue to struggle with finding, inventorying and analyzing vastly distributed and diverse data assets.”

At Zeenea, we broadly define a data catalog as being:

“A detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose.” 


Why a Data Catalog?

Data topics are still considered an extremely technical domain. However, data innovation is only possible if data is shared with as many people as possible. Business teams must have the autonomy to access data in order to measure, launch, or optimize a product or service. Innovation requires a flexibility and agility that is, to this day, scarcely present in organizations.

Democratize data knowledge!

This is the very reason for data catalogs: to allow your employees to find the data they need via a single easy-to-use platform that sits above your data systems. Data catalogs don’t require technical expertise to discover what is new and seize opportunities.

Business analysts, data scientists, as well as business teams become autonomous in data exploration. As for CDOs and data stewards, they are finally equipped to build data governance, evangelizing a data-driven culture within their organizations.

>> Why does data culture matter? Webinar replay

What are the purposes of a Data Catalog?

A data catalog allows you to acquire business and technical views of the data stored in your data sources. It centralizes and unifies the collected information so that it can be shared with IT teams and business functions, then connected to the enterprise’s tools. This unified view of data allows you to:

Build Agile Data Governance

A connected data catalog enables you to curate the data directly retrieved from your enterprise’s information systems. This way, your organization starts building an understandable and reliable data asset landscape via a centralized platform. We believe in a bottom-up approach where global knowledge of your assets is the starting point of your data governance, instead of deploying overly complex processes, too difficult to maintain, on assumed information. On top of the knowledge provided by a connected data catalog, the organization can then open up, step by step and with a feedback loop, the creation of roles, processes, and access to the data.

> Why start agile data governance? Free white paper 


Start metadata management

A data catalog enables you to create a technical and business metadata directory. It synchronizes metadata with data sources and enforces documentation by your data teams (data owners, data stewards, users, and so on), ultimately maintaining a powerful and reliable data asset landscape at the enterprise level over time.

> Read our white paper about metadata management


Sustain a data culture

A data catalog becomes the reference data tool for all employees. As its interface does not require technical expertise to discover and understand the data, knowledge of the data assets is no longer limited to a group of experts. It also allows your organization to better collaborate on those assets and work on them in a simple way. At Zeenea, we consider a data catalog to be a cornerstone for building a powerful data democracy.

> Read our white paper about Data democracy 

Accelerate data discovery

As thousands of datasets and assets are created each day, enterprises struggle to understand and gain insights from their information in order to create value. Many recent surveys still state that data science teams spend 80% of their time preparing and cleaning their data instead of analyzing it. By deploying a data catalog in your organization, data discovery can become up to 5 times faster, so your data teams can focus on what’s important: delivering their data projects on time.

> Read our white paper about Data Discovery through the eyes of Web Giants


What are the key features of a Data Catalog?

Metadata registry

For each element, the metadata registry can include a business and technical description, the owners, and quality indicators; it also supports a taxonomy (properties, tags, etc.).


Search Engine

All metadata collected in the registry can be queried through the data catalog’s search engine. Results can be sorted and filtered at every level.


Data lineage and processing registry

Thanks to data lineage, it is possible to visualize the complete origin and transformations of a specific piece of data over time. This allows you to understand where the data originates, and when and where it splits off from or merges with other data.

These transformations and processing operations are in this way recorded in what we call a processing registry, indispensable for responding to the expectations of the GDPR and other upcoming data regulations.
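Conceptually, dataset-level lineage can be stored as a simple graph of sources and traversed to reconstruct a data set’s full origin. The sketch below uses illustrative dataset names:

```python
# Upstream lineage as an adjacency map: dataset -> its direct sources.
LINEAGE = {
    "sales_report": ["orders_clean"],
    "orders_clean": ["orders_raw", "currency_rates"],
    "orders_raw": [],
    "currency_rates": [],
}

def trace_upstream(dataset, lineage, seen=None):
    """Walk the graph to list every dataset a given one derives from."""
    seen = set() if seen is None else seen
    for source in lineage.get(dataset, []):
        if source not in seen:
            seen.add(source)
            trace_upstream(source, lineage, seen)
    return seen

print(sorted(trace_upstream("sales_report", LINEAGE)))
# ['currency_rates', 'orders_clean', 'orders_raw']
```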


Collaborative functions

In a user-centric approach, a data catalog is the enterprise’s reference data tool. It allows data to be visualized as an asset and worked on in a transparent manner. Sharing, assigning, commenting, and qualifying inside the tool itself increases productivity and knowledge among all collaborators.


Personal Information detection / enriched documentation  

With machine learning and artificial intelligence, a data catalog is able to detect sensitive data directly within the platform, including when new data is imported into it. A data catalog is also able to monitor data activity and warn data stewards in case of problems.

What are a Data Catalog’s use cases? And for whom?

For Chief Data Officers

Learn more about Chief Data Officers >

The Chief Data Officer plays a key role in the overall data strategy of an enterprise; their purpose is to master their data and facilitate access to it in order to become data-driven. A data catalog helps them:

  • Ensure data reliability and value
  • Create a data literate organization 
  • Valorize a data set’s context for data explorers
  • Evangelize a data culture with rights and duties
  • Start a compliance process with the European regulation (GDPR).

For Data Stewards

Learn more about Data Stewards >

Known as the main contact for data inquiries thanks to their technical and operational knowledge, the Data Steward is most commonly nicknamed the “Master of data”! A data catalog enables data stewards to:

  • Centralize data knowledge in a single platform
  • Enrich data documentation
  • Establish communication between them and data explorers
  • Qualify the value of data
  • Start metadata management.

> Learn more about our data catalog for data managers: Zeenea Studio

 

For Data Scientists 

A data scientist’s missions are, among others, to develop predictive models, to make data understandable and exploitable for the enterprise’s top management, and to build machine learning algorithms.

To achieve these missions, collaborators must be able to determine what data is available, which data they really need, understand the data (context and quality), and finally know how to retrieve it! A data catalog helps them:

  • Easily find data, regardless of where they are stored.
  • View the history of the data sets: date of creation and the actions carried out on it.
  • Understand the professional context of data.
  • Identify the knowers by data set.
  • Easily collaborate with peers.
  • Automatically document data through their actions within the data catalog.
  • Receive recommendations of relevant data based on other consulted data sets.

> Learn more about our data catalog for data teams: Zeenea Explorer

 

A representative data catalog journey

Data catalogs are an essential building block in any organization’s data strategy, and for good reason. A data catalog becomes extremely handy in the different phases of your projects:

A data catalog in the deployment phase

Connect to your data sources

A data catalog plugs into all your data sources. Connect your data integration, data preparation, data visualization, CRM solutions, etc., in order to fully integrate all your technologies into a single source of truth.

View our connectors

A data catalog in the documentation phase

Create a metamodel

A data catalog captures and updates technical and operational metadata from an enterprise’s data sources. It also allows the data catalog’s administrator to add, configure, and overlay information (mandatory or not) on the cataloged datasets. This additional information is called properties! It mainly refers to business and operational documentation.
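As an illustration, a metamodel can be thought of as a declaration of properties and their constraints, against which the catalog validates documentation. The property names and validation logic below are assumptions of our own:

```python
# A hypothetical metamodel: each property declares its type and
# whether filling it in is mandatory for every cataloged dataset.
METAMODEL = {
    "description": {"type": "text", "mandatory": True},
    "owner": {"type": "user", "mandatory": True},
    "retention_days": {"type": "number", "mandatory": False},
    "confidentiality": {
        "type": "enum",
        "values": ["public", "internal", "restricted"],
        "mandatory": True,
    },
}

def missing_properties(dataset_properties, metamodel):
    """Report which mandatory properties a dataset has not filled in."""
    return [
        name for name, spec in metamodel.items()
        if spec["mandatory"] and not dataset_properties.get(name)
    ]

print(missing_properties({"description": "Daily order extracts"}, METAMODEL))
# ['owner', 'confidentiality']
```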

Build your metamodel template

A data catalog in the discovery phase

Understand your data

With a data catalog, your data citizens – with technical capabilities or not – are able to fully understand their enterprise data. A data catalog allows users to access and easily search for any information within the catalog.

Define your data

A data catalog allows data leaders, such as data stewards or chief data officers, to correctly define the pertinent data to be used. Through metadata, data managers can easily document their datasets, allowing their data teams to access contextualized data. 

Explore your data

Discover and collect available data in a data catalog. By cataloguing all enterprise data in a central repository, data citizens are able to ensure that their data is reliable and usable.

A data catalog in the collaboration phase

Communicate with data

A data catalog allows users to become data fluent. Both the IT & business departments are able to understand and communicate around different data projects. Through collaborative features such as discussions, data becomes a topic for all to share across the enterprise. 

The key takeaways of a data catalog

Now that we know everything about data catalogs, there are three main takeaways to keep in mind about what data catalogs do for your enterprise:

Maximize the value of data

By collecting all the data of an enterprise on a reference data tool, it becomes possible to cross-reference these assets and get value from them more easily. The collaboration of technical and professional teams within the data catalog enables innovations that meet proven market needs.

Produce better and faster

Your teams have confirmed it: more than 70% of the time dedicated to data analysis is invested in data wrangling activities. Cataloging simplifies data retrieval and the identification of knowers, and therefore supports intelligent decision-making.

Ensure good control over data

When data is misinterpreted or erroneous, enterprises expose themselves to the risk of basing their decisions on incorrect information. Connected data catalogs provide access to always up-to-date data, so data users can ensure that the data and its information are correct and usable.

> Download our white paper: Why do you need a data catalog to be data centric?

Data mapping: The challenges in an organization


The arrival of Big Data did not simplify how enterprises work with data. The volume and variety of data, and the number of data storage systems, are exploding.

As evidence of this, Matt Turck publishes what is known as the Big Data Landscape. Updated every year, this infographic shows the key players in the various sub-domains of Big Data.

Thus, with the Big Data revolution, it is even more difficult to answer “primary” questions related to data mapping:

  • What are the most pertinent datasets and tables for my use cases and my organization?

  • Do I have sensitive data? How are they used?

  • Where does my data come from? How have they been transformed?

  • What will be the impacts on my datasets if they are transformed?

>> Download our toolkit: Metamodel template <<

These are the questions that information systems managers, Data Lab managers, Data Analysts, and even Data Scientists ask themselves in order to deliver efficient and pertinent data analysis.

Answering these questions also allows enterprises to:

  • Improve data quality: Providing as much information as possible allows users to know if the data is suitable for use.

  • Comply with European regulations (GDPR): tag personal data and the processing operations carried out on it.

  • Make employees more efficient and autonomous in understanding data through graphical and ergonomic data mapping.

To put these into action, companies must build what is called data lineage.

How to map your information system’s data?


Data lineage is defined as the life cycle of data: its origin, movements, and impacts over time. It offers greater visibility and simplifies data analysis in case of errors.

With the emergence of Big Data, and with information systems becoming more complex, data lineage has become an essential tool for data-driven enterprises. How can we represent the life cycle of data in a way that is intelligible, maintainable, and offers the right granularity of information?

We are witnessing a paradigm shift in the representation and formalization of data mapping.

View the video (FRENCH)

Feature note: The Metadata Search Engine

Zeenea’s very purpose is to be an enterprise’s metadata catalog!

Once metadata is indexed in our tool, teams become autonomous in finding the relevant data sets for their innovative projects.

Search for the right dataset and you will find it!

All cataloged metadata are indexed in a search engine, allowing for the discovery of data sets or attributes (columns) of the catalog by keyword search.

It also lets you filter the results of this search (sorted by relevance) by using Zeenea’s tag system and data set properties.
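A stripped-down sketch of this keyword-plus-filter behavior follows; relevance ranking is omitted, and the entries and field names are illustrative rather than Zeenea's internal model:

```python
CATALOG = [
    {"name": "orders", "description": "customer orders by day",
     "tags": ["sales"]},
    {"name": "customers", "description": "customer master data",
     "tags": ["crm", "pii"]},
]

def search(keyword, tags=None):
    """Keyword search over names and descriptions, filtered by tags."""
    results = []
    for entry in CATALOG:
        text = f"{entry['name']} {entry['description']}".lower()
        if keyword.lower() in text:
            # Keep only entries carrying every requested tag.
            if not tags or set(tags) <= set(entry["tags"]):
                results.append(entry["name"])
    return results

print(search("customer"))                # ['orders', 'customers']
print(search("customer", tags=["pii"]))  # ['customers']
```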

Sample

In order to check the relevance of a data set identified through the search engine, it is possible to explore the cataloged data as well as to download a sample of the data set from Zeenea’s graphical interface.

How to map your information system’s data?

Data lineage in a big data environment

Data lineage is defined as a type of data life cycle. It is a detailed representation of any data over time: its origin, processes, and transformations. Although this isn’t a brand new concept, a paradigm shift is taking place…

Obtaining data lineage from a Data Warehouse, for example, was a fairly simple task: this centralized storage system allowed you, by design, to obtain lineage for data stored in a single place.

The data ecosystem has been evolving at a very rapid pace since the emergence of Big Data due to the appearance of various technologies and storage systems that complicate information systems in enterprises.

It has become impossible both to keep and to impose a single centralized tool in organizations. The software and methods used by the IS architects and urbanization specialists of the “old world” have become less and less maintainable, making their work obsolete and illegible.

So, how can you visualize an efficient data lineage in a Big Data environment?

In order to have a global vision of an enterprise’s IS data, new tools are emerging: we are talking about data catalogs. A data catalog allows a maximum amount of metadata from all data storage systems to be processed via a user-friendly interface. By centralizing all of this information, it is possible to build data lineage in a Big Data environment at different levels:

  • At the dataset level. A dataset can be a table in Oracle, a topic in Kafka, or even a directory in the data lake. A data catalog highlights the processes and upstream datasets that made it possible to create the final dataset.

However, this level of lineage on its own does not make it possible for data users to answer all of their questions. Among others, these questions remain: what about sensitive data? Which columns were created, and by which processes? etc.

  • At the column level. A more granular way to approach this topic is to represent the different transformation stages of a dataset as a timeline of actions/events. By selecting a specific field, users can see which columns and actions created it, as the sketch below illustrates.
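A minimal sketch of column-level lineage might record, for each derived column, its input columns and the operation that produced it; all names below are illustrative:

```python
# Column-level lineage: derived column -> operation and input columns.
COLUMN_LINEAGE = {
    "report.revenue_eur": {
        "operation": "multiply",
        "inputs": ["orders.amount_usd", "rates.usd_eur"],
    },
    "orders.amount_usd": {"operation": "ingest", "inputs": []},
    "rates.usd_eur": {"operation": "ingest", "inputs": []},
}

def explain(column, lineage, depth=0):
    """Print the chain of operations that produced a column."""
    node = lineage.get(column)
    if node is None:
        return
    print("  " * depth + f"{column} <- {node['operation']}")
    for source in node["inputs"]:
        explain(source, lineage, depth + 1)

explain("report.revenue_eur", COLUMN_LINEAGE)
# report.revenue_eur <- multiply
#   orders.amount_usd <- ingest
#   rates.usd_eur <- ingest
```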