smart data catalogs

The term “smart data catalog” has become a buzzword over the past few months. However, when something is described as “smart”, most people automatically think, and rightly so, of a data catalog with Machine Learning capabilities only.

At Zeenea, we do not believe that a smart data catalog can be reduced to its ML features!

In fact, there are many different ways to be “smart”. 

This article is based on the talk that Guillaume Bodet, co-founder and CEO of Zeenea, gave at the Data Innovation Summit 2020: “Smart Data Catalogs, A must-have for leaders”.

A quick definition of data catalog

We define a data catalog as being:

A detailed inventory of all data assets in an organization and their metadata, designed to help data professionals quickly find the most appropriate data for any analytical business purpose.

A data catalog is meant to serve different people, or end-users. These end-users include data analysts, data stewards, data scientists, business analysts, and many more, and they all have different expectations, needs, profiles, and ways of understanding data. As more and more people use and work with data, a data catalog must be smart for all of them.

What does a “data asset” refer to?

An asset, financially speaking, typically appears on the balance sheet with an estimate of its value. Data assets are just as important as other enterprise assets, and in some cases even more so. The issue is that the value of data assets isn’t always known.

However, there are many ways to tap the value of your data. Enterprises can use their data’s value directly, for example by selling or trading it. Many organizations do this: they clean the data, structure it, and then sell it.

Enterprises can also make value indirectly from their data. Data assets enable organizations to:

  • Innovate for new products/services
  • Improve overall performance
  • Improve product positioning
  • Better understand markets/customers
  • Increase operational efficiency

High-performing enterprises are those that master their data landscape and exploit their data assets in every aspect of their activity.

The hard things about data catalogs…

When your enterprise deals with large volumes of data, you are possibly dealing with:

  • 100s of systems that store internal data (data warehouses, applications, data lakes, datastores, APIs, etc) as well as external data from partners.
  • 1,000s of datasets, models, and visualizations (data assets) that are composed of thousands of fields.
  • And these fields contain millions of attributes (or metadata)!

Not to mention the hundreds of users using them…

This raises two different questions.  

How can I build, maintain, and enforce the quality of my information for my end-users to trust in my catalog?

How can I quickly find data assets for specific use cases?

The answer is in smart data catalogs!

At Zeenea, we believe that there are five core areas of “smartness” for a data catalog. It must be smart in its:

  • Design: the way users explore the catalog and consume information,
  • User experience: how it adapts to different profiles,
  • Inventories: how it inventories data assets in a smart and automatic way,
  • Search engine: how it supports different expectations and gives smart suggestions,
  • Metadata management: how it tags and links data together through ML features.

Let’s go into detail for each of these areas.

A smart design

Knowledge graph

A data catalog with smart design uses knowledge graphs rather than static ontologies (a way to classify information, most of the time built as a hierarchy).  The problem with ontologies is that they are very hard to build and maintain, and usually only certain types of profiles truly understand the various classifications.

A knowledge graph, on the other hand, represents the different concepts in a data catalog and links objects together through semantic or static links. The idea of a knowledge graph is to build a network of objects and, more importantly, to create semantic or functional relationships between the different assets in your catalog.

Basically, a smart data catalog provides users with a way to find and understand related objects.
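
To make the idea concrete, here is a minimal sketch of a knowledge graph in Python. It is not Zeenea's implementation: the `KnowledgeGraph` class, the object identifiers, and the link names (`described_by`, `feeds`, `governed_by`) are all hypothetical and only illustrate how typed links let a user move from one object to the related ones.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph: nodes are catalog objects, edges are typed links."""

    def __init__(self):
        # node -> list of (link_type, neighbour)
        self._edges = defaultdict(list)

    def link(self, source, link_type, target):
        # Store the link in both directions so related objects
        # can be found starting from either end.
        self._edges[source].append((link_type, target))
        self._edges[target].append((link_type, source))

    def related(self, node, max_hops=1):
        """Return objects reachable from `node` within `max_hops` links."""
        seen = {node}
        frontier = [node]
        for _ in range(max_hops):
            next_frontier = []
            for current in frontier:
                for _link_type, neighbour in self._edges[current]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        next_frontier.append(neighbour)
            frontier = next_frontier
        seen.discard(node)
        return sorted(seen)


# Hypothetical catalog objects and links, for illustration only.
graph = KnowledgeGraph()
graph.link("dataset:orders", "described_by", "glossary:Order")
graph.link("dataset:orders", "feeds", "dashboard:sales")
graph.link("glossary:Order", "governed_by", "policy:retention")

print(graph.related("dataset:orders"))
# ['dashboard:sales', 'glossary:Order']
print(graph.related("dataset:orders", max_hops=2))
# ['dashboard:sales', 'glossary:Order', 'policy:retention']
```

Unlike a fixed hierarchy, adding a new kind of relationship here only means adding a new link; nothing has to be reclassified.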

Adaptive metamodels

In a data catalog, users will find hundreds of different properties, many of which aren’t relevant to them. Typically, two types of information are managed:

  1. Entities: plain objects, glossary entries, definitions, models, policies, descriptions, etc.
  2. Properties: the attributes that you put on the entities (any additional information such as create date, last updated date, etc.)

The design of the metamodel must serve the data consumer. It needs to be adapted to new business cases and must be simple enough to manage for users to maintain and understand it. Bonus points if it is easy to create new types of objects and sets of attributes!
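
As an illustration, an adaptive metamodel could be sketched as follows. The class names (`Metamodel`, `EntityType`, `PropertyDef`) and the example types are assumptions made for this sketch, not the data model of any particular product. The point is that new object types and attribute sets are created at runtime instead of being hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyDef:
    name: str
    value_type: str  # generic type such as "text", "number" or "date"


@dataclass
class EntityType:
    name: str
    properties: dict = field(default_factory=dict)

    def add_property(self, name, value_type):
        self.properties[name] = PropertyDef(name, value_type)


class Metamodel:
    """Registry of entity types that can be extended at runtime."""

    def __init__(self):
        self.entity_types = {}

    def define(self, name, **properties):
        # `properties` maps a property name to its generic value type.
        entity_type = EntityType(name)
        for prop_name, value_type in properties.items():
            entity_type.add_property(prop_name, value_type)
        self.entity_types[name] = entity_type
        return entity_type


metamodel = Metamodel()
dataset = metamodel.define("dataset", created="date", last_updated="date")

# A new business case appears: add an object type and a property, no code change.
metamodel.define("ml_model", accuracy="number")
dataset.add_property("owner", "text")

print(sorted(metamodel.entity_types))  # ['dataset', 'ml_model']
print(sorted(dataset.properties))      # ['created', 'last_updated', 'owner']
```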

 

Semantic attributes

Most of the time, in a data catalog, the metamodel’s attributes are technical properties. The attributes on an object have generic types such as text, number, date, list of values, and so on. While this information is necessary, it is not sufficient, because generic types say nothing about the semantics, or meaning, of an attribute. This matters because, with semantic information, the catalog can adapt how the attribute is displayed and improve its suggestions to users.

In conclusion, there is no one-size-fits-all design for a data catalog, and the design must evolve over time to support new data areas and use cases.
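
A small sketch shows what a semantic type adds on top of a generic type. The `Attribute` class, the semantic type names (`email`, `percentage`) and the `render` function are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    name: str
    generic_type: str                    # "text", "number", "date", ...
    semantic_type: Optional[str] = None  # what the value means, if known


def render(attribute, value):
    """Pick a display format from the semantic type, else fall back to plain text."""
    if attribute.semantic_type == "email":
        return f"mailto:{value}"
    if attribute.semantic_type == "percentage":
        return f"{value * 100:.0f}%"
    return str(value)


contact = Attribute("steward_contact", "text", semantic_type="email")
completeness = Attribute("completeness", "number", semantic_type="percentage")
row_count = Attribute("row_count", "number")

print(render(contact, "steward@example.com"))  # mailto:steward@example.com
print(render(completeness, 0.25))              # 25%
print(render(row_count, 1200))                 # 1200
```

Both `completeness` and `row_count` are numbers, but only the semantic type tells the catalog that one of them should be shown as a percentage.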

A smart user experience

 As stated above, a data catalog holds a lot of information and end-users often struggle to find the information of interest to them. Expectations differ between profiles! A data scientist will expect statistical information, whereas a compliance officer expects information on various regulatory policies. 

With a smart and adaptive user experience, a data catalog presents the most relevant information to each end-user. In a smart data catalog, the information hierarchy and the adjusted search results are based on:

  • Static preferences: already known in the data catalog if the profile is more focused on data science, IT, etc.
  • Dynamic profiling: to learn what the end-user usually searches, their interests, and how they’ve used the catalog in the past.
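
One way to picture how the two signals combine is a simple scoring function. The sketch below is an assumption-laden illustration: the roles, section names, weights and the `UserProfile` class are invented for the example, and a real catalog would likely use a richer model.

```python
from collections import Counter

# Static preferences: weights known in advance for each profile (hypothetical values).
STATIC_PREFERENCES = {
    "data_scientist": {"statistics": 2.0, "schema": 1.0},
    "compliance_officer": {"policies": 2.0, "ownership": 1.0},
}


class UserProfile:
    def __init__(self, role):
        self.role = role
        self.viewed = Counter()  # dynamic profiling: what the user actually opens

    def record_view(self, section):
        self.viewed[section] += 1

    def rank_sections(self, sections):
        """Order sections by static weight plus the share of past views."""
        static = STATIC_PREFERENCES.get(self.role, {})
        total_views = sum(self.viewed.values()) or 1

        def score(section):
            return static.get(section, 0.0) + self.viewed[section] / total_views

        return sorted(sections, key=score, reverse=True)


sections = ["policies", "schema", "statistics", "lineage"]
profile = UserProfile("data_scientist")
print(profile.rank_sections(sections))
# ['statistics', 'schema', 'policies', 'lineage']

# After the user repeatedly opens lineage, it moves ahead of policies.
for _ in range(3):
    profile.record_view("lineage")
print(profile.rank_sections(sections))
```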

A smart inventory system

A data catalog’s adoption is built on trust, and trust can only come if its content is accurate. As the data landscape moves at a fast pace, the catalog must be connected to operational systems to maintain the first level of metadata on your data assets.

The catalog must synchronize its content with the actual content of the operational systems.

A catalog’s typical architecture relies on scanners that scan your operational systems and synchronize information from various sources (Big Data, NoSQL, Cloud, Data Warehouse, etc.). The idea is to have universal connectivity so that enterprises can scan any type of system automatically and place its assets in the knowledge graph.

In Zeenea, there is an automation layer to bring back the information from the systems to the catalog. It can:

  • Update assets to reflect physical changes
  • Detect deleted or moved assets
  • Resolve links between objects
  • Apply rules to select the appropriate set of attributes and define attribute values 
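
The first two points can be sketched as a comparison between what the catalog currently holds and what a scanner just found. This is a simplified illustration, not Zeenea's automation layer: the `synchronize` function and the asset identifiers are hypothetical, and detecting moved assets or resolving links is left out.

```python
def synchronize(catalog, scanned):
    """Compare catalog entries with a fresh scan.

    Both arguments map an asset identifier to a metadata dictionary.
    Returns a report of what was added, updated or deleted at the source.
    """
    report = {"added": [], "updated": [], "deleted": []}
    for asset_id, metadata in scanned.items():
        if asset_id not in catalog:
            report["added"].append(asset_id)
        elif catalog[asset_id] != metadata:
            report["updated"].append(asset_id)
    for asset_id in catalog:
        if asset_id not in scanned:
            report["deleted"].append(asset_id)
    return report


# Hypothetical state of the catalog and result of the latest scan.
catalog = {
    "db.orders": {"columns": ["id", "total"]},
    "db.legacy": {"columns": ["id"]},
}
scanned = {
    "db.orders": {"columns": ["id", "total", "currency"]},
    "db.customers": {"columns": ["id", "name"]},
}

print(synchronize(catalog, scanned))
# {'added': ['db.customers'], 'updated': ['db.orders'], 'deleted': ['db.legacy']}
```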

A smart search engine

In a data catalog, the search engine is one of the most important features. We distinguish between two kinds of searches:

  • High intent search: the end-user already knows what they are looking for and has precise information for their query. They either already have the name of the dataset or already know where it is found. High intent searches are commonly used by more data-savvy people.
  • Low intent search: the end-user isn’t exactly sure what they are looking for, but wants to discover what they could use in their context. Searches are made through keywords and users expect the most relevant results to appear.

 A smart data catalog must support both types of searches!

It must also provide smart filtering. Filtering is a necessary complement to the user’s search experience (especially for low intent searches), allowing them to narrow their search results by excluding attributes that aren’t relevant. As on sites such as Google, Booking.com, and Amazon, the filtering options must be adapted to the content of the search and to the user’s profile so that the most pertinent results appear.
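
The two kinds of search, and filters that adapt to the results, can be sketched in a few lines. The sample assets and the `search` and `facets` functions are hypothetical, and the keyword scoring is deliberately naive compared with a real search engine.

```python
from collections import Counter

# Hypothetical catalog content, for illustration only.
ASSETS = [
    {"name": "orders_2020", "type": "dataset",
     "description": "customer orders and invoices"},
    {"name": "sales_dashboard", "type": "visualization",
     "description": "monthly sales by customer segment"},
    {"name": "customers", "type": "dataset",
     "description": "customer master data"},
]


def search(query, assets):
    # High intent: the query is the exact name of an asset.
    exact = [a for a in assets if a["name"].lower() == query.lower()]
    if exact:
        return exact
    # Low intent: rank assets by how often the keywords appear.
    terms = query.lower().split()
    scored = []
    for asset in assets:
        text = f'{asset["name"]} {asset["description"]}'.lower()
        words = text.replace("_", " ").split()
        score = sum(words.count(term) for term in terms)
        if score:
            scored.append((score, asset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [asset for _score, asset in scored]


def facets(results, attribute):
    """Count the values of `attribute` in the results to build filter options."""
    return Counter(asset[attribute] for asset in results)


print([a["name"] for a in search("orders_2020", ASSETS)])  # ['orders_2020']
results = search("customer", ASSETS)
print(facets(results, "type"))
```

Because the facets are computed from the current results, the filter options change with the query instead of being a fixed list.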

Smart metadata management

Smart metadata management is usually what we call the “augmented data catalog”: a catalog with machine learning capabilities that enable it to detect certain types of data and to apply tags or statistical rules to data.

One way to make metadata management smart is to apply data pattern recognition. Data pattern recognition means identifying similar assets by relying on statistical algorithms and ML capabilities derived from other pattern recognition systems.

This data pattern recognition system helps data stewards set their metadata:

  • Identify duplicates and copy metadata
  • Detect logical data types (emails, cities, addresses, and so on)
  • Suggest attribute values (recognize documentation patterns to apply to a similar object or a new one)
  • Suggest links – semantic or lineage links
  • Detect potential errors to help improve the catalog’s quality and relevance
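
As a simple illustration of detecting logical data types, the sketch below samples the values of a field and checks them against known patterns. It uses plain regular expressions as a stand-in for the statistical and ML techniques mentioned above; the `PATTERNS` table, the `detect_logical_type` function and the 80% threshold are assumptions made for this example.

```python
import re

# Simplified patterns; real-world detection would need to be more robust.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def detect_logical_type(values, threshold=0.8):
    """Return the first logical type matched by at least `threshold` of the samples."""
    samples = [v for v in values if v]  # ignore empty values
    if not samples:
        return None
    for logical_type, pattern in PATTERNS.items():
        matches = sum(1 for v in samples if pattern.match(v))
        if matches / len(samples) >= threshold:
            return logical_type
    return None


print(detect_logical_type(["a@example.com", "b@example.org", "", "c@example.net"]))
# email
print(detect_logical_type(["2020-01-01", "2020-02-15"]))
# iso_date
print(detect_logical_type(["Paris", "Lyon"]))
# None
```

The detected type can then be offered to the data steward as a suggested tag rather than applied silently.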

It also helps data consumers find their assets. The idea is to use some techniques that are derived from content-based recommendations found in general-purpose catalogs. When the user has found something, the catalog will suggest alternatives based both on their profile and pattern recognition.  

Start your data catalog journey with Zeenea

Zeenea is a 100% cloud-based solution, available anywhere in the world with just a few clicks. By choosing Zeenea Data Catalog, you control the costs associated with implementing and maintaining a data catalog while simplifying access for your teams.

The automatic feeding mechanisms, as well as the suggestion and correction algorithms, reduce the overall costs of a catalog and provide your data teams with quality information in record time.