Data Revolutions: Towards a Business Vision of Data

The use of massive data by the internet giants in the 2000s was a wake-up call for enterprises: Big Data is a lever for growth and competitiveness that encourages innovation. Today, enterprises are reorganizing themselves around their data in order to adopt a “data-driven” approach. It is a story with several twists and turns that is finally approaching a resolution.

This article reviews the successive data revolutions that enterprises have undertaken, from the 1980s to today, in an attempt to maximize the business value of data.

SILOED ARCHITECTURES

In the 80s, information systems developed immensely. Business applications were created, advanced programming languages emerged, and relational databases appeared. All these applications stayed on their owners’ platforms, isolated from the rest of the IT ecosystem.

For these historical and technological reasons, an enterprise’s internal data were spread across various technologies and heterogeneous formats. On top of these technical constraints came organizational ones, often described as a tribal effect: each IT department has its own tools and, implicitly, manages its own data for its own uses. The result is a form of data hoarding within organizations. Conway’s law is frequently cited to explain this: “All architecture reflects the organization that created it.” This organization into silos makes cross-referencing data from two different systems complex and onerous.

The search for a centralized and comprehensive view of an enterprise’s data led information systems to a new revolution.

THE CONCEPT OF A DATA WAREHOUSE

By the end of the 90s, Business Intelligence was in full swing. For analytical purposes and with the goal of responding to all strategic questions, the concept of a data warehouse appeared. 

To build one, data are extracted from mainframes or relational databases and passed through an ETL (Extract, Transform, Load) process. Once the data have been projected into a so-called pivot format, analysts and decision-makers can access data that were collected and formatted to answer pre-established questions and specific analytical cases. The question determines the data model!

This revolution came with its own problems. ETL tools have a significant cost, not to mention the hardware that comes with them. The time that elapses between formalizing a need and receiving the report is long. It is a costly revolution for an efficiency that leaves room for improvement.
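
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source extracts, the field names, and the `sales` table are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical extracts from two siloed systems with different shapes and formats.
CRM_ROWS = [
    {"client_id": "C1", "country": "fr", "revenue_eur": "1200.50"},
    {"client_id": "C2", "country": "de", "revenue_eur": "800.00"},
]
BILLING_ROWS = [
    ("C3", "FR", 450.0),
    ("C4", "ES", 300.0),
]


def to_pivot(crm_rows, billing_rows):
    """Transform: project both sources onto one pivot schema (customer, country, revenue)."""
    for row in crm_rows:
        yield (row["client_id"], row["country"].upper(), float(row["revenue_eur"]))
    for customer, country, revenue in billing_rows:
        yield (customer, country.upper(), float(revenue))


def load(conn, pivot_rows):
    """Load: write the pivot rows into a table modeled around one question."""
    conn.execute("CREATE TABLE sales (customer TEXT, country TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", pivot_rows)


def revenue_by_country(conn):
    """The pre-established question the data model was designed to answer."""
    cursor = conn.execute(
        "SELECT country, SUM(revenue) FROM sales GROUP BY country ORDER BY country"
    )
    return dict(cursor.fetchall())


warehouse = sqlite3.connect(":memory:")
load(warehouse, to_pivot(CRM_ROWS, BILLING_ROWS))
report = revenue_by_country(warehouse)
```

The table only answers the question it was modeled for; a new question typically means a new model and a new ETL job.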

THE NEW REVOLUTION OF A DATA LAKE…

The arrival of data lakes reverses the previous reasoning. A data lake enables organizations to centralize all useful data, regardless of source or format, at a very low cost. An enterprise’s data are stored without presuming how they will be used in a future use case. Only when a specific use arises are these raw data selected and transformed into strategic information.

We are moving from an “a priori” to an “a posteriori” logic. The data lake revolution relies on new skills and knowledge: data scientists and data engineers can launch data processing and produce results much faster than was possible with data warehouses.

Another advantage of this Promised Land is its price. Often available as open source, data lakes are cheap, including the hardware that comes with them. This is often referred to as commodity hardware.
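
A minimal sketch of this “a posteriori” logic, assuming raw records are kept as JSON lines. The records, the `read_with_schema` helper, and the field names are hypothetical; the point is only that the schema is chosen at read time, by the use case.

```python
import json

# Raw records are stored exactly as received, with no schema imposed at write time.
RAW_LAKE = [
    '{"source": "web", "user": "u1", "page": "/pricing", "ms": 5300}',
    '{"source": "crm", "customer": "u1", "segment": "smb"}',
    '{"source": "web", "user": "u2", "page": "/docs", "ms": 1200}',
]


def read_with_schema(raw_lines, source, fields):
    """Select and shape raw records only when a use case asks for them."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            yield {name: record.get(name) for name in fields}


# A use case defined after the data were stored: time spent per page.
page_views = list(read_with_schema(RAW_LAKE, "web", ["page", "ms"]))
```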

… OR RATHER, A DATA SWAMP

The data lake revolution brings real advantages, but also new challenges. The expertise needed to instantiate and maintain these data lakes is rare and therefore costly for enterprises. Additionally, pouring data into a data lake day after day without efficient management or organization carries a serious risk of rendering the infrastructure unusable. Data are then inevitably lost in the mass.

This data management is accompanied by new issues related to data regulation (GDPR, CNIL, etc.) and data security, topics that already existed in the data warehouse world. Finding the right data for the right use is still not easy.

THE SETTLEMENT: CONSTRUCTING DATA GOVERNANCE

The internet giants understood that centralizing data is a necessary first step, but an insufficient one. The last brick needed to move towards a “data-driven” approach is data governance. Innovating through data requires greater knowledge of these data. Where are my data stored? Who uses them? With which goal in mind? How are they being used?

To help data professionals chart and visualize the data life cycle, new tools have appeared: “data catalogs.” Sitting above data infrastructures, they let you build a searchable metadata directory. By centralizing all collected information, they provide both a business and a technical view of data. In the same way that Google does not store web pages but rather their metadata in order to reference them, companies must store their data’s metadata to make the data easier to discover and exploit. Gartner confirms this in its study “Data Catalog is the New Black”: a data lake without metadata management and governance will be considered inefficient.

Thanks to these new tools, data become an asset for all employees. Their easy-to-use interfaces do not require technical skills and offer a simple way to know, organize, and manage data. The data catalog becomes the reference collaborative tool in the enterprise.

It thus becomes possible to acquire an all-round view of these data and to start a data governance that drives ideation.

How Artificial Intelligence enhances data catalogs

Can machines think? We are talking about artificial intelligence, “the biggest myth of our time”!

A simple definition of AI could be: “a set of applied theories and techniques to create machines capable of simulating intelligence.” Among these techniques is deep learning, a machine learning method used to process data.

Data must be understood and accessible. With the help of an intelligent data catalog, data users such as data scientists can easily search for and efficiently choose the right data sets for their machine learning algorithms.

Let’s see how.

 

SEARCH ENGINE: FACILITATING DATASET RESEARCH

By connecting to all of an enterprise’s data sources, a data catalog can efficiently collect as much documentation as possible (otherwise known as metadata) from its storage systems. This information, indexed and filterable in Zeenea’s search engine, allows data users to quickly find the data sets they need for their work.
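
To illustrate what “indexed and filterable” can mean, here is a minimal sketch of an inverted index over catalog metadata. It is not a description of Zeenea’s search engine: the catalog entries, the `build_index` and `search` helpers, and the `source` filter are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog entries: metadata only, never the data themselves.
CATALOG = {
    "sales_2023": {"description": "Monthly sales by country", "source": "warehouse", "tags": ["finance"]},
    "web_logs": {"description": "Raw web traffic logs", "source": "lake", "tags": ["marketing"]},
    "customers": {"description": "Customer master data by country", "source": "warehouse", "tags": ["crm"]},
}


def build_index(catalog):
    """Map each lowercase word of the name, description, and tags to data set names."""
    index = defaultdict(set)
    for name, meta in catalog.items():
        words = meta["description"].lower().split()
        words += [tag.lower() for tag in meta["tags"]]
        words.append(name.lower())
        for word in words:
            index[word].add(name)
    return index


def search(index, catalog, query, source=None):
    """Return data sets matching every query term, optionally filtered by source."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    if source is not None:
        hits = {name for name in hits if catalog[name]["source"] == source}
    return sorted(hits)


INDEX = build_index(CATALOG)
```

For example, `search(INDEX, CATALOG, "country", source="warehouse")` returns the two warehouse data sets whose description mentions a country.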

RECOMMENDATION SYSTEM

 

GUIDING DATA SCIENTISTS IN THEIR CHOICES

An intelligent data catalog relies on “fingerprinting” technology. This feature recommends to data users the data sets most relevant to their projects, based on, among other things:

  • How the data is used,
  • The quality and scoring of the documentation,
  • The user’s previous searches,
  • What other users search for.
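
One way to picture how such signals could be combined is a weighted score per data set. The sketch below is an assumption made for illustration, not the catalog’s actual algorithm: the `usage` and `doc_score` fields, the tag matching, and the equal weights are all hypothetical.

```python
# Hypothetical signals per data set: usage count, documentation score (0 to 1), tags.
DATASETS = {
    "sales_2023": {"usage": 40, "doc_score": 0.9, "tags": {"finance", "sales"}},
    "web_logs": {"usage": 100, "doc_score": 0.2, "tags": {"marketing", "web"}},
    "customers": {"usage": 10, "doc_score": 0.7, "tags": {"crm", "sales"}},
}


def score(meta, max_usage, my_terms, others_terms, weights):
    """Weighted sum of the four signals, each normalized to the range 0 to 1."""
    w_usage, w_doc, w_mine, w_others = weights
    tag_count = len(meta["tags"]) or 1  # avoid dividing by zero for untagged data sets
    usage = meta["usage"] / max_usage if max_usage else 0.0
    mine = len(meta["tags"] & my_terms) / tag_count
    others = len(meta["tags"] & others_terms) / tag_count
    return w_usage * usage + w_doc * meta["doc_score"] + w_mine * mine + w_others * others


def recommend(datasets, my_terms, others_terms, weights=(0.25, 0.25, 0.25, 0.25)):
    """Rank data set names from most to least relevant for one user."""
    max_usage = max(meta["usage"] for meta in datasets.values())
    scored = {
        name: score(meta, max_usage, my_terms, others_terms, weights)
        for name, meta in datasets.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


ranking = recommend(DATASETS, my_terms={"sales"}, others_terms={"finance", "crm"})
```

With these sample values, the heavily used but poorly documented `web_logs` ranks last: usage alone does not make a data set a good recommendation.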

GIVE MORE MEANING TO A DATA SET

This feature offers users responsible for a particular data set suggestions for its documentation. These recommendations can, for example, propose tags, contacts, or even business terms already associated with other data sets, based on:

  • Analysis of the data itself (statistical analysis),
  • Similarity between its schema and those of other data sets,
  • Links to the fields of other data sets.

Automatically contextualizing data sets in a data catalog allows any data user to work with data that is understood and appropriate for their use cases.
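
As a rough illustration of the schema-resemblance signal, the sketch below compares column names with a Jaccard similarity and proposes the tags of sufficiently similar, already documented data sets. Jaccard similarity is a stand-in chosen for this example, not the fingerprinting technology itself; the data sets and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Share of column names two schemas have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data sets that are already documented in the catalog.
DOCUMENTED = {
    "customers": {"columns": {"customer_id", "email", "country"}, "tags": {"crm", "personal-data"}},
    "orders": {"columns": {"order_id", "customer_id", "amount"}, "tags": {"finance"}},
}


def suggest_tags(new_columns, documented, threshold=0.5):
    """Propose the tags of every documented data set whose schema is similar enough."""
    suggestions = set()
    for meta in documented.values():
        if jaccard(new_columns, meta["columns"]) >= threshold:
            suggestions |= meta["tags"]
    return suggestions


suggested = suggest_tags({"customer_id", "email", "country", "phone"}, DOCUMENTED)
```

The suggestions remain proposals: the person responsible for the data set decides whether to accept them.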

 

AUTOMATIC DATA SET LINKING: VISUALIZING DATA LIFE CYCLE

As mentioned above, fingerprinting technology allows a data catalog to recognize related data sets and link them together. This is known as data lineage: a visual representation of data life cycles.
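
Behind the visual representation, lineage can be thought of as a graph of “derived from” links. The sketch below assumes such links are already known and simply walks them upstream; the data set names and the `upstream` helper are hypothetical.

```python
# Hypothetical lineage links: each data set maps to the data sets it was derived from.
DERIVED_FROM = {
    "sales_report": ["sales_clean"],
    "sales_clean": ["sales_raw", "customers"],
    "sales_raw": [],
    "customers": [],
}


def upstream(dataset, derived_from):
    """Return every data set that the given one depends on, directly or indirectly."""
    seen = []
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:  # also protects against cycles in the links
                seen.append(parent)
                stack.append(parent)
    return seen


sources_of_report = upstream("sales_report", DERIVED_FROM)
```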

 

AUTOMATIC ERROR DETECTION: BE AWARE OF ERRORS IN DATA SETS

To overcome potential data interpretation problems, an intelligent data catalog must be able to automatically detect errors or inconsistencies in the quality and documentation of any data.

This key feature, based on the analysis of the data or their documentation, must alert data users to integrity problems.
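
A minimal sketch of such a check, assuming the documentation records an expected type per column. The sample rows, the documented types, and the 20% missing-value threshold are hypothetical choices for this example.

```python
# Hypothetical documentation: the type each column is documented to contain.
DOCUMENTED_TYPES = {"customer_id": int, "country": str, "revenue": float}

ROWS = [
    {"customer_id": 1, "country": "FR", "revenue": 120.0},
    {"customer_id": 2, "country": None, "revenue": 80.5},
    {"customer_id": "3", "country": "DE", "revenue": 42.0},
]


def detect_issues(rows, documented_types, max_null_rate=0.2):
    """Compare sample rows with their documentation and list what looks wrong."""
    issues = []
    for column, expected in documented_types.items():
        values = [row.get(column) for row in rows]
        nulls = sum(value is None for value in values)
        if rows and nulls / len(rows) > max_null_rate:
            issues.append(f"{column}: {nulls}/{len(rows)} values missing")
        mismatches = sum(
            value is not None and not isinstance(value, expected) for value in values
        )
        if mismatches:
            issues.append(
                f"{column}: {mismatches} value(s) do not match documented type {expected.__name__}"
            )
    return issues


alerts = detect_issues(ROWS, DOCUMENTED_TYPES)
```

Here the check reports one `customer_id` stored as text and a `country` column with too many missing values, which is the kind of alert a data user would want to see before relying on the data set.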

 

GDPR NOTIFICATION: BE NOTIFIED OF SENSITIVE DATA

An intelligent data catalog must be able to detect personal or private data in any given data set and flag them in its interface. This feature helps enterprises meet the GDPR requirements that have applied since May 2018, and alerts potential users to the sensitivity level of the data and to how the data may be used.
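
As a simple illustration, detection can start from heuristics on column names and sampled values. The sketch below is deliberately naive and is an assumption of this example, not a compliance tool: the name hints, the simplified email pattern, and the sample columns are hypothetical, and real detection needs far broader rules and human review.

```python
import re

# Deliberately simple pattern: something@something.something, without spaces.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Hypothetical hints that a column name may refer to personal data.
SENSITIVE_NAME_HINTS = ("email", "phone", "birth", "address", "name")


def flag_personal_columns(sample):
    """Given column name -> sampled values, return flagged columns with the reasons."""
    flagged = {}
    for column, values in sample.items():
        reasons = []
        if any(hint in column.lower() for hint in SENSITIVE_NAME_HINTS):
            reasons.append("column name")
        strings = [value for value in values if isinstance(value, str)]
        if strings and all(EMAIL.match(value) for value in strings):
            reasons.append("values look like email addresses")
        if reasons:
            flagged[column] = reasons
    return flagged


SAMPLE = {
    "contact": ["ana@example.com", "li@example.org"],
    "birth_date": ["1990-04-02", "1985-11-23"],
    "amount": [12.5, 99.0],
}
flags = flag_personal_columns(SAMPLE)
```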

 

The role of metadata in a data-driven strategy

We are convinced that a company must strike a balance between control and flexibility in the use of data. In short, companies must be able to adopt a data strategy that is both encouraging and easy to use, all while minimizing risks.

We are convinced that such governance is achievable if your employees are able to answer these few questions:

 

  • What data are present in our organization?
  • Are these data sufficiently documented to be understood and mastered by the employees in our organization?
  • Where do they come from?
  • Are they secure?
  • What rules or restrictions apply to them?
  • Who are the people in charge? Who are the “knowers”?
  • Who uses these data? How?
  • How can employees access them?

These metadata (information about data) become strategic information within enterprises. They describe various technical, operational or business aspects of the data you have.

By building a unified metadata repository that is both centralized and accessible, you help ensure precise data that are consistent and understood by the entire enterprise.

 

THE BENEFITS OF A METADATA REPOSITORY

We draw on our experience to promote a governance grounded in metadata management. We are firmly convinced that we cannot govern what we do not know! Building a metadata repository therefore provides a solid working base from which to start governing your data.

Among other things, it allows you to:

  • Curate your data assets;
  • Assign roles and responsibilities on your referenced data;
  • Let your employees complete the documentation collaboratively;
  • Strengthen your regulatory compliance.

Concentrating efforts on metadata and creating such a frame of reference is a key characteristic of an agile approach to data governance.

 

Data catalog: a self-service data platform

A data catalog is a portal that brings together metadata on the data sets collected by the enterprise. This classified and organized information lets data users (re)find relevant data sets for their work.

A new wave of data catalogs has appeared on the market. Their purpose is to commit an enterprise to a data-driven approach. Any authorized person in the enterprise must be able to access, understand, and contribute to data documentation, without needing technical skills. This is what we call self-service data.

Zeenea has identified four characteristics that a new-generation data catalog must have. It must be:

  • An enterprise’s data catalog. A data catalog must be connected to all of the enterprise’s data sources in order to collect and group all metadata in a single centralized location and avoid the multiplication of tools.

  • A catalog of connected data. We believe that a data catalog must always be up to date and accurate in the information it provides in order to be useful for its users. By being connected to data sources, the data catalog can import the documentation from storage systems and keep metadata automatically updated in both places (storage systems and data catalog).

  • A collaborative data catalog. In a user-centric approach, a data catalog must be the reference data tool of an enterprise. By involving employees through collaborative features, the enterprise benefits from collective intelligence. Sharing, assigning, commenting, and qualifying within the same data catalog increases productivity and knowledge among all employees.

  • An intelligent data catalog. A data catalog equipped with artificial intelligence, for example for the auto-population of metadata, allows your data managers to become more efficient.
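
To make the “connected” characteristic concrete, here is a minimal sketch of a one-way synchronization from a source schema into a catalog entry. It assumes a simple dictionary layout for the entry; the `sync_schema` helper, the column names, and the types are hypothetical and do not describe any product’s connector.

```python
def sync_schema(catalog_entry, source_columns):
    """Align the documented columns with the source, keeping human-written descriptions."""
    documented = catalog_entry["columns"]
    added = sorted(set(source_columns) - set(documented))
    removed = sorted(set(documented) - set(source_columns))
    for column in added:
        documented[column] = {"type": source_columns[column], "description": ""}
    for column in removed:
        del documented[column]
    for column, column_type in source_columns.items():
        documented[column]["type"] = column_type  # technical metadata follows the source
    return {"added": added, "removed": removed}


# Hypothetical catalog entry and the schema currently observed in the storage system.
entry = {
    "columns": {
        "id": {"type": "int", "description": "Customer key"},
        "fax": {"type": "text", "description": "Legacy contact number"},
    }
}
observed = {"id": "bigint", "email": "text"}
changes = sync_schema(entry, observed)
```

Technical metadata follow the source automatically, while descriptions written by employees are preserved for columns that still exist.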

These characteristics will be the subject of more in-depth articles.

The 3 types of metadata to master to be a data-centric enterprise

Metadata is structured information that describes, explains, tracks, and facilitates the access, use, and management of an information resource. The most frequently cited definition is “data about data.” In a data-centric approach, what types of metadata must an enterprise make available so that data consumers become more autonomous and productive?

OUR DEFINITION OF METADATA

Metadata are contextualized data. In other words, they answer the “who, what, where, when, why, and how” of a data set. They must enable both IT and business teams to understand and work on relevant and quality data.

WHAT ARE THE 3 TYPES OF METADATA?

At Zeenea, we distinguish three types of metadata within our data catalog. Here are some examples:

  • Technical metadata describe the structure of a data set and its storage information.
  • Business metadata apply a business context to data sets: descriptions (context and use), owners and referents, and tags and properties, with the goal of creating a taxonomy over the data sets that is indexed by our search engine. Business metadata are also present at the schema level of a data set: descriptions, tags, and even the confidentiality level of the data by column.
  • Operational metadata make it possible to understand when and how the data were created or transformed: statistical analysis of the data, date of update, provenance (lineage), volume, cardinality, the processing operations that created or transformed the data, the status of those operations, etc.
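
The three types can be pictured as three groups of fields attached to the same catalog entry. The sketch below is an illustrative data model only: the class names, the fields, and the sample values are assumptions for this example, not the structure of Zeenea’s catalog.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TechnicalMetadata:
    storage: str   # where the data set is stored
    columns: dict  # column name -> column type


@dataclass
class BusinessMetadata:
    description: str
    owner: str
    tags: list = field(default_factory=list)
    column_confidentiality: dict = field(default_factory=dict)  # column -> level


@dataclass
class OperationalMetadata:
    last_updated: date
    row_count: int
    derived_from: list = field(default_factory=list)  # provenance (lineage)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata

    def confidential_columns(self):
        """Columns whose business metadata mark them as confidential."""
        levels = self.business.column_confidentiality
        return sorted(column for column, level in levels.items() if level == "confidential")


customers = CatalogEntry(
    name="customers",
    technical=TechnicalMetadata(storage="warehouse", columns={"id": "int", "email": "text"}),
    business=BusinessMetadata(
        description="Customer master data",
        owner="sales operations",
        tags=["crm"],
        column_confidentiality={"id": "internal", "email": "confidential"},
    ),
    operational=OperationalMetadata(last_updated=date(2024, 1, 15), row_count=1200),
)
```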

CONCLUSION

Metadata management is an integral part of an enterprise’s agile data governance strategy. Maintaining an up-to-date metadata directory ensures that data consumers can use reliable and relevant data for their use cases.

Who are Data Stewards?

Digital transformations bring new challenges to the data industry. We increasingly talk about data stewardship, an activity focused on managing and documenting an organization’s data. In this article, we present data stewards, the enterprise’s true guardians of data, and take a closer look at their role, their missions, and their tools.

This article summarizes interviews conducted with more than 25 data stewards in medium-sized and large French enterprises. The goal was to understand their tasks and their difficulties in metadata management in order to provide solutions within our data catalog.

THE DATA STEWARD’S ROLE IN THE ENTERPRISE

Enterprises are reorganizing themselves around their data to produce value and finally innovate from this raw material. Data stewards are there to orchestrate the data held in the enterprise’s systems. They must ensure that data are properly documented and make them available to their users, such as data scientists or project managers. Their communication skills enable them to identify data managers and knowers, and to collect the associated information in order to centralize it and preserve this knowledge within the enterprise. In short, data stewards provide metadata: a structured set of information describing data sets. They turn abstract data into concrete assets for the business.

The profession is on the rise! It deals with trending topics, and its social dimension has data stewards working with both technical and business people. Data stewards are the first point of reference for data in the enterprise and serve as the entry point for accessing data. They have both technical and business knowledge of data, which is why they are called “masters of data” within an organization!

DATA STEWARD MISSIONS

Their objective is quite clear: a data steward must take part in the enterprise’s data governance. That means finding and understanding data, imposing a certain discipline in metadata management, and making data available to their users.

These are just some of the subjects that data stewards must address. To do so, they must ensure that the data documentation they manage is well maintained. They are free to choose the method and format of technical and business documentation. Their days are punctuated by the search for data managers and knowers, whose input enriches the knowledge gathered in a tool that technical and business users can exploit. Their aim is for the actors of data projects to connect and collaborate in order to improve information sharing and productivity for all.

 

EQUIP DATA STEWARDS

Data stewardship is therefore a new profession whose missions still need clarifying, whose tools have yet to be identified, and whose necessity within the enterprise still has to be evangelized. As a result, enterprises still have difficulty allotting a clear budget to it, and data stewards struggle to be properly equipped to ensure the control and management of data.

Yet, when data stewards are well equipped, they can:

 

  • become autonomous in data management activities,

  • centralize information collected on the data,

  • manage obsolescence of documentation,

  • report errors and/or changes to data,

  • identify relevant data to send to their users,

  • expose data to their users from a collaborative tool.

Such an approach can be successful where many larger “data governance” initiatives have failed.

 

IN CONCLUSION

To this day, we are convinced that the data steward role is indispensable for building and orchestrating efficient data governance in the enterprise. This is the direction Zeenea is taking by offering dynamic and connected documentation of the enterprise’s data. Such tools, otherwise known as data catalogs, aim to become the reference tool for data stewards: to manage data in a user-friendly way, to centralize all collected metadata, to open data to users according to their level of sensitivity, and to manage data quality, all in one click.

In a virtuous circle, the data catalog brings increasing value to data users as the data steward industrializes the addition of metadata and the contributions of employees in the tool.