Zeenea Product Recap: A look back at 2023

2023 was another big year for Zeenea. With more than 50 releases and updates to our platform, these past 12 months were filled with lots of new and improved ways to unlock the value of your enterprise data assets. Indeed, our teams consistently work on features that simplify and enhance the daily lives of your data and business teams.

In this article, we’re thrilled to share with you some of our favorite features from 2023 that enabled our customers to:

  • Decrease data search and discovery time
  • Increase Data Steward productivity & efficiency
  • Deliver trusted, secure, and compliant information across the organization
  • Enable end-to-end connectivity with all their data sources

Decrease data search and discovery time

 

One of Zeenea’s core values is simplicity. We strongly believe that data discovery should be quick and easy to accelerate data-driven initiatives across the entire organization.

In fact, many data teams still struggle to find the information they need for a report or use case: either the data is scattered across various sources, files, or spreadsheets, or they are confronted with such an overwhelming amount of information that they don’t know where to begin their search.

In 2023, we designed our platform with simplicity in mind. By providing easy and quick ways to explore data, Zeenea enabled our customers to find, discover, and understand their assets in seconds.

A fresh new look for the Zeenea Explorer

 

One of the first ways our teams wanted to enhance the discovery experience of our customers was by providing a more user-friendly design to our data exploration application, Zeenea Explorer. This redesign included:

New Homepage

 

Our homepage needed a brand-new look and feel for a smoother discovery experience. Indeed, for users who don’t know what they are looking for, we added brand-new exploration paths directly accessible via the Zeenea Explorer homepage.

 

  • Browsing by Item Type: If users already know the type of data asset they are looking for, such as a dataset, visualization, data process, or custom asset, they can directly access the catalog pre-filtered on that asset type.
  • Browsing through the Business Glossary: Users can quickly navigate through the enterprise’s Business Glossary by directly accessing the Glossary assets that were defined or imported by stewards in Zeenea Studio.
  • Browsing by Topic: The app enables users to browse through a list of Items that represent a specific theme, use case, or anything else that is relevant to business (more information below).

New Item Detail Pages

 

To help users understand a catalog Item at a glance, one of the first notable changes was the position of the Item’s tabs. The tabs were originally positioned on the left-hand side of the page, which took up a lot of space. Now, the tabs are at the top of the page, more closely reflecting the layout of the Studio app. This new layout allows data consumers to find the most significant information about an Item, such as:

  • The highlighted properties, defined by the Data Steward in the Catalog Design,
  • Associated Glossary terms, to understand the context of the Item,
  • Key people, to quickly reach the contacts that are linked to the Item.

In addition, our new layout allows users to find all fields, metadata, and other related items instantly. Information that was divided into three separate tabs in the old version is now gathered in a single “Details” tab, where data consumers find the Item’s description and all related Items. Depending on the Item Type you are browsing through, all fields, inputs & outputs, parent/children Glossary Items, implementations, and other metadata are in the same section, saving you precious data discovery time.

Lastly, the spaces for our graphical components were made larger – users now have more room to see their Item’s lineage, data model, etc.


New Filtering System

 

Zeenea Explorer offers a smart filtering system to contextualize search results. Users can rely on Zeenea’s preconfigured filters – such as item type, connection, or contact – or on the organization’s own custom filters. For even more efficient searches, we redesigned our search results page and filtering system:

 

  • Available filters are always visible, making it easier to narrow down the search,
  • By clicking on a search result, an overview panel with more information is always available without losing the context of the search,
  • The filters most relevant to the search are placed at the top of the page, allowing users to quickly get the results they need for specific use cases.

Easily browsing the catalog by Topic

 

One major 2023 release was our Topics feature. Indeed, to enable business users to (even more!) quickly find their data assets for their use cases, Data Stewards can easily define Topics in Zeenea Studio. To do so, they simply select the filters in the Catalog that represent a specific theme, use case, or anything else that is relevant to business.

Data teams using Zeenea Explorer can therefore easily and quickly search through the catalog by Topic to reduce their time searching for the information they need. Topics can be directly accessed via the Explorer homepage and the search bar when browsing the catalog.


Alternative names for Glossary Items for better discovery

 

In order for users to easily find the data and business terms they need for their use cases, Data Stewards can add synonyms, acronyms, and abbreviations for Glossary Items!

Ex: Customer Relationship Management > CRM


Improved search performance

 

Throughout the year, we implemented a significant number of improvements to make searching more efficient. The addition of stop words – pronouns, articles, and prepositions that are ignored at query time – ensures more refined and pertinent results. Moreover, we added an “INFIELD:” operator, enabling users to search for Datasets that contain a specific field.


Microsoft Teams integration

 

Zeenea also strengthened its communication and collaboration capabilities. Specifically, when a contact is linked to a Microsoft email address, Zeenea now facilitates the initiation of direct conversations via Teams. This integration allows Teams users to promptly engage with relevant individuals for additional information on specific Items. Other integrations with various tools are in the works. ⭐️


Increase Data Steward productivity & efficiency

 

Our goal at Zeenea is to simplify the lives of data producers so they can efficiently manage, maintain, and enrich the documentation of their enterprise data assets in just a few clicks. Here are some features and enhancements that help them stay organized, focused, and productive.

Automated Datasets Import

 

When importing new Datasets in the Catalog, administrators can turn on our Automatic Import feature which automatically imports new Items after each scheduled inventory. This time-saving enhancement increases operational efficiency, allowing Data Stewards to focus on more strategic tasks rather than the routine import process.


Orphan Fields Deletion

 

We’ve also added the ability to manage Orphan Fields more effectively. This includes the option to perform bulk deletions of Orphan Fields, accelerating the process of decluttering and organizing the catalog. Alternatively, Stewards can delete a single Orphan Field directly from its detail page, providing a more granular and precise approach to catalog maintenance.


Building reports based on the content of the catalog

 

We added a new section in Zeenea Studio – The Analytics Dashboard – to easily create and build reports based on the content and usage of the organization’s catalog.

Directly on the Analytics Dashboard page, Stewards can view the completion level of their Item Types, including Custom Items. Each Item Type element is clickable to quickly view the Catalog section filtered by the selected Item Type.

For more detailed information on the completion level of a particular Item Type, Stewards can create their own analyses! They select an Item Type and a Property and, for each value of this Property, they can consult the completion level of the Item’s template, including its description and linked Glossary Items.


New look for the Steward Dashboard

 

Zeenea Explorer isn’t the only application that got a makeover! Indeed, to help Data Stewards stay organized, focused, and productive, we redesigned the Dashboard layout to be more intuitive and help get work done faster. This includes:

 

  • New Perimeter design: A brand new level of personalization when logging in to the Dashboard. The perimeter now extends beyond Dataset completion – it includes all the Items that one is a Curator for, including Fields, Data Processes, Glossary Items, and Custom Items.
  • Watchlists Widget: Just as Data Stewards create Topics for enhanced organization for Explorer users, they can now create Watchlists to facilitate access to Items requiring specific actions. By filtering the catalog with the criteria of their choice, Data Stewards save these preferences as new Watchlists via the “Save filters as” button, and directly access them via the Watchlist widget when logging on to their Dashboard.
  • The Latest Searches widget: Caters specifically to the Data Steward, focusing on their recent searches to enable them to pick up where they left off.
  • The Most Popular Items widget: Shows the most consulted and widely used Items within the Data Steward’s Perimeter by other users. Each Item is clickable, giving instant access to its contents.

 

View the Feature Note

 


Deliver trusted, secure, and compliant information across the organization

Data Sampling on Datasets

 

For select connections, it is possible to get Data Sampling for Datasets. Our Data Sampling capabilities allow users to obtain representative subsets of existing datasets, offering a more efficient approach to working with large volumes of data. With Data Sampling activated, administrators can configure fields to be obfuscated, mitigating the risk of displaying sensitive personal information.

This feature carries significant importance to our customers, as it enables users to save valuable time and resources by working with smaller, yet representative, portions of extensive datasets. This also allows early identification of data issues, thereby enhancing overall data quality and subsequent analyses. Most notably, the capacity to obfuscate fields addresses critical privacy and security concerns, allowing users to engage with anonymized or pseudonymized subsets of sensitive data, ensuring compliance with privacy regulations, and safeguarding against unauthorized access.
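To make the idea more tangible, here is a minimal, purely illustrative sketch of what sampling and obfuscation can look like in practice, using pandas; the column names and the hashing scheme are hypothetical and not Zeenea’s actual implementation.

```python
# Illustrative only: a simple take on "sample + obfuscate" with pandas.
# Column names and the hashing scheme are hypothetical.
import hashlib

import pandas as pd

def sample_and_obfuscate(df: pd.DataFrame, sensitive_cols: list[str], n: int = 100) -> pd.DataFrame:
    """Return a random sample of df with sensitive columns replaced by short hashes."""
    sample = df.sample(n=min(n, len(df)), random_state=42).copy()
    for col in sensitive_cols:
        sample[col] = sample[col].astype(str).map(
            lambda v: hashlib.sha256(v.encode()).hexdigest()[:12]
        )
    return sample

customers = pd.DataFrame({
    "customer_id": range(1, 1001),
    "email": [f"user{i}@example.com" for i in range(1, 1001)],
    "country": ["FR", "DE", "US", "UK"] * 250,
})
preview = sample_and_obfuscate(customers, sensitive_cols=["email"], n=50)
print(preview.head())
```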


Powerful Lineage capabilities

 

In 2022, we made a lot of improvements to our Lineage graph. Not only did we simplify its design and layout, but we also made it possible for users to display only the first level of lineage, expand and close the lineage on demand, and get a highlighted view of the direct lineage of a selected Item.

This year, we made other significant UX changes, including the possibility to expand or reduce all lineage levels in one click, hide the data processes that don’t have at least one input and one output, and view long connection names via a tooltip.

However, the most notable release is the possibility to have Field-level lineage! Indeed, it is now possible to retrieve the input and output Fields of tables and reports, and for more context, add the operation’s description. Then, users can directly view their Field level transformations over time in the Data Lineage graph in both Zeenea Explorer and Zeenea Studio.
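As a purely illustrative aside, field-level lineage boils down to edges linking source fields to target fields through an operation. The tiny sketch below shows one way such edges could be modeled and traced upstream; the structure and field names are hypothetical, not Zeenea’s internal model.

```python
# A minimal, hypothetical model of field-level lineage edges and upstream tracing.
from collections import defaultdict

# Each edge: (source field, operation description, target field)
edges = [
    ("crm.customers.email", "lowercase + trim", "dwh.dim_customer.email"),
    ("crm.customers.birth_date", "age computation", "dwh.dim_customer.age"),
    ("dwh.dim_customer.age", "bucketing into ranges", "bi.report_customers.age_band"),
]

upstream = defaultdict(list)
for source, operation, target in edges:
    upstream[target].append((source, operation))

def trace_upstream(field: str, depth: int = 0) -> None:
    """Recursively print every field feeding into the given field."""
    for source, operation in upstream.get(field, []):
        print("  " * depth + f"{field} <- {source} ({operation})")
        trace_upstream(source, depth + 1)

trace_upstream("bi.report_customers.age_band")
```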


Data Quality Information on Datasets

 

By leveraging GraphQL and knowledge graph technologies, the Zeenea Data Discovery Platform provides a flexible approach to integrating best-of-breed data quality solutions. Datasets are synchronized with third-party DQM tools via simple query and mutation operations through our Catalog API capabilities. The DQM tool delivers real-time data quality scan results to the corresponding dataset within Zeenea, enabling users to conveniently review data quality insights directly within the catalog.

This new feature includes:

  • A Data Quality tab in your Dataset’s detail pages, where users can view its Quality checks as well as the type, status, description, last execution date, etc.
  • The possibility to view more information on the Dataset’s quality directly in the DQM tool via the “Open dashboard in [Tool Name]” link.
  • A data quality indicator of Datasets directly displayed in the search results and lineage.

 

View the Feature Note


Enable end-to-end connectivity with all their data sources

 

With Zeenea, connect to all your data sources in seconds. Our platform’s built-in scanners and APIs enable organizations to automatically collect, consolidate, and link metadata from their data ecosystem. This year, we made significant enhancements to our connectivity to enable our customers to build a platform that truly represents their data ecosystem.

Catalog Management APIs

 

Recognizing the importance of API integration, Zeenea has developed powerful API capabilities that enable organizations to seamlessly connect and leverage their data catalog within their existing ecosystem.

In 2023, Zeenea developed Catalog APIs, which help Data Stewards with their documentation tasks. These Catalog APIs include:

Query operations to retrieve specific catalog assets: Retrieve a specific asset using its unique reference or its name & type, or retrieve a list of assets by connection or by a given Item type. Zeenea’s Catalog APIs offer flexible querying, letting users narrow down results so they aren’t overwhelmed with a plethora of information.

Mutation operations to create and update catalog assets: To save even more time when documenting and updating company data, Zeenea’s Catalog APIs enable data producers to easily create, modify, and delete catalog assets. They support the creation, update, and deletion of Custom Items and Data Processes as well as their associated metadata, and the update of Datasets, Data Visualizations, and Contacts. This is particularly important when users leave the company or change roles: data producers can easily transfer the information that was linked to one person to another.
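To make this more concrete, here is a minimal sketch of what calling a GraphQL catalog API could look like from a script. The endpoint, authentication header, and the query and mutation field names below are purely hypothetical illustrations, not Zeenea’s actual schema; refer to the Catalog API documentation for the real operations.

```python
# Hypothetical sketch of calling a GraphQL catalog API with the requests library.
# Endpoint, headers, and query/mutation shapes are invented for illustration only.
import requests

ENDPOINT = "https://your-instance.example.com/api/catalog/graphql"  # hypothetical URL
HEADERS = {"X-API-SECRET": "your-api-key"}  # hypothetical auth header

# Query: retrieve an asset by name and type (illustrative field names).
query = """
query ($name: String!, $type: String!) {
  assets(name: $name, type: $type) { key name description }
}
"""

# Mutation: update an asset's description (illustrative field names).
mutation = """
mutation ($key: String!, $description: String!) {
  updateAsset(key: $key, description: $description) { key }
}
"""

def run(operation: str, variables: dict) -> dict:
    """POST a GraphQL operation and return the JSON response."""
    response = requests.post(
        ENDPOINT, json={"query": operation, "variables": variables}, headers=HEADERS
    )
    response.raise_for_status()
    return response.json()

print(run(query, {"name": "customers", "type": "Dataset"}))
```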

 

Read the Feature Note

Property & Responsibility Codes management

 

Another new feature is the ability to add codes to properties & responsibilities so they can easily be used in API scripts for more reliable queries & retrievals.

For all properties and responsibilities that were built in Zeenea (e.g. Personally Identifiable Information) or harvested from connectors, it is possible to modify their names and descriptions to better suit the organization’s context.


More than a dozen connectors added to the list

 

At Zeenea, we develop advanced connectors to automatically synchronize metadata between our data discovery platform and all your sources. This native connectivity saves you the tedious and challenging task of manually finding the data you need for a specific business use case, a task that often requires access to scarce technical resources.

In 2023 alone, we developed over a dozen new connectors! This achievement underscores our agility and proficiency in swiftly integrating with diverse data sources utilized by our customers. By expanding our connectivity options, we aim to empower our customers with greater flexibility and accessibility.

 

View our connectors

How AI Strengthens Data Governance

According to a report published by McKinsey at the end of 2022, 50% of organizations have already integrated artificial intelligence to optimize service operations or create new products. The development of AI and machine learning in everyday business reflects the central role of data in management development strategies. To function effectively, AI depends on vast sets of data, which must be the subject of methodical and rigorous governance.

Behind the concept of data governance lies the set of processes, policies, and standards that govern the collection, storage, management, quality, and access to data within an organization. The role of data governance? To ensure that data is accurate, secure, accessible, and compliant with current regulations. The relationship between AI and data governance is a close one. AI models learn from data, and poor quality or biased data can lead to erroneous or discriminatory decisions.

Do you want to ensure that the data used by AI systems and their algorithms is reliable, ethical, and privacy-compliant? Then data governance is an essential prerequisite. By moving forward on a dual project of AI and data governance, you create a virtuous loop. Indeed, AI can also be used to improve data governance by automating tasks such as anomaly detection or data classification.

Let’s take a look at the (many!) benefits of AI-enhanced data governance!

What are the benefits of AI-powered data governance?

Improve the quality of your data

 

Data quality must be a fundamental pillar of any data strategy. The more reliable the data, the more relevant the lessons, choices, and orientations that emerge from it, and AI contributes to improving data quality through a number of mechanisms. In fact, AI algorithms can automate the detection and correction of errors in datasets, thereby reducing inconsistencies and inaccuracies.

Moreover, AI can help standardize data by structuring it in a coherent way, making it easier and more reliable to use, compare, and put into perspective. With machine learning, it is also possible to identify trends and patterns hidden in the data, enabling the discovery of errors or missing data.
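As a simple illustration of this kind of automation, the sketch below uses scikit-learn’s IsolationForest to flag suspicious values in a numeric column; the data and contamination threshold are made up, and real governance tools would of course go much further.

```python
# A minimal sketch of ML-based anomaly detection on a numeric column.
# Values and thresholds are illustrative, not tied to any specific tool.
import numpy as np
from sklearn.ensemble import IsolationForest

# Simulated order amounts with a few obvious outliers (e.g. typing errors).
amounts = np.concatenate([np.random.normal(100, 15, 500), [9_999, -250, 12_000]])
X = amounts.reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 marks suspected anomalies

print("Suspect values:", amounts[labels == -1])
```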

Automate data compliance

 

At a time when cyber threats are literally exploding, data compliance must be a priority in your organization. But guaranteeing compliance requires constant vigilance, which can’t depend exclusively on human intelligence – especially as AI can proactively monitor potential violations of data regulations by performing real-time analysis of all data flows: detecting anomalies or unauthorized access, triggering automatic alerts, and even making recommendations to correct any problems. In addition, AI facilitates the classification and labeling of sensitive data, ensuring that it is handled appropriately. Finally, AI systems can also generate automatic compliance reports, reducing the administrative workload.

Strengthen data security

 

Through its ability to proactively detect threats by analyzing data access patterns in real time, AI can alert you to suspicious behavior, such as attempted intrusions or unauthorized access. To take data governance even further, AI leverages machine-learning-based malware detection systems. These systems can identify known malware signatures and detect unknown variants by analyzing behavior. Finally, AI also contributes to security by automating the management of security patches and monitoring compliance with security policies.

Democratize data

 

At the heart of your data strategy lies one objective: to encourage your employees to use data whenever possible. In this way, you will foster the development of a data culture within your organization. The key to achieving this is to facilitate access to data by simplifying the search and analysis of complex data. AI search engines can quickly extract relevant information from large datasets, enabling employees to quickly find what they need. In addition, AI can automate the aggregation and presentation of data in the form of interactive dashboards, making information ever more accessible and easy to share!

What does the future hold for data governance?

 

Increasing amounts of data, increasing levels of analysis, increasing levels of predictability: this is where things are heading. Along the way, companies will adopt more holistic approaches to their challenges, gaining both perspective on and proximity to their markets. To meet this challenge, it is vital to integrate data governance into the overall business strategy. In this regard, automation will be essential, relying heavily on artificial intelligence and machine learning tools to proactively detect, classify, and secure data.

The future will be shaped by greater collaboration between the IT, legal, and business teams, which will be key to ensuring the success of data governance and maintaining the trust of all stakeholders.

What is data modernization?

Data modernization is crucial to unlocking the value of data. Whether it’s breaking down silos, improving collaboration, or using AI and advanced analytics, data modernization enables data-driven decisions, trend detection, optimized operations, personalized customer experiences, and innovation. Ready to take action? Follow the guide!

Soaring inflation, market instability, changing consumer expectations, hyper-competition, accelerating time-to-market – all of these factors are causing you to rethink your processes and organization in order to become more agile and flexible. And your data is no exception. To meet these multiple challenges, data modernization promises to let your company take full advantage of your data in four priority areas:

  • Make better-informed decisions
  • Boost innovation
  • Improve agility in a context of uncertainty
  • Remain competitive in ever-changing markets

Behind the concept of data modernization lies a strategic process aimed at transforming and updating an organization’s data management practices, infrastructures, and technologies. To begin the data modernization process, you’ll need to rely on essential elements such as a fresh look at your data architecture. This entails designing and implementing more agile, flexible, and scalable data systems and structures to meet changing business needs.

The other essential component of a data modernization project is data integration. This involves unifying data from disparate sources, both internal and external, in order to create a comprehensive and consistent view of the data.

A third stage involves automating data processing and making systematic use of AI to speed up data analysis and decision-making processes.

Finally, data modernization involves strengthening the protection of sensitive data to guarantee the compliance of your data assets, and better data governance to ensure data quality, traceability, and accountability.

Why is it necessary to modernize your data?

 

The benefits of data modernization in such a complex global context seem obvious. But there are other reasons just as valid for taking the path of data modernization.

Reason #1: Embrace technological change

 

Rapid technological developments have introduced new opportunities to store, process, and analyze data more efficiently. By modernizing your data, you can exploit these new technologies and have every chance of remaining competitive.

Reason #2: Face the explosion of data

 

The amount of data generated by businesses has increased significantly. Modernization makes it possible to manage these massive volumes of data more efficiently, avoiding the saturation of existing infrastructures.

Reason #3: Harness and leverage new types of data

 

Companies are now processing a more diverse variety of data, including unstructured data such as social media content and video. Modernization makes it possible to integrate and exploit these different data sources.

Reason #4: Meet the challenge of business agility

 

You experience it every day on the job. Both your organization and your teams need to be more agile to adapt quickly to market changes. Data modernization enables you to rely on a more flexible data infrastructure, and consequently, one that is more agile!

Reason #5: Guarantee data security and compliance

 

Data protection regulations are constantly evolving. Proper modernization enhances data security and ensures compliance with legal requirements.

Reason #6: Continuously improve the quality of your data

 

Data modernization cleanses, normalizes, and enriches data, improving its quality and reliability for more accurate decision-making.

Reason #7: Stay at the forefront of innovation

 

In a massively digitally-driven world, companies that have embarked on the path of data modernization will be able to explore new opportunities for innovation, such as exploiting artificial intelligence, machine learning, and advanced analytics.

What are the best methods for data modernization?

 

Are you looking to kick-start a data modernization project in your company? It’s all about getting off to a good start. To begin with, set clear objectives. Why are you embarking on this project? What is your vision? By answering these questions, you’ll be able to lay out a clear roadmap to ensure that the process is aligned with the company’s needs and priorities.

Next, make sure you put in place effective data governance. This is based on precise processes for managing, securing, and guaranteeing data quality. It also clarifies roles and responsibilities, ensuring accountability and regulatory compliance. Knowing who does what, for when, and for whom, makes it possible to manage increasingly varied data assets on a day-to-day basis.

Third, focus on data quality. Take steps to identify and correct errors, remove duplicates, and ensure that data is accurate and consistent. High-quality data improves the confidence and efficiency of decision-making processes.

Finally, adopt a methodological approach founded on agility. Keep in mind the ‘baby steps’ method. Don’t expect a big bang, but rely on continuous iterations and adjustments in the data modernization process. This will ensure that you can adapt quickly to your company’s changing needs while minimizing turmoil.

A final piece of advice? Don’t think of data modernization as just a technological project! Involve all your teams in the data modernization project, and support the transition by training them to ensure successful adoption.

What is data normalization?

Are you concerned about data quality? If so, you should be concerned about data normalization. Data normalization consists of transforming data without distorting it, so that it conforms to a predefined and constrained set of values, making it more efficient to use.

Discover the importance of this technique, which has become indispensable for data-driven companies.

For any company that turns to data to improve its productivity and efficiency, or the relevance of its offer or its approach to its market, data representativeness is crucial. Your challenge is to maximize the intelligence derived from your data. To achieve this, you need to do everything in your power to limit the distortion of information. This is the vocation of data normalization.

Data normalization is commonly used in statistics, data science, and machine learning to scale the values of different variables within the same interval. The main objectives of normalization are to make data comparable with one another and more easily interpretable by analysis and modeling algorithms.

Why is data normalization important for companies?

 

In many cases, data can have very different scales, i.e. some variables may have much larger or smaller values than others. This can pose problems for certain statistical techniques or machine learning algorithms, as they can be sensitive to the scale of the data. Normalization solves this problem by adjusting variable values to lie within a specified interval, often between 0 and 1, or around the mean with a given standard deviation.

What are the benefits associated with data normalization?

 

Data normalization improves the quality, performance, and interpretability of statistical analyses and machine learning models by eliminating problems associated with variable scaling and enabling fairer comparisons between different data characteristics. In practice, this translates into concrete benefits:

Maximum comparability: Normalized data are scaled to the same level, enabling easier comparison and interpretation between different variables.

Optimized machine learning: Normalization facilitates faster convergence of machine learning algorithms by reducing the scale of variables, helping to achieve more reliable and consolidated results more quickly.

Enhanced model stability: Normalization reduces the impact of extreme values (outliers), making models more stable and resistant to data variations.

Improved interpretability: Data normalization facilitates the interpretation of coefficients, making analysis more comprehensible.

What methods are used to normalize data?

 

There are several methods of data normalization, but two stand out from the crowd, starting with the Min-Max Scaling method. It is based on the principle of scaling the values of a variable so that they fall within a specified interval, usually between 0 and 1. This technique is particularly useful when you want to retain the linear relationship between the original values.

Another method, called Z-Score normalization, is a more standardization-oriented technique. It transforms the values of a variable so that they have a mean of 0 and a standard deviation of 1. Unlike Min-Max normalization, standardization does not impose a specific upper or lower limit on the transformed values. This technique is recommended when variables have very different scales, as it allows data to be centered around zero and scaled with respect to standard deviation.
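To make the two methods concrete, here is a short, illustrative NumPy sketch of Min-Max scaling and Z-score standardization applied to the same sample values.

```python
# Illustrative sketch of the two normalization methods described above.
import numpy as np

values = np.array([12.0, 15.0, 20.0, 22.0, 90.0])

# Min-Max scaling: rescale to the [0, 1] interval, preserving linear relationships.
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: mean 0, standard deviation 1, no fixed bounds.
z_score = (values - values.mean()) / values.std()

print("Min-Max:", min_max.round(3))
print("Z-score:", z_score.round(3))
```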

Other methods may also be considered for data normalization, but these are more marginal. Decimal Scaling and Unit Vector Scaling are two examples.

Decimal normalization involves dividing each value of a variable by a power of 10, chosen according to the number of significant digits, so that the largest absolute value falls below 1. This adjusts the values to lie within a smaller interval, thus simplifying calculations.

Unit vector normalization is used in machine learning. It consists in dividing each value of a data vector by the Euclidean norm of the vector, thus transforming the vector into a unit vector (of length 1). This technique is often used in algorithms that calculate distances or similarities between vectors.
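For completeness, here is an equally illustrative sketch of these two more marginal methods, again with made-up sample values.

```python
# Illustrative sketches of decimal scaling and unit vector normalization.
import numpy as np

values = np.array([345.0, -78.0, 1290.0])

# Decimal scaling: divide by a power of 10 so the largest absolute value is below 1.
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / (10 ** j)

# Unit vector scaling: divide by the Euclidean norm so the vector has length 1.
unit_vector = values / np.linalg.norm(values)

print(decimal_scaled)                                # e.g. [ 0.0345 -0.0078  0.129 ]
print(unit_vector, np.linalg.norm(unit_vector))      # norm is 1.0
```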

What’s the difference between data normalization and data standardization?

 

Data normalization and data standardization address the same issue of data representativeness but from different perspectives. Although they are both data scaling techniques, they differ in the way they transform variable values.

Data standardization transforms the values of a variable so that they have a mean of 0 and a standard deviation of 1. Unlike normalization, standardization does not set a specific range for the transformed values. Standardization is useful when variables have very different scales and allows data to be centered around zero and scaled with respect to standard deviation, which can facilitate the interpretation of coefficients in some models. Depending on the nature of your data and the lessons you wish to learn from it, you may need to use either data normalization or data standardization.

What are the most common data quality issues and how can you solve them?

In order to stand out from your competitors, innovate, and offer personalized products and services, collecting data is essential. However, managing data isn’t a walk in the park: every day, small problems can affect its quality – incomplete or inaccurate data, security problems, hidden data, duplicates, inconsistencies, and the list goes on.

Here is an overview of the most common data quality-related issues and some best practices to curb them for good!

The risks associated with poor data quality

As it’s been said over and over again, when it comes to data, the real issue is not the quantity of data but its quality. Data Quality Management (DQM) is a demanding discipline that relies on the endless questioning of data processes and constant surveillance of the very nature of the information that constitutes your data assets. Poor data quality can directly translate into lower revenues and higher operational costs, potentially resulting in financial losses for your company.

When data quality is degraded, analyses, projections, forecasts, and even decisions can be distorted. And the greater the volume of degraded data, the greater the gap between reality and your understanding of it. Ensuring data quality starts with a good understanding of the errors that can affect it.

The most common data quality issues

Ensuring data quality is a key topic for any company that bases its development strategy on data. To carry out targeted actions, you need to prioritize tasks and not spread yourself too thin. Data Quality Management consists in identifying all the erroneous information that could distort your decision-making. This erroneous data can be classified into four categories.

Duplicate data

When data is duplicated, it means that the same information is present multiple times in the same database or file. Data duplication is one of the most harmful issues because it is often difficult to detect. Beyond 5% of duplicated data, data quality is generally considered to be degraded. For example, CRM tools often generate duplicate data, because their users sometimes add contacts without checking whether they are already present in the database.
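As a simple illustration, the duplicate rate of a table can be measured in a few lines of pandas; the data below is invented, and the 5% threshold simply mirrors the rule of thumb mentioned above.

```python
# A minimal sketch of measuring the duplicate rate of a contact table with pandas.
import pandas as pd

contacts = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "b@x.com"],
    "name": ["Ann", "Bob", "Ann", "Cid", "Bob"],
})

duplicate_rate = contacts.duplicated().mean()  # share of rows that repeat an earlier row
print(f"Duplicate rate: {duplicate_rate:.0%}")

if duplicate_rate > 0.05:
    print("Quality alert: more than 5% duplicated rows")
    contacts = contacts.drop_duplicates()
```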

Hidden data

On a daily basis, your business generates an increasing amount of data. Very often, you only leverage a limited portion of the available information. The rest of the data produced by your business gets scattered and diluted in data silos. It then remains permanently untapped. For example, a customer’s purchase history is not always available to customer service teams. Yet, this information would allow them to better identify the customer’s profile and therefore, provide more relevant answers to their specific requests, or even upsell or cross-sell by making adapted suggestions.

Inconsistent data

Are John Smith and Jon Smith really two different customers? Inconsistent data significantly affects data quality. It can also be created by another well-known phenomenon: redundancy. This phenomenon occurs when you work with multiple sources (including third-party data) in addition to your own data. Discrepancies in data formats, units, or even spelling must be tracked in a data quality approach.
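As an illustration, near-duplicates such as “John Smith” and “Jon Smith” can often be surfaced with simple string similarity; the sketch below uses Python’s standard difflib, with a threshold chosen arbitrarily for the example.

```python
# A minimal sketch of spotting near-duplicate customer names with difflib.
from difflib import SequenceMatcher
from itertools import combinations

names = ["John Smith", "Jon Smith", "Jane Doe", "JOHN  SMITH"]

def normalized(value: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(value.lower().split())

for a, b in combinations(names, 2):
    ratio = SequenceMatcher(None, normalized(a), normalized(b)).ratio()
    if ratio > 0.85:
        print(f"Possible inconsistency: {a!r} vs {b!r} (similarity {ratio:.2f})")
```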

Inaccurate data

It may seem obvious, but inaccurate data is probably one of the worst issues that can undermine data quality. When customer data is inaccurate, any personalized experience will not be relevant. For example, if your data inventory is inaccurate, supply difficulties or storage costs can skyrocket. Whether it’s incorrect contact information or missing or empty fields, you need to do everything you can to eradicate inaccurate data.

How to solve data quality problems

While common sense often presides over good data quality management, it is not enough on its own to ensure it.

To meet these challenges and solve your data quality issues, you’ll need a Data Quality Management tool. But in order to choose the right solution, you will need to start by mapping your data assets in order to identify and evaluate their actual quality. Deploying a Data Quality Management solution, data governance, training, and raising awareness of your teams to good data management… are all essential pillars to limit data quality-related issues!

To learn more about DQM, feel free to download our Guide to Data Quality Management 

 

What is Data Integrity?

We have entered a world where data is your company’s most valuable asset, and the quality, security, and health of your data are therefore essential. To guarantee this, you need to ensure its integrity at all times. Would you like to understand the fundamental rules of Data Integrity to set your company on the path to serene and reliable exploitation of data? Follow this guide!

While the notion of integrity is often mentioned when talking about security and data being compromised, it should not be confused with Data Integrity, which is a discipline on its own in the complex and demanding world of data exploitation.

The exact definition of Data Integrity is maintaining and ensuring the accuracy and consistency of data throughout its life cycle.

Ensuring Data Integrity means ensuring that the information stored in a database remains complete, accurate, and reliable. And this, regardless of how long it is stored, how often it is accessed, or how it is processed.

The different types of Data Integrity

The concept of Data Integrity is complex because it takes multiple forms and meanings. Beyond an overall approach to Data Integrity, it is important to understand that there are different types of Data Integrity. These different types are not in opposition to each other, but rather complement and combine with one another to ensure the quality and security of your data assets.

Guaranteeing Data Integrity, in all its dimensions, is not only a matter of compliance but also of optimal use of the available information. There are two main types of Data Integrity: physical integrity on the one hand, and logical integrity on the other.

Physical integrity

Protecting the physical integrity of data means avoiding exposing it to human error and hardware failure (such as storage server malfunctions, for example).

It also means making sure that the data cannot be distorted by system programmers, for example. In the same way, the physical integrity of the data is called into question when a power failure or a fire affects a database.

Finally, the physical integrity is also compromised when a hacker manages to access the data.

Logical Integrity

Ensuring the logical integrity of your data means making sure that the data remains unchanged under all circumstances. While logical integrity is, like physical integrity, intended to protect data from human manipulation and error, it is exercised in a different way and on four distinct axes:

Entity integrity

Entity integrity is the principle of associating primary keys with the data collected. These unique values identify all of your data elements. It is an effective guarantee against duplicates, for example, because each piece of data is only listed once.

Referential integrity

The principle of referential integrity describes the series of processes that ensure that data is stored and used in a uniform and consistent manner. Repository mode is your best assurance that only the appropriate and authorized data changes, additions, or deletions are made. Referential integrity allows you to define rules to eradicate duplicate entries or to verify the accuracy of the data entered in real-time.
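To illustrate how a database enforces these first two rules, here is a minimal sketch using Python’s built-in sqlite3 module; the tables and columns are invented for the example.

```python
# A minimal sketch of entity and referential integrity enforced by a database,
# using Python's built-in sqlite3 module. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Entity integrity: every customer has a unique, non-null primary key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Referential integrity: an order must reference an existing customer.
conn.execute("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id)
)
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")  # accepted

try:
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 999)")  # unknown customer
except sqlite3.IntegrityError as error:
    print("Rejected by referential integrity:", error)
```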

Domain Integrity

Domain integrity refers to the set of processes that ensure the accuracy of data attached to a domain. A domain is characterized by a set of values that are considered acceptable and that a column can contain. It can include different rules to define either the format or type of the data or the amount of information that can be entered.

User-defined integrity

User-defined integrity involves rules created by the user to meet their needs related to their own usage. By adding a number of specific business rules to Data Integrity measures, it is possible to complement the management of entity integrity, referential integrity, and domain integrity.

Why is it important to ensure data integrity?

Data integrity is important for two key reasons.

The first concerns data compliance. As the GDPR sets strict rules and provides for severe penalties, ensuring Data Integrity at all times is a major issue.

The second is related to the use of your data. When integrity is preserved, you have the certainty that the information available is reliable and of quality, and, above all, in line with reality!

The differences between Data Integrity and Data Security

Data Security is a discipline that brings together all the measures that are deployed to prevent data corruption. It is based on the use of systems, processes, and procedures that restrict unauthorized access to your data.

Data Integrity, on the other hand, addresses all the techniques and solutions that ensure the preservation of the integrity and accuracy of the information throughout its life cycle.

In other words, Data Security is one of the components that contribute to Data Integrity.

All you need to know about Data Observability

Companies are collecting and processing more data than they did before and much less than they will tomorrow. Once a data culture has been infused, it is essential to have complete and continuous visibility of your data. Why? To anticipate any problem and any possible degradation of the data. This is the role of Data Observability.

4.95 billion Internet users. 5.31 billion mobile users. 4.62 billion active social network users. The figures in the Digital Report 2022 Global Overview by HootSuite and We Are Social illustrate just how connected the entire world is. In 2021 alone, 79 zettabytes of data were produced and collected, a figure 40 times greater than the volume of data generated in 2010! And according to figures published by Statista, the 97-zettabyte threshold was expected to be reached by the end of 2022, a figure that could double by 2025. This profusion of information is a challenge for a lot of companies.

Collecting, managing, organizing, and exploiting data can quickly become a headache because, as data is manipulated and moved around, it can be degraded or even rendered unusable. Data Observability is one way to regain control over the reliability, quality, and accessibility of your data.

What is Data Observability?

Data Observability is the discipline of analyzing, understanding, diagnosing, and managing the health of data by leveraging multiple IT tools throughout its lifecycle.

In order to embark on the path of Data Observability, you will need to build a Data Observability platform. This will not only provide you with an accurate and holistic view of your data but also allow you to identify quality and duplication issues in real time. How can you do this? By relying on continuous telemetry tools.

But don’t think of Data Observability as just a data monitoring mission. It goes beyond that – it also contributes to optimizing the security of your data. Indeed, permanent vigilance on your data flows allows you to guarantee the efficiency of your security devices and acts as a means of early detection of any potential problem.

What are the benefits of data observability?

The first benefit of Data Observability is the ability to anticipate potential degradation in the quality or security of your data. Because the principle of observability is based on continuous, automated monitoring of your data, you will be able to detect any difficulties very early.

From this end-to-end and permanent visibility of your data, you can draw another benefit: that of making your data collection and processing flows more reliable. As data volumes continue to grow and all of your decision-making processes are linked to data, it is essential to ensure the continuity of information processing. Every second of interruption in data management processes can be detrimental to your business.

Data observability not only limits your exposure to the risk of interruption but also allows you to restore flows as quickly as possible in the event of an incident.

The 5 pillars of data observability

Harnessing the full potential of data observability is all about understanding the scope of your platform. This is built around five fundamental pillars.

Pillar #1: Freshness

In particular, a Data Observability platform allows you to verify the freshness of data and thus effectively fight against information obsolescence. The principle: guarantee the relevance of the knowledge derived from the data.

Pillar #2: Distribution

The notion of distribution is essential when it comes to data reliability. The concept is simple: rely on the probable value of data to predict its reliability.

Pillar #3: Volume

To know if your data is complete, you need to anticipate the expected volume. This is what Data Observability offers, which allows you to estimate, for a given sample, the expected nominal volume and compare it with the volume of data available. When the variables match, the data is complete.
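As an illustration, a volume check can be as simple as comparing the number of rows actually received with the expected nominal volume; the sketch below is generic and not tied to any particular observability platform, and its thresholds are made up.

```python
# A minimal sketch of a volume check: compare the row count actually received
# with the expected nominal volume. Thresholds and table names are illustrative.
def check_volume(table: str, actual_rows: int, expected_rows: int, tolerance: float = 0.1) -> bool:
    """Return True when the received volume stays within the tolerated deviation."""
    deviation = abs(actual_rows - expected_rows) / expected_rows
    if deviation > tolerance:
        print(f"[ALERT] {table}: {actual_rows} rows vs ~{expected_rows} expected "
              f"({deviation:.0%} deviation)")
        return False
    return True

check_volume("daily_orders", actual_rows=7_450, expected_rows=10_000)  # triggers an alert
check_volume("daily_orders", actual_rows=9_800, expected_rows=10_000)  # passes
```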

Pillar #4: The Schema or Program

Know if your data has been degraded. This is the purpose of the Schema, also called the Program. The principle is to monitor the changes made to any data table and data organization to quickly identify damaged data.

Pillar #5: Lineage

By ensuring metadata collection and rigorous mapping of data sources, it is possible – much like tracing a water leak back to the faulty faucet – to pinpoint the sources and points of interruption in your data handling processes in the shortest time possible and with great accuracy.

Understanding the difference between Data Observability and Data Quality

While data observability is one of the elements that allow you to continuously optimize the quality of your data, it differs from Data Quality, which takes precedence over Data Observability: for observability to be fully utilized, Data Quality must first be assured.

Data Quality measures the state of a dataset, and more specifically its suitability for an organization’s needs, whereas Data Observability detects, troubleshoots, and prevents problems that affect data quality and system reliability.

What is data ingestion?

Relying simply on intuition is no longer possible – to gain a competitive advantage, it is essential to elevate your data-driven strategy. With Data Ingestion, you can access information faster and more efficiently by centralizing it in a single location. Here is an overview.

In a hyper-competitive and ever-changing business world, companies are in a race against time – a race run not so much against their direct competitors as against their customers’ expectations. The challenge is to identify consumer trends in order to anticipate those expectations. Being the first to satisfy a given need or to enter an emerging market… these strategic conditions can be met through Data Ingestion.

Data Ingestion allows you to gain an even finer knowledge of your customers and your market through the exploitation of increasingly heterogeneous data, and to identify weak signals in order to anticipate trends quickly and, above all, efficiently.

 

Understanding Data Ingestion

The principle of Data Ingestion is based on the idea of centralizing different sources of data in one place. By nature, this heterogeneous data needs to be meticulously cleaned and deduplicated in order to be brought together in a target environment for processing and exploitation. Whether your data comes from a data lake, customer files, or SaaS applications, it can be aggregated within a target site in order to reconcile it with one goal: to improve the understanding of a market, an ecosystem, or a target.

The term reconciliation perfectly sums up the purpose of Data Ingestion. The aim is to combine the knowledge contained in different types of databases to maximize the lessons learned.

 

What are the main benefits of data ingestion?

If you decide to embark on a Data Ingestion project, you will be able to reap various benefits. First, you will inevitably gain responsiveness and flexibility. Indeed, Data Ingestion tools are able to manage and process not only very large volumes of data but also a wide range of data types, including unstructured data. Data Ingestion also promises simplicity. With its ability to reconcile disparate information sources, Data Ingestion makes it much easier to extract data and restructure it into predefined formats to make it more usable.

The information that Data Ingestion gives access to can then be leveraged within advanced analytical tools. Maximizing the benefits of this in-depth knowledge of your customers or market will feed BI tools and make it easier for you to take a step back and define new strategic directions. Indeed, Data Ingestion contributes to simplifying access to data for your employees.

A more developed data culture also means faster and more informed decision-making and, consequently, a competitive advantage in defining more effective tactical and strategic levers.

 

What are the challenges of successful data ingestion?

Data Ingestion remains demanding, and a certain number of conditions need to be met to deliver its full potential. For example, ingesting very large volumes of very different data can raise data quality issues that can not only degrade the relevance of analyses but also lengthen processing times. In addition, the diversity of data sources mechanically increases exposure to vulnerabilities.

More complexity and more exposure to risk mechanically lead to a risk of increased processing costs. To succeed in a data ingestion project, you need to be aware of these risks in order to know how to protect yourself against them…

 

How to succeed in a Data Ingestion project?

The first piece of advice for an effective Data Ingestion project is to anticipate. Your ability to anticipate risks and difficulties depends on the proper mapping of your data assets.

Another lever is to activate automation. The volumes of data processed by data management are so massive that manual operations must be limited to a minimum. Automating the processing of information also has the advantage of providing more consistency in the structure of your data.

Finally, to maximize the chances of success of your Data Ingestion project, you can also consider opting for real-time Data Ingestion.

Also known as Streaming Data Ingestion, it is particularly suitable when you are looking to constantly update the knowledge you have of a market. This real-time ingestion provides a key answer to real-time decision-making issues.

Guide to Data Quality Management #3 – The main features of DQM tools

Data Quality refers to an organization’s ability to maintain the quality of its data over time. If we were to take some data professionals at their word, improving Data Quality is the panacea to all our business woes and should therefore be the top priority.

At Zeenea, we believe this should be nuanced: Data Quality is a means, among others, of limiting the uncertainties of meeting corporate objectives.

In this series of articles, we will go over everything data professionals need to know about Data Quality Management (DQM):

  1. The nine dimensions of Data Quality
  2. The challenges and risks associated with Data Quality
  3. The main features of Data Quality Management tools
  4. The Data Catalog contribution to DQM

One way to better understand the challenges of Data Quality is to look at the existing Data Quality solutions on the market.

From an operational point of view, how do we identify and correct Data Quality issues? What features do Data Quality Management tools offer to improve Data Quality?

Without going into too much detail, let’s illustrate the pros of a Data Quality Management tool through the main evaluation criteria of Gartner’s Magic Quadrant for Data Quality Solutions.

Connectivity

A Data Quality Management tool has to be able to gather and apply quality rules on all enterprise data (internal, external, on-prem, cloud, relational, non-relational, etc.). The tool must be able to plug into all relevant data in order to apply quality rules.

Data profiling, data measuring, and data visualization

You cannot correct Data Quality issues if you cannot detect them first. Data profiling enables IT and business users to assess the quality of the data in order to identify and understand the Data Quality issues. 

The tool must be able to carry out what is outlined in The nine dimensions of data quality to identify quality issues throughout the key dimensions for the organization.

Monitoring

The tool must be able to monitor the evolution of data quality over time and alert management when a given threshold is reached.

Data standardization and data cleaning

Then comes the data cleaning phase. The aim here is to provide data cleaning functionalities in order to enact norms or business rules to alter the data (format, values, page layout).

Data matching and merging

The aim is to identify and delete duplicates that can be present within or between datasets.

Address validation

The aim is to standardize addresses that could be incomplete or incorrect. 

Data curation and enrichment

A Data Quality Management tool should enable the integration of data from external sources to improve completeness, thereby adding value to the data.

The development and putting in place of business rules

A Data Quality Management tool should also enable the creation, deployment, and management of business rules, which can then be used to validate the data.
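As a simple illustration, a business rule is ultimately just a check that can be expressed as code and applied to records; the rule and field names below are invented for the example and not specific to any DQM tool.

```python
# A minimal sketch of a business rule expressed as code and applied to records.
import re

def rule_valid_email(record: dict) -> bool:
    """Business rule: the email field must match a basic address pattern."""
    return bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record.get("email", "")))

records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "not-an-email"},
]

violations = [r["id"] for r in records if not rule_valid_email(r)]
print("Records violating the rule:", violations)  # -> [2]
```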

Problem resolution

The quality management tool helps both IT and business users to assign, escalate, solve, and monitor Data Quality problems.

Metadata management

The tool should also be capable of capturing and reconciling all the metadata related to the Data Quality process.

User-friendliness

Lastly, a solution should be able to adapt to the different roles within the company, and specifically to non-technical business users.

Get our Data Quality Management guide for data-driven organizations

For more information on Data Quality and DQM, download our free guide: “A guide to Data Quality Management” now!


Guide to Data Quality Management #2 – The challenges and risks associated with Data Quality

Data Quality refers to an organization’s ability to maintain the quality of its data over time. If we were to take some data professionals at their word, improving Data Quality is the panacea to all our business woes and should therefore be the top priority.

At Zeenea, we believe this should be nuanced: Data Quality is a means, among others, of limiting the uncertainties of meeting corporate objectives.

In this series of articles, we will go over everything data professionals need to know about Data Quality Management (DQM):

  1. The nine dimensions of Data Quality
  2. The challenges and risks associated with Data Quality
  3. The main features of Data Quality Management tools
  4. The Data Catalog contribution to DQM

      The challenges of Data Quality for organizations

      Initiatives for improving the quality of the data are usually put in place by organizations to meet the conformity requirements and risk reduction. They are indispensable for reliable decision-making. There are unfortunately many stumbling blocks that can hinder Data Quality improvement initiatives. Below are some examples:

      • The exponential growth of the volume, speed, and variety of the data make the environment more complex and uncertain;
      • Increasing pressure from conformity regulations such as GDPR, BCBS 239, or HIPAA;
      • Teams are increasingly decentralized, and each have their own domain of expertise;
      • IT and data teams are snowed under and don’t have time to solve Data Quality issues;
      • The data aggregation processes are complex and long;
      • It can be difficult to standardize data between different sources;
      • Change audits among systems are complex;
      • Governance policies are difficult to implement.

      Having said that, there are also numerous opportunities to grab. High-quality data enables organizations to facilitate innovation with artificial intelligence and ensure a more personalized customer experience. Assuming there is enough quality data. 

Gartner has actually forecast that, through 2022, 85% of AI projects will deliver erroneous outcomes as a result of bias in the data, the algorithms, or the teams in charge of managing them.

      Reducing the level of risk by improving the quality of the data

      Poor Data Quality should be seen as a risk and quality improvement software as a possible solution to reduce this level of risk.

      Processing a quality issue:

      If we accept the notion above, any quality issue should be addressed in several phases:

1. Risk identification: this phase consists of seeking out, recognizing, and describing the risks that can help or prevent the organization from reaching its objectives – here, in part, because of a lack of Data Quality.

2. Risk analysis: the aim of this phase is to understand the nature of the risk and its characteristics. It includes consideration of the likelihood of events and their consequences, as well as the nature and importance of those consequences. Here, we should seek to identify what has caused the poor quality of the marketing data. We could cite, for example:

      • A poor user experience of the source system leading to typing errors;
      • A lack of verification of the completeness, accuracy, validity, uniqueness, consistency, or timeliness of the data;
      • A lack of simple means to ensure the traceability, clarity, and availability of the data;
• The absence of a governance process and of business team involvement.

3. Risk evaluation: the purpose of this phase is to compare the results of the risk analysis with the established risk criteria. It helps determine whether further action is needed – for instance, keeping the current measures in place, undertaking further analysis, etc.

      Let’s focus on the nine dimensions of Data Quality and evaluate the impact of poor quality on each of them:

        Data Quality evaluation

        The values for the levels of probability and severity should be defined by the main stakeholders, who know the data in question best. 
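To illustrate, a simple convention is to compute a risk level per dimension as probability × severity. The dimensions and the 1-5 scores below are hypothetical placeholders, not prescribed values:

```python
# Hypothetical probability/severity scores (on a 1-5 scale) per Data Quality
# dimension, as the main stakeholders might define them for Arthur's marketing data.
scores = {
    "completeness": {"probability": 4, "severity": 3},
    "accuracy":     {"probability": 2, "severity": 4},
    "validity":     {"probability": 3, "severity": 3},
    "uniqueness":   {"probability": 4, "severity": 2},
}

# A common convention: risk level = probability x severity.
ranked = sorted(scores.items(), key=lambda kv: kv[1]["probability"] * kv[1]["severity"], reverse=True)
for dimension, s in ranked:
    print(f"{dimension:<13} risk = {s['probability'] * s['severity']}")
```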

4. Risk treatment: this phase aims to set out the available options for reducing risk and to roll them out. It also involves assessing the usefulness of the actions taken and determining whether the residual risk is acceptable – and, if it is not, considering further treatment.

        Therefore, improving the quality of the data is clearly not a goal in itself:

        • Its cost must be evaluated based on company objectives;
        • The treatments to be implemented must be evaluated through each dimension of quality.

        Get our Data Quality Management guide for data-driven organizations

        For more information on Data Quality and DQM, download our free guide: “A guide to Data Quality Management” now!


        Guide to Data Quality Management #1 – The 9 Dimensions of Data Quality

Data Quality refers to an organization’s ability to maintain the quality of its data over time. If we were to take some data professionals at their word, improving Data Quality is the panacea for all our business woes and should therefore be the top priority.

        At Zeenea, we believe this should be nuanced: Data Quality is a means amongst others to limit the uncertainties of meeting corporate objectives. 

        In this series of articles, we will go over everything data professionals need to know about Data Quality Management (DQM):

        1. The nine dimensions of Data Quality
        2. The challenges and risks associated with Data Quality
        3. The main features of Data Quality Management tools
        4. The Data Catalog contribution to DQM

        Some definitions of Data Quality

Asking Data Analysts or Data Engineers for a definition of Data Quality will give you very different answers, even within the same company and amongst similar profiles. Some, for example, will focus on the uniqueness of the data, while others will prefer to reference standardization. You may well have your own interpretation.

The ISO 9000:2015 standard defines quality as “the degree to which a set of inherent characteristics fulfils requirements”.

        DAMA International (The Global Data Management Community) – a leading international association involving both business and technical data management professionals – adapts this definition to a data context: “Data Quality is the degree to which the data dimensions meet requirements.”

        The dimensional approach to Data Quality

        From an operational perspective, Data Quality translates into what we call Data Quality dimensions, in which each dimension relates to a specific aspect of quality. 

The four dimensions most often used are completeness, accuracy, validity, and availability. In the literature, there are many dimensions and different criteria for describing Data Quality; there is, however, no consensus on what these dimensions actually are.

For example, DAMA enumerates sixty dimensions, while most Data Quality Management (DQM) software vendors usually offer five or six.

         

        The nine dimensions of Data Quality

        At Zeenea, we believe that the ideal compromise is to take into account nine Data Quality dimensions: completeness, accuracy, validity, uniqueness, consistency, timeliness, traceability, clarity, and availability.

        We will illustrate these nine dimensions and the different concepts we refer to in this publication with a straightforward example:

        Arthur is in charge of sending marketing campaigns to clients and prospects to present his company’s latest offers. He encounters, however, certain difficulties:

        • Arthur sometimes sends communications to the same people several times,
        • The emails provided in his CRM are often invalid,
        • Prospects and clients do not always receive the right content,
• Some information pertaining to the prospects is obsolete,
        • Some clients receive emails with erroneous gender qualifications,
        • There are two addresses for clients/prospects but it’s difficult to understand what they relate to,
        • He doesn’t know the origin of some of the data he is using or how he can access their source.

        Below is the data Arthur has at hand for his sales efforts. We shall use them to illustrate each of the nine dimensions of Data Quality:

Table: Arthur’s client and prospect data

           

          1. Completeness

          Is the data complete? Is there information missing? The objective of this dimension is to identify the empty, null, or missing data. In this example, Arthur notices that there are missing email addresses:

          Data Quality - Table Empty Email

          To remedy this, he could try and identify whether other systems have the information needed. Arthur could also ask data specialists to manually insert the missing email addresses.
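As a minimal sketch – assuming Arthur’s data sits in a pandas DataFrame with an email column, with illustrative values – a completeness check could look like this:

```python
import pandas as pd

# Hypothetical extract of Arthur's contact data.
contacts = pd.DataFrame({
    "name":  ["Anna Lincoln", "Lino Rodrigez", "Lisa Smith"],
    "email": ["annalincoln@apple", None, "lisa.smith@example.com"],
})

# Completeness check: flag rows where the email is null or empty.
is_missing = contacts["email"].isna() | (contacts["email"].fillna("").str.strip() == "")
print(f"{is_missing.sum()} contact(s) without an email address")
print(contacts[is_missing])
```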

           

          2. Accuracy

          Are the existing values coherent with the actual data, i.e., the data we find in the real world?

          Arthur noticed that some letters sent to important clients are returned because of incorrect postal addresses. Below, we can see that one of the addresses doesn’t match the standard address formats in the real world:

          Data Quality - Table Address

          It could be helpful here for Arthur to use postal address verification services.

          3. Validity

Does the data conform to the syntax of its definition? The purpose of this dimension is to ensure that the data conforms to a defined pattern or rule.

          Arthur noticed that he regularly gets bounced emails. Another problem is that certain prospects/clients do not receive the right content because they haven’t been accurately qualified. For example, the email address annalincoln@apple isn’t in the correct format and the Client Type Csutomer isn’t correct.

          Data Quality - Table Input Errors

          To solve this issue, he could for example make sure that the Client Type values are part of a list of reference values (Customer or Prospect) and that email addresses conform to a specific format.
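A minimal sketch of such validity rules, with an intentionally simplified email pattern and an illustrative list of reference values:

```python
import re
import pandas as pd

contacts = pd.DataFrame({
    "email":       ["annalincoln@apple", "lino.rodrigez@example.com"],
    "client_type": ["Csutomer", "Prospect"],
})

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple syntax rule
REFERENCE_CLIENT_TYPES = {"Customer", "Prospect"}           # list of reference values

contacts["email_is_valid"] = contacts["email"].apply(lambda e: bool(EMAIL_PATTERN.match(str(e))))
contacts["type_is_valid"] = contacts["client_type"].isin(REFERENCE_CLIENT_TYPES)
print(contacts)
```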

           

          4. Consistency

          Are the different values of the same record in conformity with a given rule? The aim is to ensure the coherence of the data between several columns.

          Arthur noticed that some of his male clients complain about receiving emails in which they are referred to as Miss. There does appear to be an incoherence between the Gender and Title columns for Lino Rodrigez.

          Data Quality - Table Title and Gender

To solve these types of problems, it is possible to create a logical rule ensuring that when the Gender is Male, the Title is Mr.
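A minimal sketch of such a cross-column rule, with illustrative data and a deliberately simplified Gender-to-Title mapping:

```python
import pandas as pd

contacts = pd.DataFrame({
    "name":   ["Lino Rodrigez", "Lisa Smith"],
    "gender": ["Male", "Female"],
    "title":  ["Miss", "Miss"],
})

# Consistency rule: the Title must match the Gender (simplified mapping for the example).
expected_title = {"Male": "Mr", "Female": "Miss"}
inconsistent = contacts[contacts["title"] != contacts["gender"].map(expected_title)]
print(inconsistent)
```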

          5. Timeliness

          Is the time lapse between the creation of the data and its availability appropriate? The aim is to ensure the data is accessible in as short a time as possible.

          Arthur noticed that certain information on prospects is not always up to date because the data is too old. As a company rule, data on a prospect that is older than 6 months cannot be used.

          Data Quality - Table Time Value

He could solve this problem by creating a rule that identifies and excludes data that is too old. An alternative would be to source the same information from another system that contains fresher data.
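A minimal sketch of the 6-month rule, assuming an illustrative last_updated column:

```python
import pandas as pd

prospects = pd.DataFrame({
    "name":         ["Anna Lincoln", "Lino Rodrigez"],
    "last_updated": pd.to_datetime(["2023-11-02", "2022-01-15"]),
})

# Timeliness rule: prospect data older than 6 months cannot be used.
cutoff = pd.Timestamp.now() - pd.DateOffset(months=6)
usable = prospects[prospects["last_updated"] >= cutoff]
too_old = prospects[prospects["last_updated"] < cutoff]

print("Usable records:", len(usable))
print("Excluded as too old:", len(too_old))
```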

          6. Uniqueness

          Are there duplicate records? The aim is to ensure the data is not duplicated.

          Arthur noticed he was sending the same communications several times to the same people. Lisa Smith, for instance, is duplicated in the folder:

          Data Quality - Table Double

          In this simplified example, the duplicated data is identical. More advanced algorithms such as Jaro, Jaro-Winkler, or Levenshtein, for example, can regroup duplicated data more accurately.
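A minimal sketch of both cases – exact duplicates and near-duplicates – using illustrative data; the standard-library difflib similarity below stands in for the Jaro-Winkler or Levenshtein measures a real DQM tool would use:

```python
from difflib import SequenceMatcher
import pandas as pd

contacts = pd.DataFrame({
    "name":  ["Lisa Smith", "Lisa Smith", "Liza Smyth", "Lino Rodrigez"],
    "email": ["lisa.smith@example.com", "lisa.smith@example.com",
              "l.smyth@example.com", "lino@example.com"],
})

# Exact duplicates are the easy case.
print("Exact duplicates:\n", contacts[contacts.duplicated()])

# Near-duplicates: compare name pairs with a simple similarity ratio.
names = contacts["name"].drop_duplicates().tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = SequenceMatcher(None, names[i], names[j]).ratio()
        if score >= 0.8:
            print(f"Possible duplicate: {names[i]!r} ~ {names[j]!r} (similarity {score:.2f})")
```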

          7. Clarity

Is the metadata easy for the data consumer to understand? The aim here is to understand the significance of the data and avoid misinterpretation.

          Arthur has doubts about the two addresses given as it is not easy to understand what they represent. The names Street Address 1 and Street Address 2 are subject to interpretation and should be modified, if possible.

Data Quality - Clarity

Renaming fields within a database is often a complicated operation; at a minimum, each field should be documented with a description.

          8. Traceability

Is it possible to trace the data back to its origin? The aim is to identify where the data comes from, along with any transformations it may have gone through.

          Arthur doesn’t really know where the data comes from or where he can access the data sources. It would have been quite useful for him to know this as it would have ensured the problem was fixed at the source. He would have needed to know that the data he is using with his marketing tool originates from the data of the company data warehouse, itself sourced from the CRM tool.

          Data Quality - CRM

          9. Availability

          How can the data be consulted or retrieved by the user? The aim is to facilitate access to the data.

          Arthur doesn’t know how to easily access the source data. Staying with the previous schema, he wants to effortlessly access data from the data warehouse or the CRM tool. 

          In some cases, Arthur will need to make a formal request to access this information directly.

          Get our Data Quality Management guide for data-driven organizations

          For more information on Data Quality and DQM, download our free guide: “A guide to Data Quality Management” now!


The 7 lies of Data Catalog Providers – #2 A Data Catalog is NOT a Data Quality Management Solution

          The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets.

           These players have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on data catalog functionalities themselves, these companies attempt to convince – with degrees of success proportional to their marketing budgets – that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

          The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

          A Data Catalog is NOT a Data Quality Management (DQM) Solution

           

We at Zeenea do not underestimate the importance of data quality in successfully delivering a data project – quite the contrary. It just seems absurd to us to put it in the hands of a solution that, by its very nature, cannot perform the controls at the right time.

          Let us explain.

There is a very elementary rule to quality control, a rule that can be applied in virtually any domain where quality is an issue, be it an industrial production chain, software development, or the cuisine of a 5-star restaurant: the sooner a problem is detected, the less it costs to correct.

To demonstrate the point: a car manufacturer does not wait until a new vehicle is fully built – when all the production costs have already been incurred and solving a defect would cost the most – to test its battery. No. Each piece is closely controlled, each step of production is tested, defective pieces are removed before ever being integrated into the production circuit, and the entire production chain can be halted if quality issues are detected at any stage. Quality issues are corrected at the earliest possible stage of the production process, where they are the least costly to fix and the fixes are the most durable.

           

“In a modern data organization, data production rests on the same principles. We are dealing with an assembly chain whose aim is to feed uses with high added value. Quality control and correction must happen at each step. The nature and level of controls will depend on what the data is used for.”

           

          If you are handling data, you obviously have at your disposal pipelines to feed your uses. These pipelines can involve dozens of steps – data acquisition, data cleaning, various transformations, mixing various data sources, etc.

In order to develop these pipelines, you probably have a number of technologies at play, anything from in-house scripts to costly ETLs and exotic middleware tools. It is within those pipelines that you need to insert and pilot your quality controls, as early as possible, adapting them to what is at stake for the end product. Measuring data quality levels only at the end of the chain isn’t just absurd, it’s totally inefficient.
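As a minimal sketch of that principle, with entirely hypothetical pipeline steps, the quality check sits right after acquisition, before any costly transformation, rather than at the end of the chain:

```python
import pandas as pd

def acquire() -> pd.DataFrame:
    # Stand-in for the real acquisition step (API pull, file drop, CDC, ...).
    return pd.DataFrame({"email": ["anna@example.com", None], "amount": [10.0, -5.0]})

def check_quality(df: pd.DataFrame) -> pd.DataFrame:
    # Controls adapted to what is at stake for the end product.
    issues = []
    if df["email"].isna().any():
        issues.append("missing emails")
    if (df["amount"] < 0).any():
        issues.append("negative amounts")
    if issues:
        # Halting (or quarantining) here is far cheaper than fixing reports downstream.
        raise ValueError("Quality check failed after acquisition: " + ", ".join(issues))
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(amount_eur=df["amount"] * 1.0)

try:
    result = transform(check_quality(acquire()))  # the control happens early, not at the end
except ValueError as err:
    print(err)
```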

It is therefore difficult to see how a Data Catalog (whose purpose is to inventory and document all potentially usable datasets in order to facilitate data discovery and usage) can be a useful tool to measure and manage quality.

A Data Catalog operates on available datasets, on any systems that contain data, and should be as non-invasive as possible in order to be deployed quickly throughout the organization.

A DQM solution works on the data feeds (the pipelines), focuses on production data, and is, by design, intrusive and time-consuming to deploy. We cannot think of any software architecture that can tackle both issues without compromising the quality of either one.

           

          Data Catalog vendors promising to solve your data quality issues are, in our opinion, in a bind and it seems unlikely they can go beyond a “salesy” demo.

           

          As for DQM vendors (who also often sell ETLs), their solutions are often too complex and costly to deploy as credible Data Catalogs.

          The good news is that the orthogonal nature of data quality and data cataloging makes it easy for specialized solutions in each domain to coexist without encroaching on each other’s lane.

Indeed, while a data catalog isn’t purposed for quality control, it can exploit the information on the quality of the datasets it contains, which obviously provides many benefits.

The Data Catalog can use this metadata, for example, to share the information (and any alerts it may identify) with data consumers. It can also use it to adjust its search and recommendation engine and thus steer users towards higher-quality datasets.

          And both solutions can be integrated at little cost with a couple of APIs here and there.
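To make that concrete, here is a hedged sketch of such an integration: a DQM job pushes its latest quality scores to a catalog’s REST API so the results appear next to the dataset. The endpoint, payload shape, and authentication are entirely hypothetical and do not reflect any specific product’s API:

```python
import requests

# Hypothetical catalog endpoint for attaching quality results to a dataset.
CATALOG_API = "https://catalog.example.com/api/v1/datasets/crm_contacts/quality"

quality_report = {
    "completeness": 0.92,
    "validity": 0.88,
    "uniqueness": 0.97,
    "checked_at": "2023-12-01T08:00:00Z",
}

response = requests.post(
    CATALOG_API,
    json=quality_report,
    headers={"Authorization": "Bearer <api-token>"},  # placeholder credential
    timeout=10,
)
response.raise_for_status()
print("Quality results published to the catalog")
```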

           

          Take Away

          Data quality needs to be assessed as early as possible in the pipeline feeds.

The role of the Data Catalog is not to perform quality control but to share the results of these controls as widely as possible. By their nature, Data Catalogs are bad DQM solutions, and DQM solutions are mediocre and overly complex Data Catalogs.

          An integration between a DQM solution and a Data Catalog is very straightforward and is the most pragmatic approach.

          Download our eBook: The 7 lies of Data Catalog Providers for more!

          Data quality management: the ingredients to improve the efficiency of your data


Having large volumes of data is useless if it is of poor quality. The challenge of Data Quality Management is a major priority for companies today. As a decision-making tool, used for managing innovation as well as customer satisfaction, monitoring data quality requires a great deal of rigor and method.

          Producing data for the sake of producing data, because it’s trendy, because your competitors are doing it, because you read about it in the press or on the Internet; all that is in the past. Today, no business sector denies the eminently strategic nature of data. 

However, the real challenge surrounding data is that of its quality. According to the 2020 edition of the Gartner Magic Quadrant for Data Quality Solutions, more than 25% of critical data in large companies is incorrect. This puts enterprises in a situation that generates direct and indirect costs: strategic errors, bad decisions, various costs associated with data management… The average cost of poor data quality is estimated at 11 million euros per year.

          Why is that? 

          Simply because from now on, all of your company’s strategic decisions are guided by the knowledge of your customers, your suppliers, and your partners. If we consider that data is omnipresent in your business, Data Quality becomes a priority issue. 

          Gartner is not the only one to underline this reality. At the end of 2020, IDC revealed in a study that companies are facing many challenges with their data. Nearly 2 out of 3 companies consider the identification of relevant data as a challenge, 76% of them consider that data collection can be improved, and 72% think that their data transformation processes for analysis purposes could be improved.

          Data Quality Management: A demanding discipline

Just as in cooking, the better the quality of your ingredients, the more your guests will appreciate your recipe. Because data must lead to better analyses and therefore to better decisions, it is essential to ensure that it is of good quality.

But what is quality data? Several criteria can be taken into account: the accuracy of the data (a complete telephone number), its conformity (a number composed of 10 digits preceded by a national prefix), its validity (it is still in use), its reliability (it allows you to reach your correspondent), etc.

For efficient Data Quality Management, you need to make sure that all the criteria you have defined for considering data to be of good quality are fulfilled. But be careful! Data must be updated and maintained to preserve its quality over time and avoid it becoming obsolete. Obsolete data – or data that is not updated, shared, or used – instantly loses its value because it no longer contributes effectively to your thinking, your strategies, and your decisions.

           

          Data Quality Best Practices

To guarantee the integrity, coherence, accuracy, validity – in a word, the quality – of your data, you must act with the right methodology. The first essential step of an efficient Data Quality Management project is to avoid duplication. Beyond acting as dead weight in your databases, duplicates distort analyses and can undermine the relevance of your decisions.

If you choose a Data Quality Management tool, make sure it includes a module that automates the exploitation of metadata. By centralizing all the knowledge you have about your data within a single interface, you make that data much easier to exploit. This is the second pillar of your Data Quality Management project.

Precisely defining your data and its taxonomy allows you to efficiently engage the quality optimization process. Then, once your data has been clearly identified and classified, it is a matter of putting it into perspective with the expectations of the various business lines within the company in order to assess its quality.

          This work of reconciliation between the nature of the available data and its use by the business lines is a decisive element of Data Quality management. But it is also necessary to go further and question the sensitivity of the data. Whether or not the data is sensitive depends on your choices in relation to the challenge of regulatory compliance.

Since the GDPR came into force in 2018, the consequences of risky choices in terms of data security have been severe, and not only from a financial point of view. Indeed, your customers are now very sensitive to the nature, use, and protection of the data they share with you.

          By effectively managing Data Quality, you also contribute to maintaining trust with your customers… And customer trust is priceless!