Manufacturing Data Success Story: Total

by Zeenea Software | Jan 13, 2021 | Data inspiration


Total, one of the seven “supermajor” oil companies, opened its Digital Factory in Paris earlier this year. The Digital Factory will bring together up to 300 different profiles, such as developers, data scientists and other digital experts, to accelerate the Group’s digital transformation.

More specifically, Total’s Digital Factory aims to develop the digital solutions Total needs to improve operational availability and costs and to offer new services to its customers. Its priorities center on managing and controlling energy consumption, extending the Group’s reach to new distributed energies, and providing more environmentally friendly solutions. Total’s ambition is to generate $1.5 billion in value per year for the company by 2025.

During France’s Best Developer 2019 contest, Patrick Pouyanné, Chairman and Chief Executive Officer of Total, stated:

 “I am convinced that digital technology is a critical driver for achieving our excellence objectives across all of Total’s business segments. Total’s Digital Factory will serve as an accelerator, allowing the Group to systematically deploy customized digital solutions. Artificial intelligence (AI), the Internet of Things (IoT) and 5G are revolutionizing our industrial practices, and we will have the know-how in Paris to integrate them in our businesses as early as possible. The Digital Factory will also attract the new talent essential to our company’s future.”

 

Who makes up the Digital Factory teams?

In an interview with Forbes this past October, Frédéric Gimenez, Chief Digital Officer and Head of the project, described how the teams will be structured within the Digital Factory.

As mentioned above, the team will have around 300 different profiles, all working with agile methodologies: managerial lines will be flattened, teams will have a great deal of autonomy, and development cycles will be kept short in order to “test and learn” quickly and efficiently.

Gimenez explains that there will be multiple teams in his Digital Factory:

  • A Data Studio made up of data scientists. Total’s CDO (Chief Data Officer) will be in charge of this team, whose main missions will be to acculturate the enterprise to data and to manage the Digital Factory’s data competencies.
  • A pool of developers and agile coaches. 
  • A Design Studio, which will bring together UX and UI professionals. They will help come up with creative ideas and will be involved not only at the analysis stage of Total’s business projects but also during the customer journey stages.
  • A “Tech Authority” team, in charge of the security and architecture of their data ecosystem, in order to effectively transform their legacy in a digital environment.
  • A platform team, in charge of various data storages such as their Cloud environment, their data lake, etc.
  • A Product & Value office in charge of managing the Digital Factory portfolio, assessing the value of projects with the business and analyzing all the use cases submitted to the Digital Factory.
  • An HR team and a general secretariat.
  • Product Owners who come from all over the world. They are trained in agile methods on arrival and then immersed in their project for 4 to 6 months. They then support the transformation when they return to their jobs.

These teams will soon be brought together in a 5,500 m² workspace in the heart of Paris’s 2nd arrondissement, an open space designed to foster creativity and innovation.

How governance works at Total’s Digital Factory

Gimenez explained that the business lines are responsible for their Digital Factory use cases. The Digital Factory analyzes the eligibility of their use cases through four criteria:

  • Value delivered during the first iteration and during scale-up
  • Feasibility (technology / data)
  • Customer appetite / internal impact
  • Scalability

An internal committee at the Digital Factory then decides whether or not the use case is taken on, and the final decision is validated by Gimenez himself. For good coordination with the business lines, the digital representatives of the branches are also located in the Digital Factory. They are responsible for acculturating the business lines and steering the generation of ideas, but also for ensuring the consistency of their branch’s digital initiatives with the Group’s ambitions. Total calls them Digital Transformation Officers.

 

First success of Total’s Digital Factory

The Digital Factory started this past March and deployed its first squads in April, during the coronavirus lockdown in France. In the Forbes interview, Gimenez explained that 16 projects are in progress, with a target of 25 squads in permanent operation.

The first two digital solutions will be delivered by the end of this year:

  • A tool for Total Direct Energie to assist customers in finding the best payment schedule using algorithms and data
  • A logistics optimization solution based on IoT trucks for the Marketing and Services branch, which will be deployed in 40 subsidiaries.

In addition, Total managed to attract experts such as data scientists (despite still communicating through a very limited set of channels, such as Welcome to the Jungle and LinkedIn) and to retain them by offering a diversity of projects.

“We are currently carrying out a first assessment of what has worked and what needs to be improved, we are in a permanent adaptation process.” stated Gimenez.

 

Digital Factory in the future?

Gimenez ended the Forbes interview by saying that the main reason for his project’s success is the general mobilization that everyone maintained despite the health crisis: “We received more use cases than we are able to deliver (50 projects per year to continuously feed our 25 squads)!”

Beyond that, Total tracks two major sets of KPIs:

  • measuring the proper functioning of the squads by examining the KPIs of their agile methodologies
  • tracking the value generated

 

Are you interested in unlocking data access for your company?

Are you in the manufacturing industry? Get the keys to unlocking data access for your company by downloading our new white paper “Unlock data for the manufacturing industry” 

Download our white paper
IoT in manufacturing: why your enterprise needs a data catalog

by Zeenea Software | Jan 12, 2021 | Data inspiration


Digital transformation has become a priority in organizations’ business strategies, and manufacturing industries are no exception to the rule! With stronger customer expectations, increased customization demands, and the complexity of the global supply chain, manufacturers need to find new, more innovative products and services. In response to these challenges, manufacturing companies are increasingly investing in IoT (Internet of Things).

In fact, the IoT market has grown exponentially over the past few years. IDC reports that the IoT footprint is expected to reach $1.2 trillion in 2022, while Statista estimates its economic impact could be between $3.9 and $11.1 trillion by 2025.

In this article, we define what IoT is, present some manufacturing-specific use cases, and explain why a Zeenea Data Catalog is an essential tool for manufacturers to advance their IoT implementations.

What is IoT?

A quick definition 

According to TechTarget, the Internet of Things (IoT) “is a system of interrelated computing devices, mechanical and digital machines, objects, or people that are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.”

A “thing” in the IoT can therefore be a person with a heart monitor implant, an automobile that has built-in sensors to alert the driver when tire pressure is low or any other object that can be assigned an ID and is able to transfer data over a network.

From a manufacturing point of view, IoT is a way to digitize industry processes. Industrial IoT employs a network of sensors to collect critical production data and uses various software to turn this data into valuable insights about the efficiency of the manufacturing operations.

 

IoT use cases in manufacturing industries

Currently, many IoT projects deal with facility and asset management, security and operations, logistics, customer servicing, etc. Here is a list of examples of IoT use cases in manufacturing:

 Predictive maintenance

For industries, unexpected downtime and breakdowns are the biggest issues. Hence manufacturing companies realize the importance of identifying the potential failures, their occurrences and consequences. To overcome these potential issues, organizations now use machine learning for faster and smarter data-driven decisions.

With machine learning, it becomes easy to identify patterns in the available data and predict machine outcomes. This works by identifying the correct data set and combining it with real-time data fed from the machine. This kind of information allows manufacturers to estimate the current condition of machinery, detect warning signs, transmit alerts, and activate the corresponding repair processes.

With predictive maintenance through the use of IoT, manufacturers can lower maintenance costs, reduce downtime and extend equipment life, thereby enhancing production quality by attending to problems before equipment fails.
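As an illustration of the approach described above, here is a minimal sketch of a predictive-maintenance classifier trained on historical sensor readings with scikit-learn. The file name, feature columns, and failure label are assumptions made for the example, not a description of any particular manufacturer’s pipeline.

```python
# Minimal predictive-maintenance sketch (illustrative only).
# Assumes a CSV of historical sensor readings with a binary "failure" label;
# all file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

readings = pd.read_csv("sensor_history.csv")
features = ["vibration", "temperature", "pressure", "runtime_hours"]
X, y = readings[features], readings["failure"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Machines whose predicted failure probability crosses a threshold would trigger
# an alert and the corresponding repair process.
failure_probability = model.predict_proba(X_test)[:, 1]
machines_to_inspect = X_test[failure_probability > 0.8]
```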

For instance, Medivators, one of the leading medical equipment manufacturers, successfully integrated IoT solutions across its service operations and saw an impressive 78% increase in the share of service events that could be easily diagnosed and resolved without any additional human resources.

Asset tracking

IoT asset tracking is one of the fastest growing phenomena across manufacturing industries. It is expected that by 2027, there will be 267 million active asset trackers in use worldwide for agriculture, supply chain, construction, mining, and other markets. 

While in the past manufacturers would spend a lot of time manually tracking and checking their products, IoT uses sensors and asset management software to track things automatically. These sensors continuously or periodically broadcast their location information over the internet and the software then displays that information for you to see. This therefore allows manufacturing companies to reduce the amount of time they spend locating materials, tools, and equipment.
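To make the mechanism concrete, here is a small, hypothetical sketch of the software side: collecting periodic location pings from asset trackers and keeping the latest known position of each asset. The message fields are assumptions made for the example.

```python
# Illustrative sketch: keep the latest known location of each tracked asset
# from periodic sensor pings. Field names are hypothetical.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LocationPing:
    asset_id: str
    latitude: float
    longitude: float
    timestamp: datetime

def latest_locations(pings: list[LocationPing]) -> dict[str, LocationPing]:
    """Return the most recent ping per asset, ready to display on a map."""
    latest: dict[str, LocationPing] = {}
    for ping in pings:
        current = latest.get(ping.asset_id)
        if current is None or ping.timestamp > current.timestamp:
            latest[ping.asset_id] = ping
    return latest
```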

A striking example of this can be found in the automotive industry, where IoT has helped significantly in the tracking of data for individual vehicles. For example, Volvo Trucks introduced connected-fleet services that include smart navigation with real-time road conditions based on information from other local Volvo trucks. In the future, more real-time data from vehicles will help weather analytics work faster and more accurately; for example, windshield wiper and headlight use during the day indicate weather conditions. These updates can help maximize asset usage by rerouting vehicles in response to weather conditions.

Another tracking example can be seen at Amazon, which uses Wi-Fi robots to scan QR codes on its products to track and triage orders. Imagine being able to track your inventory, including the supplies you have in stock for future manufacturing, at the click of a button. You’d never miss a deadline again! And again, all that data can be used to find trends that make manufacturing schedules even more efficient.

Driving innovation

By collecting and audit-trailing manufacturing data, companies can better track production processes and gather ever-larger amounts of data. That knowledge helps develop innovative products, services, and new business models. For example, JCDecaux Asia has developed its display strategy thanks to data and IoT. The objective was to gain a precise idea of people’s interest in the campaigns it ran, and to increasingly attract their attention through animations. “On some screens, we have installed small cameras, which allow us to measure whether people slow down in front of the advertisement or not,” explains Emmanuel Bastide, Managing Director for Asia at JCDecaux.

In the future, will display advertising be tailored to individual profiles? JCDecaux says that in airports, for example, it is possible to better target advertising according to the time of day or the landing of a plane coming from a particular country! By connecting to the airport’s arrival systems, the generated data can be sent to the display terminals, which can then show a specific advertisement to the arriving passengers.

 

Data catalog: one way to rule data for any manufacturer

To enable advanced analytics, collect data from sensors, guarantee digital security, and use machine learning and artificial intelligence, manufacturing industries need to “unlock data,” which means centralizing it in a smart, easy-to-use corporate “Yellow Pages” of the data landscape. For industrial companies, extracting meaningful insights from data is made simpler and more accessible with a data catalog.

A data catalog is a central repository of metadata enabling anyone in the company to access, understand and trust any data they need to achieve a particular goal.

 

Zeenea data catalog x IoT: the perfect match

Zeenea helps industries build an end-to-end information value chain. Our data catalog allows you to manage a 360° knowledge base using the full potential of the metadata of your business assets.

Zeenea success story in the manufacturing industry

In 2017, Renault Digital was born with the aim of transforming the Renault Group into a data-driven company. Today, this entity is made up of a community of experts in digital practices, capable of innovating while delivering, in an agile way, maximum value to the company’s business IT projects. In a conference in Zeenea’s Data Centric Exchange (French), Jean-Pierre Huchet, Head of Renault’s Data Lake, states that their main data challenges were:

  • Data was too siloed,
  • Complicated data access,
  • No clear and shared definitions of data terms,
  • Lack of visibility on personal / sensitive data,
  • Weak data literacy.

By choosing Zeenea Data Catalog as their data catalog software, they were able to overcome these challenges and more. Zeenea has today become an essential building block of Renault Digital’s data projects. Its success can be seen in:

  • Its integration into Renault Digital’s onboarding: mastering the data catalog is part of their training program.
  • Resilient documentation processes & rules implemented via Zeenea.
  • Hundreds of active users. 

Now, Zeenea is their main data catalog, with Renault Digital aiming for a clear vision of the data upstream and downstream of the hybrid data lake, a 360-degree view of how their data is used, as well as the creation of several thousand Data Explorers.

 

Zeenea’s unique features for manufacturing companies

At Zeenea, our data catalog has the following features to address your challenges:

  • Universal connectivity to all technologies used by leading manufacturers
  • Flexible metamodel templates adapted to manufacturers’ use-cases
  • Compliance with specific manufacturing standards through automatic data lineage
  • A smooth transition in becoming data literate through compelling user experiences 
  • An affordable platform with a fast return on investment (ROI) 

Are you interested in unlocking data access for your company?

Are you in the manufacturing industry? Get the keys to unlocking data access for your company by downloading our new white paper “Unlock data for the manufacturing industry” 

Download our white paper
How has data impacted the manufacturing industry?

by Zeenea Software | Jan 11, 2021 | Data inspiration


The place of data is – or should be – central to a manufacturing industry’s strategy. From production flow optimization through predictive maintenance to customization, data exploitation is a major lever for transforming the industry. However, with great data comes great responsibility! Here are some explanations.

The manufacturing industry is already on the way to becoming data-driven. In the 2020 edition of the Industry 4.0 Barometer, Wavestone reveals that 86% of respondents say they have launched Industry 4.0 projects. From the deployment of IoT platforms and complete redesigns of legacy IT architectures to moves towards the Cloud and data lake implementations, data is at the heart of the manufacturing industry’s transformation challenges.

“In 2020, we are starting to see more and more projects around data, algorithmics, artificial intelligence, machine learning, chatbots, etc.” Wavestone explains. 

All sectors are impacted by this transformation. According to Netscribes Market Research forecasts, the global automotive IoT market, for example, is expected to reach $106.32 billion by 2023. The driving force behind the adoption of data-driven strategies in the industry is the need for increased productivity at lower cost.

What are the data challenges in the manufacturing industry?

The use of data in the manufacturing industry is also a question of responding to a key issue: the mass-customization of production, a growing topic that particularly affects the automotive sector. Each consumer is unique and expects products that reflect them. In the past, however, manufacturing industries based their production methods on production volume and industry-specific standards!

Mass-customization of production is, therefore, the lever of the data-driven revolution currently underway in the manufacturing industry. Nevertheless, other considerations come into play as well. A “smart” industrial tool makes it possible for these enterprises to reduce production costs and delays as well as respond to the general acceleration of time-to-market. Data also contributes to meeting ecological challenges by reducing the environmental footprint of production machines.

Whether it is integrating IoT, Big Data, Business Intelligence, or Machine Learning, these technologies are all opportunities to reinvent a new data-based industry (embedded sensors, connected machines and products, Internet of Things, virtualization). 

But behind these prospects lie many challenges. The first of these is the extremely rigorous General Data Protection Regulation (GDPR), in force in Europe since May 2018. The omnipresence of data in the industrial world has not escaped criminal organizations and cybercriminals, who have been multiplying attacks on industry players’ IT infrastructures since the infamous WannaCry ransomware of 2017.

This attention is fueled by another difficulty in the industrial sector: older, legacy IT environments, often described as technological hassles, that multiply potential vulnerabilities. The heterogeneity of data sources is another sensitive difficulty for the manufacturing industry. Marketing data, product data, and logistics data are often highly siloed and difficult to reconcile in real time.

The benefits of data for the manufacturing industry

According to the Wavestone Barometer statistics, 74% of the companies surveyed recorded tangible results within 2 years. Nearly 7 out of 10 companies (69%) report a reduction in costs, and 68% report an improvement in the quality of services, products or processes. 

On average, transformation programs regarding the creation or processing of data have led to the optimization of energy performance by 20 to 30% and a reduction in downtime from better monitoring of equipment that can reach up to 40% in some sectors. 

Increased traceability of operations and tools and real-time supervision of the operating conditions of production equipment all contribute to preventing errors and optimizing product tracking, but also to detecting new innovation levers through the analysis of weak signals, thanks to AI solutions for example.

At the heart of the manufacturing industry’s transformation is the need to rely on data integration and management solutions that are powerful, stable and ergonomic, to accelerate the adoption of a strong data culture.


Marquez: the metadata discovery solution at WeWork

by Zeenea Software | Dec 10, 2020 | Data inspiration, Metadata management


Created in 2010, WeWork is a global office & workspace leasing company. Their objective is to provide space for teams of any size, including startups, SMEs, and major corporations, to collaborate. To achieve this, what WeWork provides can be broken down into three different categories:

 

    • Space: To provide companies with optimal space, WeWork must offer the appropriate infrastructure, from bookable rooms for interviews or one-on-ones to entire buildings for huge corporations. They must also make sure spaces are equipped with the appropriate facilities, such as kitchens for lunch and coffee breaks, bathrooms, etc.
    • Community: Via WeWork’s internal application, the firm enables WeWork members to connect with one another, whether it’s local within their own WeWork space, or globally. For example, if a company is in need of feedback for a project from specific job titles (such as a developer or UX designer), they can directly ask for feedback and suggestions via the application to any member, regardless of their location.  
    • Services: WeWork also provides their members with full IT services if there are any problems as well as other services such as payroll services, utility services, etc.

In 2020, WeWork represents:

  • More than 600,000 memberships 
  • Locations in 127 cities in 33 different countries,
  • 850 offices worldwide,
  • Generated $1.82 billion in revenue.

It is clear that WeWork works with all sorts of data from their staff and customers, whether that be individuals or companies. The huge firm was therefore in need of a platform where their data experts could view, collect, aggregate, and visualize their data ecosystem’s metadata. This was resolved by the creation of Marquez. 

This article will focus on WeWork’s implementation of Marquez mainly through free & accessible documentation provided on various websites, to illustrate the importance of having an enterprise-wide metadata platform in order to truly become data-driven.  

 

Why manage & utilize metadata?  

In his talk “A Metadata Service for Data Abstraction, Data Lineage & Event-based Triggers” at the Data Council back in 2018, Willy Lulciuc, Software Engineer for the Marquez project at WeWork explained that metadata is crucial for three reasons:

Ensuring data quality: when data has no context, it is hard for data citizens to trust their data assets: are there fields missing? Is the documentation up to date? Who is the data owner and are they still the owner? These questions are answered through the use of metadata.

Understanding data lineage: knowing your data’s origins and transformations is key to truly knowing what stages your data went through over time.

Democratization of datasets: According to Willy Lulciuc, democratizing data in the enterprise is critical! Having a central portal or UI available for users to be able to search for and explore their datasets is one of the most important ways companies can truly create a self-service data culture. 


To sum up: creating a healthy data ecosystem! Willy explains that being able to manage and utilize metadata creates a sustainable data culture where individuals no longer need to ask for help to find and work with the data they need. In his slide, he goes through three different categories that make up a healthy data ecosystem:

  1. Being a self-service ecosystem, where data and business users have the possibility to discover the data and metadata they need, and to explore the enterprise’s data assets when they don’t know exactly what they are searching for. Providing data with context gives all users and data citizens the ability to work effectively on their data use cases.
  2. Being self-sufficient, by giving data users the freedom to experiment with their datasets as well as the flexibility to work on every aspect of them, whether input or output datasets for example.

  3. And finally, instead of relying on certain individuals or groups, a healthy data ecosystem allows for all employees to be accountable for their own data. Each user has the responsibility to know their data, their costs (is this data producing enough value?) as well as keeping track of their data’s documentation in order to build trust around their datasets. 

Room booking pipeline before

As mentioned above, utilizing metadata is crucial for data users to be able to find the data they need. In his presentation, Willy shared a real situation to prove metadata is essential: WeWork’s data pipeline for booking a room. 

 For a “WeWorker”, the steps are as follows:

  1. Find a location (the example was a building complex in San Francisco)
  2. Choose the appropriate room size (usually split by the number of attendees; in this case they chose a room that could accommodate 1 to 4 people)
  3. Choose the date for when the booking will take place
  4. Decide on the time slot the room is booked for as well as the duration of the meeting
  5. Confirm the booking

Now that we have an example of how their booking pipeline works, Willy proceeds to demonstrate how a typical data team would operate when pulling data on WeWork’s bookings. In this case, the exercise was to find the building with the most room bookings and extract that data to send to management (a minimal sketch of such a job follows the list below). The steps he stated were the following:

  • Read the room bookings from a data source (usually unknown), 
  • Sum up all of the room bookings and return the top locations, 
  • Once the top location is calculated, the next step is to write it into some output data source,
  • Run the job once an hour,
  • Process the data through .csv files and store it somewhere.
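Under the assumptions above (bookings arriving as .csv files), a minimal sketch of such an hourly job might look like this; the file names and columns are invented for illustration.

```python
# Illustrative hourly job: read room bookings from a CSV source, find the
# location with the most bookings, and write the result to an output CSV.
# File and column names are hypothetical.
import csv
from collections import Counter

def top_booking_location(input_path: str, output_path: str) -> None:
    with open(input_path, newline="") as f:
        bookings = list(csv.DictReader(f))  # e.g. booking_id, location, room_size, start_time

    counts = Counter(row["location"] for row in bookings)
    location, total = counts.most_common(1)[0]

    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["location", "bookings"])
        writer.writerow([location, total])

# Scheduled to run once an hour, e.g. via cron: 0 * * * *
if __name__ == "__main__":
    top_booking_location("room_bookings.csv", "top_location.csv")
```

Nothing in this job records where the input came from, who owns it, or how fresh it is, which is exactly what the questions below highlight.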

However, Willy stated that even though these steps seem good enough, problems usually occur. He goes over three types of issues encountered during the job process:

  1. Where can I find the job input’s dataset?
  2. Does the dataset have an owner? Who is it? 
  3. How often is the dataset updated? 

Most of these questions are difficult to answer and jobs end up failing… Without being sure of and trusting this information, it can be hard to present numbers to management! These sorts of problems are what led WeWork to develop Marquez!

What is Marquez?

Willy defines the platform as an “open-sourced solution for the aggregation, collection, and visualization of metadata of [WeWork’s] data ecosystem”. Indeed, Marquez is a modular system and was designed as a highly scalable, highly extensible platform-agnostic solution for metadata management. It consists of the following components:

Metadata Repository: Stores all job and dataset metadata, including a complete history of job runs and job-level statistics (i.e. total runs, average runtimes, success/failures, etc).

Metadata API: RESTful API enabling a diverse set of clients to begin collecting metadata around dataset production and consumption.

Metadata UI: Used for dataset discovery, connecting multiple datasets and exploring their dependency graph.

Marquez’s design

Marquez provides language-specific clients that implement the Metadata API. This enables a  diverse set of data processing applications to build a metadata collection. In their initial release, they provided support for both Java and Python. 

The Metadata API extracts information around the production and consumption of datasets. It’s a stateless layer responsible for specifying both metadata persistence and aggregation. The API allows clients to collect and/or obtain dataset information to/from the Metadata Repository.

Metadata needs to be collected, organized, and stored in a way to allow for rich exploratory queries via the Metadata UI. The Metadata Repository serves as a catalog of dataset information encapsulated and cleanly abstracted away by the Metadata API.

According to Willy, what makes a very strong data ecosystem is the ability to search for information and datasets. Datasets in Marquez are indexed and ranked by a search engine based on keywords or phrases as well as on a dataset’s documentation: the more context a dataset has, the more likely it is to appear first in the search results. Examples of a dataset’s documentation are its description, owner, schema, tags, etc.
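As a purely illustrative sketch of the collection flow described above, here is how a client might register dataset metadata against a Marquez-style REST metadata API using the requests library. The base URL, endpoint path, namespace, and payload fields are assumptions for the example and should be checked against the actual Marquez API documentation rather than taken as its exact routes.

```python
# Illustrative sketch of pushing dataset metadata to a Marquez-style REST API.
# The base URL, endpoint path, and payload fields are assumptions, not the
# verified Marquez routes.
import requests

METADATA_API = "http://localhost:5000/api/v1"  # hypothetical local deployment

dataset_metadata = {
    "description": "Hourly room bookings per location",
    "fields": [
        {"name": "location", "type": "VARCHAR"},
        {"name": "bookings", "type": "INTEGER"},
    ],
    "tags": ["rooms", "bookings"],
}

# Register (or update) the dataset under a namespace so it becomes
# documented, searchable, and rankable in the Metadata UI.
response = requests.put(
    f"{METADATA_API}/namespaces/room-bookings/datasets/top_location",
    json=dataset_metadata,
    timeout=10,
)
response.raise_for_status()
print(response.json())
```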

You can see more detail of Marquez’s data model in the presentation itself here → https://www.youtube.com/watch?v=dRaRKob-lRQ&ab_channel=DataCouncil 


The future of data management at WeWork

Two years after the project began, Marquez has proven to be a big help for the giant leasing firm. Their long-term roadmap focuses solely on the solution’s UI, adding more visualizations and graphical representations in order to provide simpler and more engaging ways for users to interact with their data.

They also provide various online communities via their Github page, as well as groups on LinkedIn for those who are interested in Marquez to ask questions, get advice or even report issues on the current Marquez version. 

Sources

A Metadata Service for Data Abstraction, Data Lineage & Event-based Triggers, WeWork. Youtube: https://www.youtube.com/watch?v=dRaRKob-lRQ&ab_channel=DataCouncil

29 Stunning WeWork Statistics – The New Era Of Coworking, TechJury.com: https://techjury.net/blog/wework-statistics/

Marquez: Collect, aggregate, and visualize a data ecosystem’s metadata, https://marquezproject.github.io/marquez/

Marquez: An Open Source Metadata Service for ML Platforms, Willy Lulciuc

Air France: Their Big Data strategy in a hybrid cloud context

by Zeenea Software | Oct 22, 2020 | Data inspiration


Air France-KLM is the leading group in terms of international traffic departing from Europe. The airline is a member of the SkyTeam alliance, which consists of 19 different airlines and offers access to a global network of more than 14,500 daily flights to over 1,150 destinations around the world. In 2019, Air France represented:

  • 104.2 million passengers,
  • 312 destinations,
  • 119 countries,
  • 546 aircraft,
  • 15 million members enrolled in their “Flying Blue” loyalty program*,
  • 2,300 flights per day*. 

At the Big Data Paris 2020, Eric Poutrin, Lead Enterprise Architect Data Management & Analytics at Air France, explained how the airline business works, what Air France’s Big Data structure started as, and how their data architecture is today in the context of a hybrid cloud structure.


How does an airline company work?

Before we start talking about data, it is imperative to understand how an airline company works from the creation of its flight path to its landing. 

Before planning a route, the first step for an airline such as Air France is to have a flight schedule. Note that in times of health crises, they are likely to change quite frequently. Once the flight schedule is set up, there are three totally separate flows that activate for a flight to have a given departure date and time: 

  • the flow of passengers, which represents different forms of services to facilitate the traveler’s experience along the way, from buying tickets on their various platforms (web, app, physical) to the provision of staff or automatic kiosks in various airports to help travelers check in, drop off their luggage, etc.

  • the flow of crew management, with profiles adapted to the qualifications required to operate or pilot the aircraft, as well as the management of flight attendant schedules.

  •  the engineering flow which consists of getting the right aircraft with the right configuration at the right parking point.

However, Eric tells us that all this… is in an ideal world: 

“The “product” of an airline goes through the customer, so all of the hazards are visible. And, they all impact each other’s flows! So the closer you get to the date of the flight, the more critical these hazards become.”

Following these observations, 25 years ago now, Air France decided to set up a “service-oriented” architecture, which allows, among other things, the notification of subscribers in the event of hazards on any flow. These real-time notifications are pushed either to agents or passengers according to their needs: prevention of technical difficulties (an aircraft breaking down), climate hazards, prevention of delays, etc.

“The objective was to bridge the gap between a traditional analytical approach and a modern analytical approach based on omni-present, predictive and prescriptive analysis on a large scale” affirmed Eric.

Air France’s Big Data journey


The timeline

In 1998, Air France began their data strategy by setting up an enterprise data warehouse on the commercial side, gathering customer, crew and technical data that allowed the company’s IT teams to build analysis reports. 

Eric tells us that in 2001, following the SARS (Severe Acute Respiratory Syndrome) health crisis, Air France had to redeploy its aircraft following the ban on incoming flights to the United States. It was the firm’s data warehouse that allowed it to find other sources of revenue, thanks to machine learning and artificial intelligence algorithms. This way of working with data served the firm well for 10 years and even allowed it to overcome several other difficulties, including the tragedy of September 11, 2001 and the crisis of rising oil prices.

In 2012, Air France’s data teams decided to implement a Hadoop platform in order to perform predictive or prescriptive analysis (depending on individual needs) in real time, as the data warehouse no longer met these new needs or the high volume of information to be managed. Within just a few months of implementing Hadoop, Kafka, and other new-generation technologies, the firm was able to obtain much “fresher” and more relevant data.
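The article names Kafka but does not detail how Air France uses it; as a purely illustrative sketch, a flight-disruption event could be published to a Kafka topic with the kafka-python package so that downstream notification services (for agents or passengers) can react in near real time. The broker address, topic name, and payload fields are assumptions.

```python
# Illustrative sketch: publish a flight-disruption event to a Kafka topic.
# Broker address, topic name, and payload fields are assumptions for the example.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {
    "flight": "AF1234",
    "type": "technical_delay",
    "estimated_delay_minutes": 45,
}

# Subscribers (agent dashboards, passenger notification services) consume the topic.
producer.send("flight-disruptions", value=event)
producer.flush()
```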

Since then, the teams have been constantly improving and optimizing their data ecosystem in order to stay up to date with new technologies and thus, allow data users to work efficiently with their analysis.

Air France’s data challenges

During the conference, Eric also presented the firm’s data challenges in the implementation of a data strategy:

  • Delivering a reliable analytics ecosystem with quality data,
  • Implementing technologies adapted for all profiles and their use cases regardless of their line of business,
  • Having an infrastructure that supports all types of data in real time. 

Air France was able to resolve some of these issues with the implementation of a robust architecture (which notably enabled the firm to withstand the COVID-19 crisis), as well as the setting up of dedicated teams, the deployment of applications and the security structures, particularly regarding the GDPR and other pilot regulations. 

However, Air France-KLM has not finished working to meet its data challenges. With ever-increasing volumes of data and a growing number of data and business users, managing data flows across the enterprise’s different channels is a constant governance effort:

“We must always be at the service of the business, and as people and trends change, it is imperative to make continuous efforts to ensure that everyone can understand the data”.


Air France’s unified data architecture

The Unified Data Architecture (UDA) is the cornerstone of Air France. Eric explains that there are four types of platforms:

The data discovery platform 

Separated into two different platforms, these are the applications of choice for data scientists and citizen data scientists. They make it possible, among other things, to:

  • extract the “knowledge” from the data,
  • process unstructured data, (text, images, voice, etc.)
  • have predictive analytics support to understand customer behaviors

A data lake 

Air France’s data lake is a logical instance and is accessible to all the company’s employees, regardless of their profession. However, Eric specifies that the data is well secured: “The data lake is not an open bar at all! Everything is done under the control of the data officers and data owners“. The data lake:

  • stores structured and unstructured data,
  • combines the different data sources from various businesses,
  • provides a complete view of a situation, a topic or a data environment,
  • is very scalable.

“Real Time Data Processing” platforms 

To put this data to work, Air France has implemented 8 real-time data processing platforms to meet the needs of each “high priority” business use case. For example, they have a platform for predictive maintenance, one for customer behavior knowledge, and one for process optimization during stopovers.

Eric confirms that when an event or hazard occurs, their platform is able to push recommendations in “real time” in just 10 seconds!

Data Warehouses 

As mentioned above, Air France had also already set up data warehouses to store external data, such as customer and partner data and data from operational systems. These data warehouses allow users to query these datasets in complete security, and they are an excellent vehicle for explaining the data strategy across the company’s different business lines.


The benefits of implementing a Hybrid Cloud architecture

Air France’s initial considerations regarding the move to the Cloud were:

  • Air France KLM aims to standardize its calculation and storage services as much as possible.
  • Not all data is eligible to leave Air France’s premises due to regulations or sensitive data.
  • All the tools already used in UDA platforms are available both on-premise and in the public cloud.

Éric says that a hybrid Cloud architecture would allow the firm to have more flexibility to meet today’s challenges:

“Putting our UDA on the Public Cloud would give greater flexibility to the business and more options in terms of data deployment.”

According to Air France, here is the checklist of best practices before migrating to a Hybrid Cloud:

  • check if the data has a good reason to be migrated to the Public Cloud
  • check the level of sensitivity of the data (according to internal data management policies)
  • verify compliance with the UDA implementation guidelines
  • verify data stream designs
  • configure the right network connection
  • for each implementation tool, choose the right level of service management
  • for each component, evaluate the locking level and exit conditions
  • monitor and forecast possible costs
  • Adopt a security model that allows Hybrid Cloud security to be as transparent as possible.
  • Extend data governance in the Cloud

Where is Air France today? 

It’s clear that the COVID-19 crisis has completely changed the aviation sector. Every day, Air France has to take the time to understand new passenger behavior and adapt flight schedules in real time, in line with the travel restrictions put in place by various governments. By the end of summer 2020, Air France will have served nearly 170 destinations, or 85% of their regular network. 

Air France’s data architecture has therefore been a key catalyst for the recovery of their airlines:

“a huge thanks to our business users (data scientists) who every day try to optimize services in real time so that they can understand how passengers are behaving in the midst of a health crisis. Even if we are working on artificial intelligence, the human factor is still an essential resource in the success of a data strategy”. 

What is the difference between a Data Owner and a Data Steward?

by Zeenea Software | Oct 12, 2020 | Data governance, Data inspiration


What is the difference between a data steward and a data owner? This question comes up over and over again!

There are many different definitions associated with data management and data governance on the internet. Moreover, depending on the company, their definitions and responsibilities can vary significantly. To try and clarify the situation, we’ve written this article to shed light on these two profiles and establish a potential complementarity.

Above all, we firmly believe that there is no ideal or standard framework. These definitions are specific to each company because of its organization, culture, and “legacy”.

Data owners and data stewards: two roles with different maturities

The recent appointment of CDOs was largely driven by the digital transformations undertaken in recent years: mastering the data life cycle from its collection to its value creation. To try to achieve this, a simple – yet complex – objective has emerged: first and foremost, to know the company’s information assets, which are all too often siloed. 

Thus, the first step for many CDOs was to inventory these assets. Their mission was to document them from a business perspective, along with the processes that transform them and the technical resources used to exploit them.

This founding principle of data governance was also raised by Christina Poirson, CDO of Société Générale, during a roundtable discussion at Big Data Paris 2020. She explained the importance of knowing your data environment and the associated risks in order to ultimately create value. During her presentation, Christina Poirson developed the role of the Data Owner and the challenge of sharing data knowledge. Data Owners, who are part of the business, are responsible for defining their datasets as well as their uses and their quality level, while data users can work with the data without having to go back to the Data Owner each time:

“The data in our company belongs either to the customer or to the whole company, but not to a particular BU or department. We manage to create value from the moment the data is shared”.  

It is evident that the role of “Data Owner” has been present in organizations longer than that of the “Data Steward”. Data Owners are stakeholders in the collection, accessibility and quality of datasets. We define a Data Owner as the person ultimately in charge of a given set of data. For example, a marketing manager can take on this role for customer data, with the responsibility and duty to control its collection, protection and uses.

More recently, the democratization of data stewardship has led to the creation of dedicated positions in organizations. Unlike the Data Owner, the Data Steward is more broadly involved in a challenge that has been regaining popularity for some time now: data governance.

In our articles “Who are data stewards?” and “The Data Steward’s multiple facets“, we go further into this profile, which is involved in referencing and documenting enterprise assets (we are talking about data, of course!) to simplify their comprehension and use.

Data steward and data owners: two complementary roles?

In reality, companies do not always have the means to open new positions for Data Stewards. In an ideal organization, the complementarity of these profiles could tend towards the following:

A data owner is responsible for the data within their perimeter in terms of its collection, protection and quality. The data steward would then be responsible for referencing and aggregating the information, definitions and any other business needs to simplify the discovery and understanding of these assets.

Let’s take the example of the level of quality of a dataset. If a data quality problem occurs, you would expect the Data Steward to point out the problems encountered by its customers to the Data Owner, who is then responsible for investigating and offering corrective measures.

To illustrate this complementarity, Chafika Chettaoui, CDO at Suez – also present at the Big Data Paris 2020 roundtable – confirms that they added another role in their organization: the Data Steward. According to her and Suez, the Data Steward is the person who makes sure that the data flows work. She explains:

“The Data Steward is the person who will lead the so-called Data Producers (the people who collect the data in the systems), make sure they are well trained and understand the quality and context of the data to create their reporting and analysis dashboards. In short, it’s a business profile, but with real data valence and an understanding of data and its value”. 

To conclude, there are two notions regarding the differentiation of the two roles: the Data Owner is “accountable for data” while the Data Steward is “responsible for” the day-to-day data activity.

How to deploy effective data governance, adopted by everyone

by Zeenea Software | Oct 8, 2020 | Data governance, Data inspiration


It is no secret that the recent global pandemic has completely changed the way people do business. In March 2020, France was placed in total lockdown, and many companies had to adapt to new ways of working, whether by introducing remote working, changing the production agenda, or even shutting down operations completely. This health crisis forced companies to ask themselves: how are we going to deal with the financial, technological and compliance risks following COVID-19?

At Big Data Paris 2020, we had the pleasure of attending the roundtable “How to deploy effective data governance that is adopted by everyone”, led by Christina Poirson, CDO of Société Générale, Chafika Chettaoui, CDO of the Suez Group, and Elias Baltassis, Partner & Director, Data & Analytics at the Boston Consulting Group. In this roundtable of approximately 35 minutes, the three data experts explain the importance and best practices of implementing data governance.

 

First steps to implementing data governance

The impact of Covid-19 has underlined the essential challenge of knowing, collecting, preserving and transmitting quality data. So, has the lockdown pushed companies to put a data governance strategy in place? Elias Baltassis’s answer to this first question confirms the strong increase in demand for data governance in France:

“The lockdown certainly accelerated the demand for implementing data governance! It was already a topic for the majority of these companies long before the lockdown, but the health crisis has of course pushed companies to strengthen the security and reliability of their data assets”.

So what is the objective of data governance? And where do you start? Elias explains that the first thing to do is to diagnose the data assets in the enterprise, and identify the sticking points: “Identify the places in the enterprise where there is a loss of value because of poor data quality. This is important because data governance can easily drift into a bureaucratic exercise, which is why you should always keep as a “guide” the value created for the organization, which translates into better data accessibility, better quality, etc”. 

Once the diagnosis is done and the sources of value are identified, Elias explains that there are four methodological steps to follow:

  1. Know your company’s data, its structure, and who owns it (via a data glossary for example),
  2. Set up a data policy targeted at the points of friction,
  3. Choose the right tool to deploy these policies across the enterprise
  4. Establish a data culture within the organization, starting with hiring data-driven people, such as Chief Data Officers. 

The above methodology is therefore essential before starting any data governance project which, according to Elias, can be implemented fairly quickly: “Data governance can be implemented quickly, but increasing data quality will take more or less time, depending on the complexity of the business; a company working with one country will take less time than a company working with several countries in Europe for example”. 


The role of the Chief Data Officer in the implementation of data governance

Christina Poirson explains that for her and Société Générale, data governance played a very important role during this exceptional period: “Fortunately, we had data governance in place that ensured the quality and protection of data for our professional and private customers during lockdown. We realized the importance of the pairing of digitization and data, which has been vital not only for our work during the crisis, but also for tomorrow’s business.”

So how did a company as large and established as Société Générale, with thousands of data records, implement a new data governance strategy? Christina explains that data at Société Générale is not a recent topic. Indeed, since its very beginnings, the firm has been asking clients for information in order to be able to advise them on, for example, what type of loan to put in place.

However, Société Générale’s CDO tells us that today, with digitization, there are new types, formats and volumes of data. This confirms what Elias Baltassis said just before: “The implementation of a data office and Chief Data Officers was one of the first steps in the company’s data strategy. Our role is to maximize the value of data while respecting the protection of sensitive data, which is very important in the banking world!”.

To do this, Christina explains that Société Générale supports this strategy throughout the data’s life cycle: from its creation to its end, including its qualification, protection, use, anonymization and destruction.

On the other hand, Chafika Chettaoui, CDO of the Suez group, explains that she sees herself as a conductor:

“What Suez lacked was a conductor who had to organize how IT can meet the business objectives. Today, with the increasing amount of data, the CDO has to be the conductor for the IT, business, and even HR and communication departments, because data and digital transformation is above all a human transformation. They have to be the organizer to ensure the quality and accessibility of the data as well as its analysis”.

But above all, the two speakers agreed that a CDO has two main missions:

  • Implementing standards on data quality and protection,
  • Breaking down data silos by creating a common language around data, or data fluency, in all parts of the enterprise.

Data acculturation in the enterprise

We don’t need to remind you that building a data culture within the company is essential to creating value from its data. Christina Poirson explains that data acculturation was quite a long process for Société Générale:

“To implement data culture, we went through what we call “data mapping” at all levels of the managerial structure, from top management to employees. We also had to set up coaching sessions, coding training and other dedicated awareness sessions. We have also made available all the SG Group’s use cases in a catalog of ideas so that every company in the group can be inspired: it’s a library of use cases that is there to inspire people”. 

She goes on to explain that they have other ways of acculturating employees at Société Générale:

  • Setting up a library of algorithms to reuse what has already been set up
  • Implementing specific tools to assess whether the data complies with the regulations.
  • Making data accessible through a group data catalog

Data acculturation was therefore not an easy task for Société Générale. But, Christina remains positive and tells us a little analogy: “Data is like water, CIOs are the pipes, and businesses make demands related to water. There must therefore be a symbiosis between the IT, CIO and the business departments”. 

Chafika Chettaoui adds: “Indeed, it is imperative to work with and for the business. Our job is to appoint people in the business units who will be responsible for their data.  We have to give the responsibility back to everyone: the IT for building the house, and the business for what we put inside. By putting this balance in place, there are back and forth exchanges and it is not just the IT’s responsibility”.


Roles in Data Governance

Although roles and responsibilities vary from company to company, in this roundtable discussion, the two Chief Data Officers explain how role allocation works within their data strategy. 

At Société Générale they have fairly strong convictions. First of all, they set up “Data Owners”, who are part of the business and are responsible for:

  • the definition of their data
  • their main uses
  • their associated quality level

On the other hand, if a data user wants to use that data, they don’t have to ask permission from the Data Owner, otherwise the whole system breaks down. As a result, Société Générale has put in place measures to check compliance with rules and regulations without calling the Data Owner into question: “the data at Société Générale belongs either to the customer or to the whole company, but not to a particular BU or department. We manage to create value from the moment the data is shared”.

At Suez, Chafika Chettaoui confirms that they have the same definition of Data Owner, but she adds another role, that of the Data Steward. At Suez, the Data Steward is the one who is on site, making sure that the data flows work.

She explains: “The Data Steward is someone who will animate the so-called Data Producers (the people who collect the data in the systems), make sure they are well trained and understand the quality of the data, as well as be the one who will hold the dashboards and analyze if there are any inconsistencies. It’s someone in the business, but with a real data valency and an understanding of the data and its value”. 

What are the key best practices for implementing data governance?

What should never be forgotten when implementing data governance is that data does not belong to one part of the organization but must be shared with all. It is therefore imperative to standardize the data. To do this, Christina Poirson explains the importance of a data dictionary: “by adding a data dictionary that includes the name, definition, data owner, and quality level of the data, you already have a first brick in your governance”.
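To make that first brick concrete, here is a minimal sketch of what a single data dictionary entry could look like, modeled as a Python dataclass with the fields Christina Poirson lists (name, definition, data owner, quality level); the example values are invented.

```python
# Minimal sketch of a data dictionary entry with the fields mentioned above.
# Example values are invented for illustration.
from dataclasses import dataclass

@dataclass
class DataDictionaryEntry:
    name: str           # business name of the data element
    definition: str     # shared, business-level definition
    data_owner: str     # accountable business role or team
    quality_level: str  # agreed quality rating, e.g. "gold" / "silver" / "bronze"

entry = DataDictionaryEntry(
    name="customer_email",
    definition="Primary email address used to contact the customer",
    data_owner="Retail Marketing",
    quality_level="gold",
)
```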

As mentioned above, the second good practice in data governance is to define roles and responsibilities around data. In addition to a Data Owner or Data Steward, it is essential to define a series of roles to accompany each key stage in the use of the data. Some of these roles can be:

  • Data Quality Manager
  • Data Protection Analyst
  • Data Usage Analyst
  • Data Analyst
  • Data Scientist
  • Data Protection Officer
  • etc.

As a final best practice recommendation for successful data governance, Christina Poirson explains the importance of knowing your data environment, as well as your risk appetite, the rules of each business unit, industry and service, to truly facilitate data accessibility and compliance.

…and the mistakes to avoid?

To end the roundtable, Chafika Chettaoui talks about the mistakes to avoid in order to succeed in governance. According to her, we must not start with technology. Even if, of course, technology and expertise are essential to implementing data governance, it is very important to focus first on the culture of the company. 

She states: “Establishing a data culture with training is essential. On the one hand, we have to break the myth that data and AI are “magical”, and on the other, break the myth of the “intuition” of some experts, by explaining the importance of data in the enterprise. The cultural aspect is key, at every level of the organization.”


Retail 4.0: How Monoprix migrated to the Cloud

by Zeenea Software | Oct 1, 2020 | Data inspiration

monoprix

An omni-channel leader with a presence in more than 250 cities in France, the French retail chain Monoprix offers varied, innovative products and services every day with a single objective in mind: “making the good and the beautiful accessible to all”.

The company’s stores combine food retailing with hardware, clothing, household items and gifts. To give some stats on the firm, in 2020 Monoprix had:

  • Nearly 590 stores in France,
  • 22,000 employees,
  • Approximately 100 stores internationally,
  • 800,000 customers per day,
  • 466 local partner producers.

With close to one million customers in store and more than 1.5 million users on their website each day, it’s no secret that Monoprix has enormous amounts of data to manage! Whether it comes from loyalty cards, customer receipts or online delivery orders, the company has to manage a huge volume of data in a variety of formats.

At Big Data Paris 2020, Damien Pichot, Director of Operations and Merchandise Flows at Monoprix, shared with us the company’s journey in implementing a data-driven culture thanks to the Cloud.  

big-data-paris-monoprix-1

Big Data at Monoprix

In response to the amount of data that was coming into Monoprix’s data systems every day, the company had implemented various technologies: an on-premise data warehouse for structured data and a data lake in the cloud, which was used to manage the semi-structured data coming from their websites. In addition, a lot of data also comes from partners or service providers, in the context of information exchanges and acquisitions.

Despite the fact that the architecture had been working well and fulfilling its role for many years, it was beginning to show its limitations and weaknesses: 

“To illustrate, every Monday our teams gather and analyze the profits made and everything that happened the previous week. As time went by, we realized that each week the number of users logging in to our information systems was increasing and we were reaching saturation. In fact, some of our employees had to get up at 5am to launch their queries, only to retrieve the results in the late morning or early afternoon,” explains Damien Pichot.

Another negative aspect of the company’s IT structure concerned its business users, and more specifically the marketing teams. They were beginning to develop analytical environments outside the control of the IT department, thus creating what is known as “shadow IT”. The Monoprix data teams were obviously dissatisfied because they had no supervision over these business projects.

“The IT department represented within Monoprix was therefore not at the service of the business and did not meet its expectations”. 

After consulting the IT committee, Monoprix decided to break off their contract with their large on-premise structure. The new solution had to answer four questions:

  • Does the solution allow business users to be autonomous? 
  • Is the service efficient / resilient?
  • Will the solution lower operating costs?
  • Will users have access to a single platform that will enable them to extract all the data from the data warehouse and the data lake in order to meet business, decision-making, machine learning and data science challenges? 

After careful consideration, Monoprix finally decided to migrate everything to the Cloud! “Even if we had opted for another big on-premise solution, we would have faced the same problems at some point. We might have gained two years, but that’s not viable in the long term.” 

Monoprix’s journey to the Cloud

Monoprix started this new adventure in the Cloud with Snowflake! Only a few months after its implementation, Monoprix quickly saw improvements compared to their previous architecture. Snowflake was also able to meet their needs in terms of data sharing, which is something they were struggling to do before, as well as robustness and data availability.

The first steps

During the conference, Damien Pichot explained that it was not easy to convince Monoprix teams that a migration to the Cloud was secure. They were reassured by the implementation of Snowflake, which provides a level of security as high as that of the pharmaceutical and banking industries in the United States.

To give themselves all the means possible to make this project a success, Monoprix decided to create a dedicated team, made up of numerous people such as project managers, integrators, managers of specific applications, etc. The official launch of the project took place in March 2019. 

Damien Pichot had organized a kickoff, inviting all the company’s business lines: “I didn’t want it to be an IT project but a company project, I am convinced that this project should be driven by the business lines and for the business lines”. 

Damien tells us that the day before the project was launched, he had trouble sleeping! Indeed, Monoprix is the first French company to embark on the total migration of an on-premise data warehouse to the Cloud! 

big-data-paris-monoprix-2

The challenges of the project 

The migration was carried out iteratively because of a heavy technical legacy: everything needed to be reintegrated into a technology as modern as Snowflake. Indeed, Monoprix had big problems with its connectors: “We thought at the time that the hardest part of the project would be to automate the data processing. But the most complicated part was to replatform our ETLs in a new environment. So we went from a 12-month project to a 15-month project.”

The new architecture 

Monoprix therefore handles two types of data: structured and semi-structured. The structured data comes from their classic data warehouse, which contains data from the Supply Chain, Marketing, customer transactions, etc., and the semi-structured data comes from website-related events. All of this now converges via ETLs into a single platform running on Azure with Snowflake. “Thanks to this new architecture in the Cloud, we can query the data we want via different applications,” says Damien.
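As a purely illustrative sketch of what such an architecture enables (the connection parameters, table names and event schema below are hypothetical, not Monoprix’s actual model), structured warehouse tables and semi-structured web events stored in a VARIANT column can be queried together in Snowflake from Python:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials: replace with your own account details
conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RETAIL",
    schema="PUBLIC",
)

# Join a structured transactions table with semi-structured web events
# (raw JSON stored in a VARIANT column) in a single query.
query = """
    SELECT t.store_id,
           SUM(t.amount)                                  AS in_store_revenue,
           COUNT(DISTINCT e.raw_event:session_id::string) AS web_sessions
    FROM transactions t
    LEFT JOIN web_events e
           ON e.raw_event:store_id::number = t.store_id
    GROUP BY t.store_id
"""

cur = conn.cursor()
try:
    cur.execute(query)
    for store_id, revenue, sessions in cur.fetchall():
        print(store_id, revenue, sessions)
finally:
    cur.close()
    conn.close()
```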

big-data-paris-monoprix-3

Conclusion: Monoprix is better in the Cloud

Since May 2020, Monoprix has been managing its data in the Cloud, and it has been “life changing”. On the business side, there is less latency: queries that used to take hours now take minutes (and employees finally get to sleep in the morning!). Business analyses are also much deeper, with the possibility of running analyses over five years of history, which was not possible with the old IT structure. But the most important point is the ability to easily share data with the firm’s partners and service providers.

Damien proudly explains: “With the old structure, our marketing teams took 15 days to prepare the data and had to send thousands of files to our providers; today they connect in a few minutes and fetch the data themselves, without us having to intervene. That alone is a direct ROI.”


The Most used Data Management Solutions in 2020

by Zeenea Software | Jul 27, 2020 | Data inspiration

data-management-solutions-2020

It is no secret, after various articles from Gartner and other well-known data and analytics consulting firms, that data catalogs are an essential data management solution. Combining artificial intelligence and human skills, data catalogs provide a next-generation workspace for data teams to find, understand and collaborate on their data assets and usages.

In this article, we focus on the most-used data management solutions with which your data catalog software can successfully collaborate. These vendors have been repeatedly cited by Gartner and are used by many enterprises worldwide. We list the 5 main vendors in each of the following categories:

  • Data Integration
  • Data Preparation
  • Data Visualization
  • Data Governance

Let’s discover this list:

1. Data Integration Vendors

Data integration is the process of combining data from different sources, typically for analysis, business intelligence, reporting, or loading into an application. Data integration tools should be designed to transform, map, and clean data. They can also be integrated with data governance and data quality tools.
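As a minimal, hypothetical illustration of the combine, map and clean steps described above (the file names and columns are invented), a lightweight integration job in Python could look like this:

```python
import pandas as pd

# Extract: two sources describing the same customers with different conventions
crm = pd.read_csv("crm_customers.csv")      # columns: customer_id, email, country
orders = pd.read_csv("erp_orders.csv")      # columns: cust_id, order_total, order_date

# Transform: map field names, clean values, enforce types
orders = orders.rename(columns={"cust_id": "customer_id"})
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
crm["email"] = crm["email"].str.strip().str.lower()

# Combine: one integrated view, ready for analysis or loading into a target system
integrated = crm.merge(orders, on="customer_id", how="inner")
integrated.to_parquet("integrated_customers_orders.parquet", index=False)
```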

The top data integration vendors of 2020 include:

logo-informatica

Informatica Data Integration Hub

Informatica’s data integration tools portfolio includes both on-prem and cloud deployments. The vendor combines advanced hybrid integration and governance functionalities with self-service business access for various analytic functions. Informatica touts strong interoperability between its growing list of data management software products.

IBM-logo

IBM Infosphere Information Server

IBM offers several distinct data integration tools also both on-prem and cloud deployments, and for virtually every enterprise use case. Its on-prem data integration suite features tools for traditional and modern integration requirements. IBM also offers a variety of prebuilt functions and connectors. The mega-vendor’s cloud integration product is widely considered one of the best in the marketplace, and additional functionality is coming in the months ahead.

SAS_logo

SAS Data Management

SAS is one of the largest independent vendors in the data integration tools market. The provider offers its core capabilities via SAS Data Management, where data integration and quality tools are interwoven. It includes query language support, metadata integration, push-down database processing, and various optimization capabilities.

SAP-Logo

SAP Data Services

SAP provides on-prem and cloud integration functionality through two main channels. Traditional capabilities are offered through SAP Data Services, a data management platform that provides capabilities for data integration, quality, and cleansing. Integration Platform as a Service features are available through the SAP Cloud Platform.

oracle-logo

Oracle Data Integration Cloud Service

Oracle offers a full spectrum of data integration tools for traditional use cases as well as modern ones, in both on-prem and cloud deployments. The company’s product portfolio features technologies and services that enable organizations to manage full-lifecycle data movement and enrichment.

2. Data Preparation Vendors

As defined in our last article, data preparation is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics and machine learning applications. In other words, it is the process of cleaning and transforming raw data prior to analysis.
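As a small, hedged illustration of those cleaning and transformation steps (the file and column names are made up), raw data can be prepared in Python with pandas before being handed to analysts or a model:

```python
import pandas as pd

raw = pd.read_csv("raw_survey_responses.csv")   # hypothetical raw extract

prepared = (
    raw
    .drop_duplicates()                          # remove duplicate rows
    .dropna(subset=["respondent_id"])           # drop rows missing the key
    .assign(
        country=lambda df: df["country"].str.strip().str.title(),  # standardize labels
        age=lambda df: pd.to_numeric(df["age"], errors="coerce"),  # enforce numeric type
    )
)

# Keep only plausible values before analysis
prepared = prepared[prepared["age"].between(0, 120)]
prepared.to_csv("prepared_survey_responses.csv", index=False)
```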

The top data preparation vendors of 2020 include:

Alteryx-logo

Alteryx Designer

Alteryx Designer features an intuitive user interface that enables users to connect and cleanse data from data warehouses, cloud applications, spreadsheets, and other sources. Users can leverage data quality, integration and transformation features as well. 

talend-logo

Talend Data Preparation

Talend Data Preparation utilizes machine learning algorithms for standardization, cleansing, and pattern recognition. The product also provides automated recommendations to guide users through the data preparation process. 

IBM-logo

IBM Watson Studio

Together with IBM Watson Machine Learning, IBM Watson Studio is a leading data science and machine learning platform built from the ground up for an AI-powered business. It helps enterprises scale data science operations across the lifecycle–simplifying the process of experimentation to deployment, speeding up data exploration and preparation, as well as model development and training.


Tableau Prep

Tableau Prep empowers more people to get to analysis faster by helping them quickly and confidently combine, shape, and clean their data. A direct and visual experience gives customers a deeper understanding of their data and smart features make data preparation simple.

Trifacta-Logo-Vert-RGB-2016-e1473200499615

Trifacta

Trifacta has been ranked as the top vendor in every analyst report published on data preparation to date. A self-service data preparation tool, Trifacta empowers all users, technical or non-technical, to clean & prepare their data efficiently. 

3. Data Visualization Vendors

Data visualization is the graphical representation of data. It is used to help people understand the context and significance of their information by showing patterns, trends and correlations that may be difficult to spot in plain text or tables.
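As a tiny, generic illustration (the figures are synthetic and not tied to any vendor below), a few lines of Python with matplotlib are enough to reveal a trend that would be hard to spot in a raw table:

```python
import matplotlib.pyplot as plt

# Synthetic monthly revenue figures; in practice these would come from your warehouse
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 171, 190]

plt.plot(months, revenue, marker="o")
plt.title("Monthly revenue trend")
plt.xlabel("Month")
plt.ylabel("Revenue (k€)")
plt.tight_layout()
plt.show()
```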

The top data visualization vendors of 2020 include:


Tableau

Tableau is a data visualization tool that can be used by data analysts, scientists, statisticians, etc. to visualize data and form a clear picture based on their analysis. Tableau is known for being able to take in data and produce the required visualizations in a very short time, while providing a high level of security and a commitment to handle security issues as soon as they arise or are reported by users.

Logo_Color_Looker

Looker

Looker’s data visualization can go deep into the data and analyze it to obtain useful insights. It provides real-time dashboards for more in-depth analysis, so that businesses can make instant decisions based on the visualizations obtained. Looker also provides connections to Redshift, Snowflake and BigQuery, and supports more than 50 SQL dialects, so you can connect to multiple databases without any issues.

zoho-analytics-logo

Zoho Analytics

Zoho Analytics helps you create great-looking data visualizations from your data in a few minutes. You can pull data from multiple sources and mesh it together to create multidimensional visualizations that let you view your business data across departments. If you have any questions, you can ask Zia, a smart assistant built using artificial intelligence, machine learning, and natural language processing.

sisense-logo

Sisense

Sisense provides various tools that allow data analysts to simplify complex data and obtain insights for their organization and external stakeholders. The solution strives to provide data analytics tools to business teams and data analysts so that they can help make their companies the data-driven companies of the future.

IBM-logo

IBM Cognos Analytics

IBM Cognos Analytics is an Artificial Intelligence-based business intelligence platform that supports data analytics. You can visualize as well as analyze your data and share actionable insights with anyone in your organization. Even if you have limited or no knowledge about data analytics, you can use IBM Cognos Analytics easily as it interprets the data for you and presents you with actionable insights in plain language.

4. Data Governance Vendors

We like to define data governance as an exercise of authority over decision-making power (planning, surveillance, and enforcement of rules) and the controls on data management.

In other words, it allows the clear documentation of the different roles and responsibilities around data as well as determining the procedures and the tools supporting data management within an organization.

cloudera_logo_darkorange

Cloudera Data Platform

Cloudera Data Platform (CDP) combines the best of Hortonworks’ and Cloudera’s technologies to deliver the industry’s first enterprise data cloud. CDP delivers powerful self-service analytics across hybrid and multi-cloud environments, along with sophisticated and granular security and governance policies that IT and data leaders demand.

logo-stealthbits

Stealthbits

Stealthbits’ Data Access Governance solution discovers where your data lives and then classifies, monitors, and remediates the conditions that make managing data access so difficult in the first place. The result is effective governance that promotes security, compliance, and operational efficiency.

Varonis_Logo_FullColor_RGB

Varonis

Varonis gives you the enterprise-wide visibility you need for effective discovery, auditing, and compliance reporting across a wide variety of regulatory standards. It quickly and accurately classifies sensitive, regulated information stored in on-premises and cloud data stores. Their classification engine prioritizes scans based on risk & exposure to give you actionable results quickly, no matter how much data you have.

logo-informatica-12

Informatica

Informatica provides a fast path to compliance and data governance that can be implemented on-premises or in the cloud. It offers strong visualization of data lineage and history, master data dashboards for proactive monitoring of data quality, and dynamic masking for data security. It also provides functionality to detect and protect sensitive customer data, manage GDPR data risks, and ensure contact information is current, accurate and complete.

And finally…

zeenea logo

Zeenea

Our data catalog centralizes all data knowledge in a single, easy-to-use interface. Whether metadata is automatically imported, generated, or added by administrators, data specialists are able to enrich their data asset documentation directly within our tool.

Give meaning to your data thanks to metadata!

If you are interested in getting more information, getting a free personalized demo, or just want to say hi, do not hesitate to contact our team who will get back to you as soon as we’ve received your request 🙂

get a free demo

Gartner’s top Data & Analytics trends in 2020

by Zeenea Software | Jul 16, 2020 | Data inspiration, Metadata Management

data-trends

The recent global pandemic has left many organizations in an uncertain and fragile state. It is therefore a fundamental requirement for enterprises to keep pace with data and analytics trends in order to bounce back from the crisis and gain competitive advantage.

From crisis to opportunity, the role of data and analytics is expanding and becoming more strategic and critical. Society in general is becoming increasingly digital, complex and global, with ever-growing competition and empowered customers. Massive disruption, crisis and the ensuing economic downturn are forcing companies to respond to previously unimaginable demands to optimize resources, reinvent processes, and rethink products, business models and even their very purpose.

It is therefore obvious that Data & Analytics is central for enterprises navigating their way out of the devastating effects of this crisis; however, the lack of trust in and access to data has never been a greater challenge.

Success at scale for maximum business impact with data & analytics depends more than ever on building a foundation of trust, security, governance, and accountability.

In this article, we share the current Data & Analytics trends to help your business thrive:

 

#1 – The use of new AI techniques

By the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving 5X increase in streaming data and analytics infrastructures.

Within the current context, AI techniques such as machine learning, optimisation and natural language processing are providing vital insights and predictions about the spread of the virus and the effectiveness and impact of countermeasures. With the more commercial use of AI, organizations are discovering new and smarter techniques, including reinforcement learning and distributed learning, interpretable systems, and efficient infrastructures that handle their own complex business situations. 

 

#2 – Fewer Dashboards

By 2025, data stories will be the most widespread way of consuming analytics, and 75% of stories will be automatically generated using augmented analytics techniques.

Today, business employees struggle to know which insights to act on because business intelligence platforms are not contextualized, easily interpretable or actionable by the majority of users. Visual analytics and exploration will be replaced by more automated and customized experiences in the form of dynamic data stories. As a result of this shift to more dynamic, in-context data stories, the percentage of time spent on predefined dashboards will decline!

 

#3 – Decision intelligence

By 2023, more than 33% of large organizations will have analysts practicing decision intelligence including decision modeling.

A brief definition of decision intelligence is that it is a practical domain that frames a wide range of decision-making techniques and connects them to all critical parts of people, processes and technologies. It provides a framework that brings traditional and advanced disciplines together to design, model, execute and monitor decision models and processes in the context of business outcomes.

The use of intelligent decision making will bring together decision management and techniques such as descriptive, diagnostic, predictive and prescriptive analytics.

 

#4 – Augmented Data Management: Metadata is the new black

By 2023, organizations utilizing active metadata, machine learning and data fabrics to dynamically connect, optimize and automate data management processes will reduce time to integrated data delivery by 30%.

The combination of colossal data volumes, data trust issues and an ever-increasing diversity of data formats is accelerating the demand for automated data management. In response, metadata analytics offers a new way of augmenting data management tasks. It is no secret that organizations need to easily know what data they have, what it means, how it delivers value, and whether it can be trusted. Metadata will move from a passive state to a highly active one: active utilization leverages cataloging and automatic data discovery by interpreting use cases, and implies the taxonomies and ontologies that are crucial to data management.

Through an augmented data catalog, users can improve data inventorying efforts by significantly easing the otherwise cumbersome tasks of finding, tagging, annotating and sharing metadata.

 

#5 – Moving to the Cloud

By 2022, public cloud services will be essential for 90% of data and analytics innovation.

As data management accelerates its journey to the cloud, so will data & analytics disciplines. Cloud environments enable a more agile, fluid and diverse ecosystem that accelerates innovation in response to changing business needs, in ways not readily available with on-premises solutions. They also provide opportunities for cost optimization. Expect “cloud-first” offerings to eventually become “cloud-only” ones.

 

Gartner clients can read more in the report “Top 10 Trends in Data and Analytics, 2020.”

 


The must-have roles for the perfect data & analytics team

by Zeenea Software | Jul 8, 2020 | Data inspiration

data-team

As has been repeatedly said, digital business cannot happen without data and analytics at its core. Technology can be a point of failure if not handled properly, but it is often not the most important roadblock to progress. In Gartner’s annual Chief Data Officer survey, the top roadblocks to success were human factors – culture, resources, data literacy and skills. A similar pattern emerges from another study, Gartner’s CEO and Senior Business Executive Survey, where “Talent Management” was listed as the “number one organizational competency to be developed or improved.”

In this article, we would like to focus on the key data and analytics roles & leaders that are essential for enterprises seeking a data-driven organization.

Support roles

Chief Data Officer

The Chief Data Officer, or CDO, is a senior executive responsible for enhancing the quality, reliability and access of data. They are also in charge of creating value from their data assets and from their data ecosystem in general. Through data exploitation and by enabling all forms of business outcomes through analytics, the CDO can produce more value with their enterprise data. There are many variations of the title such as CAO (Chief Analytics Officer), CDAO (Chief Data & Analytics Officer), CDIO (Chief Digital Information Officer), etc.

See more in our article “What is a Chief Data Officer?”

 

Data & Analytics Manager

As the title implies, the Data & Analytics Manager manages the data & analytics center and is responsible for its delivery throughout the entire organization. A key contributor to the strategy and vision of the data & analytics department, they build the roadmap and own budget and resource planning. Besides measuring the performance of their analytics team, they also track the contribution of data analytics to business objectives.

 

Data Architect

The Data Architect, also referred to as the Information Architect, strengthens the impact of, and provides recommendations on, business information. They make information available and shared across the company by showing how information assets drive business outcomes. They “own” the data models, understand the impact of various data and analytics scenarios (such as data science or machine learning) on the overall IT architecture, and work closely with the business departments.

Analysts

There isn’t a single type of analyst, but rather a spectrum of analysts. Their roles depend on their use cases and vary by responsibilities and skill requirements. Data analysts have a foundational understanding of statistical analytics; they are, or work closely with, domain experts to support business areas, processes, or functions.

 

Project Manager

The project manager is responsible for the successful implementation of all projects in the enterprise portfolio. They plan, execute and deliver projects in accordance with business priorities. Throughout the project’s lifecycle, the project manager tracks their project’s status and manages their teams to limit any risks. They are the primary point of contact for data and analytics initiatives.

 

Data Roles

Data Engineer

Data engineering involves collaboration across business and IT units; it is the practice of making the appropriate data accessible and available to various data consumers (data scientists, data analysts, etc.). Data engineers are primarily responsible for building, managing and operationalizing data pipelines in support of data and analytics use cases. They are also responsible for tedious tasks such as curating datasets created by non-technical users (through self-service data preparation tools, for example).

Without data engineers, data & analytics initiatives are more costly, take longer to deploy, and are prone to data quality and availability problems.

Data Steward

Data stewards are the first point of reference for data in the enterprise and serve as the entry point to access it. They must ensure the proper documentation of data and facilitate its availability to users such as data scientists or project managers. Their communication skills enable them to identify those who manage and know the data, and to collect the associated information in order to centralize it and perpetuate this knowledge within the enterprise. In short, data stewards provide metadata: a structured set of information describing datasets. They turn abstract data into concrete assets for the business.

See here for more information on Data Stewards

Analytics roles

 

Data Scientists

A data scientist is responsible for modeling business processes and discovering insights using statistical algorithms and visualization techniques. They typically have an advanced degree in computer science, statistics or a related field. Data scientists contribute to building and developing the enterprise’s data infrastructure and support the organization with insights and analysis for better decision making. They predict or classify information to develop better action models.

 

Citizen Data Scientist

Contrary to data scientists, a “Citizen Data Scientist” is not a job title. They are “power business users” who can perform both simple and sophisticated analytical tasks. They can execute a variety of data science tasks, supported by augmented analytics tools for data discovery, data preparation, and model deployment. Potential citizen data scientists will vary based on their skills and interest in data science and machine learning.

See here for more information on citizen data scientists

 

AI / ML Developer

Artificial intelligence / Machine learning developers are increasingly responsible for enriching applications through the use of machine learning or other AI technologies such as natural language processing, optimization or image recognition. They embed, integrate and deploy AI models that are developed by data scientists or other AI experts either offered by service providers or developed by themselves. Other key skills include identifying and connecting potential data assets, data quality, data preparation and how these are used for model training execution.

 

Conclusion

The growing importance and strategic significance of data and analytics is creating new challenges for organizations and their data and analytics leaders. Some traditional IT roles are being disrupted by “citizen” roles performed by nontechnical business users. Other new hybrid roles are emerging that cut across functions and departments, and blend IT and business skills.

By putting together these must-have roles, your enterprise is a step closer to becoming data-driven.


Data management is embracing Cloud technologies

by Zeenea Software | Jun 29, 2020 | Data inspiration

Contemporary business initiatives, such as digital transformation, are facing an explosion of data volume and diversity. In this context, organizations are looking for more flexibility and agility in their data management.

This is where Cloud strategies come in…

Data management definition

Before we begin, let’s define what data management is. Data Management, as described by TechTarget is “the process of ingesting, storing, organizing and maintaining the data created and collected by an organization”. Data management is a crucial part of an enterprise’s business and IT strategy, and provides analytical help that drives overall decision-making by executives.

As mentioned above, data is seen as a corporate asset that can be used to make better and faster decisions, improve marketing campaigns, increase overall revenue and profits, and above all: innovate. As a result, organizations are seeing cloud technologies as a way to improve their data initiatives.

Cloud strategies are the new black in data management disciplines

It is an undeniable fact that Cloud service providers are becoming the new default platform for database management. This phenomenon provides data management teams with great advantages:

  • Cost-effective deployment: greater flexibility and more rapid configuration,
  • Consumption-based spending: pay for what you use and do not over-provision,
  • Easy maintenance: better control over the associated costs and investments.

With this in mind, there is no doubt that data leaders perceive the cloud as a less expensive technology, which drives this choice even more.

Data leaders will embrace the cloud as an integral part of their IT landscape in the coming months and years. However, we strongly believe that the rate at which organizations migrate to the cloud will differ by organization size. Small and midsize organizations will migrate more quickly, while larger organizations will take months, even years, to migrate.

Thus, the Cloud is going to become a default option for all data management technologies. Many strategies are emerging, involving various deployment types and approaches. We have identified 3 main strategies:

 

  • Hybrid Cloud: Made up of two or more separate Cloud infrastructures that may be private or public and that remain single entities
  • Multicloud: Use more than one cloud service provider infrastructure as well as on-premises solutions.
  • Intercloud: Where data is integrated or exchanged between cloud service providers as part of a logical application deployment.
Cloud data-management (1)

 

The Cloud is also seen as an opportunity for data analytics leaders

The increased adoption of cloud deployments for data management has important implications for data and analytics strategies. As data moves to the cloud, the data and analytics applications that use it must follow.

Indeed, the emphasis on speed of value delivery has made cloud technologies the first choice for new data management solution development by vendors, and for deployment by enterprises. Thus, enterprises and data leaders are choosing next-gen data management solutions! They will migrate their assets by selecting applications that fit future cloud strategies and by preparing their teams and budgets for the challenges ahead.

Those data leaders who use analytics, business intelligence (BI) and data science solutions are seeing Cloud solutions as greater opportunities to:

  • Use a cloud sandbox environment for trial purposes (onboarding, usage, connectivity) and create a prototype analytics environment before actually buying the solution.
  • Facilitate application access wherever you are and improve collaboration between peers.
  • Access to new emerging capabilities over time with ease, with continuous delivery approaches.
  • Support heavy lifting with the cloud’s elasticity and scalability along the analytics process.

 

A data catalog, the new essential solution for cloud data management strategies

Data and analytics leaders will inevitably engage with more than one cloud, where data management, governance and integration become more complex than ever before. Thus, data leaders must equip their organization with metadata management solutions that help find and inventory data distributed across hybrid and multi-cloud ecosystems. Failure to do so will result in a proliferation of data silos, leading to derailed data management, analytics and data science projects.

Data management teams will have to choose the most relevant data catalog among the wide range available on the market.

We like to define a data catalog as a way to create and maintain an inventory of data assets through the discovery, description and organization of distributed datasets.
If you are working on a data catalog project, you will find two kinds of players:

  • On the one hand, fairly old players, initially positioned in the Data Governance market.
    These players provide on-premises solutions with rich but complex offers, which are expensive, difficult and time-consuming to deploy and maintain, and are designed for cross-functional governance teams. Their value proposition is focused on control, risk management and compliance.

  • On the other hand, suppliers of data infrastructures (Amazon, Google, Microsoft, Cloudera, etc.) or data processing solutions (Tableau, Talend, Qlik, etc.), for which metadata management is an essential building block to complete their offer. They offer much more pragmatic (and less costly) solutions, but these are often highly technical and limited to their own ecosystem.

We consider these alternatives insufficient. Here are some essential guidelines for choosing your future data catalog. It must:

– Be a cloud data catalog, enabling competitive pricing and rapid ROI for your organization.
– Offer universal connectivity, adapting to all systems and all data strategies (edge, cloud, multi-cloud, cross-cloud, hybrid).
– Provide very advanced automation for the collection and enrichment of data assets as well as their attributes and links (an augmented catalog). Automatic feeding mechanisms, along with suggestion and correction algorithms, reduce the overall cost of the catalog and guarantee the quality of the information it contains.
– Be strongly focused on user experience, especially for business users, to drive adoption of the solution.

 

Cloud data management -2

To conclude, data management capabilities are becoming more and more cloud-first and in some cases cloud-only.

Data leaders who want to drive innovation in analytics will need to leverage cloud technologies across their data assets. They will have to go from ingestion to transformation without forgetting to invest in an efficient data catalog in order to find their way in an ever more complex data world.


WhereHows: A data discovery and lineage portal for LinkedIn

by Zeenea Software | Apr 20, 2020 | Data inspiration, Metadata Management

Metadata is becoming increasingly important for modern data-driven enterprises. In a world where the data landscape is increasing at a rapid pace, and information systems are more and more complex, organizations in all sectors have understood the importance of being able to discover, understand and trust in their data assets.

Whether your business is in the streaming industry like Spotify or Netflix, the ride-sharing industry like Uber or Lyft, or the rental business like Airbnb, it is essential for data teams to be equipped with the right tools and solutions that allow them to innovate and produce value with their data.

In this article, we will focus on WhereHows, an open source project led by the LinkedIn data team, that works by creating a central repository and portal for people, processes, and knowledge around data. With more than 50 thousand datasets, 14 thousand comments, and 35 million job executions and related lineage information, it is clear that LinkedIn’s data discovery portal is a success.

 

First, LinkedIn key statistics

Founded in California in 2003 by Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly, and Jean-Luc Vaillant, the firm started out very slowly. In 2007, it finally became profitable, and by 2011 it had more than 100 million members worldwide.

As of 2020, LinkedIn has grown significantly:

  • More than 660 million LinkedIn members worldwide, with 206 million active users in Europe,
  • More than 80 million users on LinkedIn Slideshare,
  • More than 9 billion content impressions,
  • 30 million companies registered worldwide.

LinkedIn is definitely a must-have professional social networking application for recruiters, marketers, and even sales professionals. So, how does the Web Giant keep up with all of this data?

 

How it all started

Like most companies with a mature BI ecosystem, LinkedIn started out with a data warehouse team responsible for integrating various information sources into consolidated golden datasets. As the number of datasets, producers and consumers grew, the team increasingly felt overwhelmed by the colossal amount of data being generated each day. Some of their questions were:

  • Who is the owner of this data flow?
  • How did this data get here?
  • Where is the data?
  • What data is being used?

In response, LinkedIn decided to build a central metadata repository to capture metadata across all systems and surface it through a single platform to simplify data discovery: WhereHows!

What is WhereHows exactly?

WhereHows integrates with all data processing environments and extracts metadata from them.

Then, it surfaces this information via two different interfaces:

  1. A web application that enables navigation, searching, lineage visualization, discussions, and collaboration,
  2. An API endpoint that enables the automation of other data processes and applications.

This repository enables LinkedIn to solve problems around data lineage, data ownership, schema discovery, operational metadata mashup, data profiling, and cross-cluster comparison. In addition, they implemented machine-based pattern detection and association between the business glossary and their datasets, and created a community based on participation and collaboration that enables them to maintain metadata documentation by encouraging conversations and pride in ownership.

There are three major components of WhereHows:

 

  1. A data repository that stores all metadata
  2. A web server that surfaces data through API and UI
  3. A backend server that fetches metadata from other information sources

How does WhereHows work?

The power of WhereHows comes from the metadata it collects from LinkedIn’s data ecosystem. It collects the following metadata:

  • Operational metadata, such as jobs, flows, etc.
  • Lineage information, which connects jobs and datasets together,
  • Catalog information, such as the dataset’s location, its schema structure, ownership, creation date, and so on.

How they use metadata

WhereHows uses a universal model that enables data teams to better leverage the value from the metadata; for example, by conducting a search across the different platforms based on different aspects of datasets.

Also, the metadata of a dataset and the operational metadata of a job are two endpoints; the lineage information connects them together and enables data teams to trace from a dataset or job to its upstream and downstream jobs and datasets. If the entire data ecosystem is collected into WhereHows, they can trace the data flow from start to finish!
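To illustrate the idea (this is a generic sketch, not WhereHows’ actual data model or API), tracing lineage comes down to walking a graph in which datasets and jobs alternate as nodes. The dataset and job names below are invented:

```python
from collections import deque

# Hypothetical lineage graph: edges point downstream, from a source to its consumer.
downstream = {
    "raw_clickstream":  ["job_sessionize"],
    "job_sessionize":   ["sessions"],
    "sessions":         ["job_daily_rollup"],
    "job_daily_rollup": ["daily_active_users"],
}

def trace_downstream(node: str) -> list[str]:
    """Breadth-first walk from a dataset or job to everything it feeds."""
    seen, queue, order = {node}, deque([node]), []
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

print(trace_downstream("raw_clickstream"))
# ['job_sessionize', 'sessions', 'job_daily_rollup', 'daily_active_users']
```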

How they collect metadata

The method used to collect metadata depends on the source. For example, for Hadoop datasets, scraper jobs scan through HDFS folders and files, read the metadata, then store it back in the repository.
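The sketch below mimics that scraping pattern on a local file system; the paths, file formats and recorded fields are illustrative assumptions, while LinkedIn’s actual scrapers target HDFS and a much richer schema:

```python
import json
import os
import time

def scrape_folder(root: str) -> list[dict]:
    """Walk a folder tree and record basic metadata for each dataset file found."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".parquet", ".avro", ".csv")):
                continue
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            records.append({
                "location": path,
                "size_bytes": stat.st_size,
                "modified": time.strftime("%Y-%m-%d", time.localtime(stat.st_mtime)),
                "owner": "unknown",  # would be resolved from the storage system's ACLs
            })
    return records

# Store the harvested metadata back into a central repository (here: a JSON file)
with open("metadata_snapshot.json", "w") as out:
    json.dump(scrape_folder("/data/warehouse"), out, indent=2)
```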

For schedulers such as Azkaban, they connect their backend repository to get the metadata, aggregate it and transform it to the format they need, then load it into WhereHows. For the lineage information, they parse the log of a MapReduce job and a scheduler’s execution log, then combine that information together to get the lineage.

 

What’s next for WhereHows?

Today, WhereHows is actively used at LinkedIn not only as a metadata repository, but also to automate other data projects such as automated data purging for compliance. In 2016, they also integrated it with several other systems across their stack.

In the future, LinkedIn’s data teams hope to broaden their metadata coverage by integrating more systems such as Kafka or Samza. They also plan on integrating with data lifecycle management and provisioning systems like Nuage or Gobblin to enrich the metadata. WhereHows has not said its final word!

Sources:

 

  • 50 of the Most Important LinkedIn Stats for 2020: https://influencermarketinghub.com/linkedin-stats/
  • Open Sourcing WhereHows: A Data Discovery and Lineage Portal:
    https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows–a-data-discovery-and-lineage-portal

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.

data-discovery-mockup-EN-no-shadow
download our white paper

Data Ops rules to avoid Data Oops

by Zeenea Software | Apr 6, 2020 | Data inspiration

Data Ops is a new way to address the deployment of data and analytics solutions.

The success of this methodology is based on techniques that promote faster, more flexible and more reliable data delivery. To deliver on this promise, let’s take a moment and analyze this sentence: “the focus is not just on building the systems right, but also building the right systems”.

There are many different definitions, interpretations, and publications that address DataOps as a concept, but it is much more than just that. It is a way of understanding, discovering, producing analysis, and creating actionable intelligence with data. In a changing world that revolves around data, latencies in data products or their analysis are no longer acceptable.

The entire organization must be put to work to support the deployment and improvement of data and analysis projects!

 

Data Oops definition

The concept of DataOps emerged in response to the challenges of failing data systems, failing data project implementations, but also the fragility, friction or even fear, when it comes to the use of data. If you are experiencing this situation then don’t look too far… You are in the middle of a Data Oops!

In this context of Data Oops, you will agree that your data teams are struggling to deliver projects with the expected speed and reliability.

The main reasons are that companies have too many roles, too much complexity, and constantly changing requirements or objectives, making tasks difficult to frame and deliver.

This complexity is exacerbated by a lack of confidence in data, even to the point of “fearing” it. This occurs when there is limited or inconsistent coordination between the different roles involved in the construction, deployment and maintenance of data flows. We are convinced that an organization that does not know its data is doomed to fail…

 

How to succeed in your DataOps?

Simply put, DataOps is a collaborative data management practice that aims to improve the communication, integration and automation of data flows between data managers and data consumers within an organization. It is based on aligning objectives and confronting them with results. DataOps accepts failure and is built through continuous experimentation.

Here’s a list of principles for successful DataOps:

  1. Learn from DevOps, through their techniques for developing and deploying agile applications in your data and analysis work.
  2. Identify quantifiable, measurable and achievable business objectives. You will then be able to communicate more regularly, progress towards a common goal and adjust more easily.
  3. Start by identifying and mapping your data (type, format, who, when, where, why, etc.) using data catalog solutions.
  4. Encourage collaboration between different data stakeholders by providing communication channels and solutions for sharing metadata.
  5. Take care of your data, as it may produce value at any given time. Clean it, catalog it, and make it a part of your enterprise’s key assets, whether it is valuable now or not.
  6. A model may work well once, but not on the next batch of data. Over-specifying and over-engineering a model will likely not be applicable to previously unseen data or for new circumstances in which the model will be deployed.
  7. Maximize your chances of success of introducing a DataOps approach by selecting data and analysis projects that are struggling due to a lack of collaboration or are struggling to keep pace. They will allow you to better demonstrate its value.
  8. Keep it agile with short cycles: design, develop, test, release, and repeat! Keep it lean and build on incremental changes. Continuous improvement is found when a culture of experimentation is encouraged and when people learn from their failures. Remember, data science is still science!

 

In summary, what are the benefits of DataOps?

DataOps helps your business move at the speed of data, keeping pace to deliver the right data. It focuses data activities on alignment with business objectives, not on the analytic inputs (big data hype). DataOps also focuses on delivering value from all your data activities, even the smallest of which can inspire the cultural changes needed for other implementations to come.

Adopting DataOps within a culture of experimentation is good data practice and empowers innovators across the organization to start small and scale fast.

It is the path to good business practices, and the path that steers you away from Data Oops!


Everything you need to know about Data Ops

by Zeenea Software | Mar 26, 2020 | Data inspiration

“Within the next year, the number of data and analytics experts in business units will grow at three times the rate of experts in IT departments, which will force companies to rethink their organizational models and skill sets.” – Gartner, 2020.

Data & Analytics teams are becoming more and more essential in supporting various complex business processes, and many are challenged with scaling the work they do in delivering data to support their use cases. The pressure to deliver faster and with higher quality is causing data & analytics leaders to rethink how their teams are organized…

Where traditional waterfall models were implemented and used in enterprises in the past, these methodologies are now proving to be too long, too siloed, and too overwhelming!

This is where Data Ops steps in: a more agile, collaborative and change-friendly approach for managing data pipelines.

 

Data Ops definition

Gartner defines Data Ops as a “collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization”. Basically, it makes life easier for data users.

Similar to how DevOps, a set of practices that combines software development (Dev) and information-technology operations (Ops), changed the way we deliver software, DataOps uses the same methodologies for teams building data products.

While both are agile frameworks, DataOps requires the coordination of data and of everyone who works with data across the entire enterprise.

Specifically, data & analytics leaders should implement these key approaches, which have proved to deliver significant value for organizations:

  • Deployment frequency increase: shifting towards a more rapid and continuous delivery methodology enables organizations to reduce time to market.
  • Automated testing: removing time-consuming manual testing enables higher-quality data deliveries (a minimal testing sketch follows this list).
  • Metadata control: tracking and reporting metadata across all consumers in the data pipeline ensures better change management and avoids errors.
  • Monitoring: tracking data behavior and pipeline usage enables more rapid identification of both flawed data, which needs to be corrected, and good-quality data that can feed new capabilities.
  • Constant collaboration: communication between data stakeholders is essential for faster data delivery.
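As a minimal illustration of the automated-testing idea (the table and checks below are hypothetical, not a prescribed DataOps standard), a pipeline step can fail fast when the data it produces violates simple expectations:

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> None:
    """Lightweight data tests that can run automatically after each pipeline run."""
    assert not df.empty, "orders table is empty"
    assert df["order_id"].is_unique, "duplicate order_id values found"
    assert df["amount"].ge(0).all(), "negative order amounts found"
    assert df["order_date"].notna().all(), "missing order dates"

# In a real pipeline this would run in CI or after each load; here we use a tiny sample.
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.9, 5.0, 42.0],
    "order_date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-02"]),
})
check_orders(sample)
print("all data checks passed")
```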

 

Who is involved in Data Ops?

Given the importance of data and analytics use cases today, the roles involved in successful data project delivery are more numerous and more distributed than ever before. Ranging from data science teams to people outside of IT, a large number of roles are involved:

  • Business analysts,
  • Data architects,
  • Data engineers,
  • Data stewards,
  • Data scientists,
  • Data product managers,
  • Machine Learning developers,
  • Database administrators,
  • Etc.

As mentioned above, a Data Ops approach requires fluid communication and collaboration across these roles. Each collaborator needs to understand what others expect of them, what others produce, and must have a shared understanding of the goals of the data pipelines they are creating and evolving.

 

Creating channels through which these roles can work together, such as a collaboration tool, or metadata management solution, is the starting point!


How Spotify improved their Data Discovery for their Data Scientists

by Zeenea Software | Mar 19, 2020 | Data inspiration

Photo by Gavin Whitner

As the world leader in the music streaming market, Spotify is without question a firm driven by data.

Spotify has access to the biggest collections of music in the world, along with podcasts and other audio content.

Whether they’re considering a shift in product strategy or deciding which tracks they should add, Spotify says that “data provides a foundation for sound decision making”.

Spotify in numbers

Founded in 2006 in Stockholm, Sweden, by Daniel Ek and Martin Lorentzon, the leading music app’s goal was to create a legal music platform in order to fight the challenge of online music piracy in the early 2000s.

Here are some statistics & facts about Spotify in 2020:

  • 248 million active users worldwide,
  • 20,000 songs are added per day on their platform,
  • Spotify has 40% share of the global music streaming market,
  • 20 billion hours of music were streamed in 2015

These numbers not only represent Spotify’s success, but also the colossal amounts of data that is generated each year, let alone each day! To enable their employees, or as they call them, Spotifiers, to make faster and smarter decisions, Spotify developed Lexikon.

Lexikon is a library of data and insights that helps employees find and understand their data and knowledge generated by their expert community.

 

What were the data issues at Spotify?

In their article How We Improved Data Discovery for Data Scientists at Spotify, Spotify explains that they started their data strategy by migrating their data to the Google Cloud Platform and then saw an explosion in the number of datasets. They were also in the process of hiring many data specialists such as data scientists, analysts, etc. However, they explain that datasets lacked clear ownership and had little to no documentation, making it difficult for these experts to find them.

The following year, they released Lexikon as a solution to this problem.

The first release allowed Spotifiers to search and browse through available BigQuery tables as well as discover past research and analyses. However, months after the launch, data scientists were still reporting data discovery as a major pain point, spending most of their time trying to find their datasets and thereby delaying informed decision-making.

Spotify then decided to focus on this specific issue by iterating on Lexikon, with the sole goal of improving the data discovery experience for data scientists.

How does Lexikon data discovery work?

To make Lexikon work, Spotify started by conducting research on their users, their needs and their pain points. In doing so, the firm gained a better understanding of their users’ intent and used this understanding to drive product development.

 

Low intent data discovery

For example, you’ve been in a foul mood so you’d like to listen to music to lift your spirits. So, you open Spotify, browse through different mood playlists and put on the “Mood Booster” playlist.

Tah-dah! This is an example of low-intent data discovery: your goal was reached without a very specific requirement in mind.

To put this into the context of Spotify’s data scientists, especially new hires, low-intent data discovery would mean wanting to:

  • find popular datasets used widely across the company,
  • find datasets that are relevant to the work my team is doing, and/or
  • find datasets that I might not be using, but I should know about.

So, in order to satisfy these needs, Lexikon has a customizable homepage that serves personalized recommendations to users. The homepage surfaces automatically generated, potentially relevant dataset suggestions such as:

 

  • popular datasets used within the company,
  • datasets recently used by the user,
  • datasets widely used by the team the user belongs to.

High intent data discovery

To explain this in simple terms, Spotify uses the example of hearing a song and searching for it over and over in the app until you finally find it and listen to it on repeat. This is high-intent data discovery!

A data scientist at Spotify with high intent has specific goals and is likely to know exactly what they are looking for. For example, they might want to:

  • find a dataset by its name,
  • find a dataset that contains a specific schema field,
  • find a dataset related to a particular topic,
  • find a dataset a colleague used whose name they can’t remember,
  • find the top datasets that a team has used for collaborative purposes.

To fulfill their data scientists’ needs, Spotify focused first on the search experience.

They built a search ranking algorithm based on popularity. As a result, data scientists reported that search results were more relevant, and they had more confidence in the datasets they discovered because they could see which ones were most widely used across the company.
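Spotify doesn’t detail its ranking formula, so the following is only a hedged sketch of one simple way a popularity boost could be combined with text matching; the `query_count` field and the weighting are assumptions.

```python
import math

def rank_search_results(datasets, query):
    """Order candidate datasets by a naive text-match score, boosted by a
    log-scaled popularity signal (how often each dataset is queried)."""
    terms = query.lower().split()

    def score(dataset):
        text = f"{dataset['name']} {dataset.get('description', '')}".lower()
        text_match = sum(term in text for term in terms)
        popularity_boost = math.log1p(dataset.get("query_count", 0))
        return text_match * (1.0 + popularity_boost)

    return sorted(datasets, key=score, reverse=True)

# The widely used table ranks above the rarely used one for the query "track".
tables = [{"name": "track_metadata_staging", "query_count": 3},
          {"name": "track_plays_daily", "query_count": 4200}]
print([t["name"] for t in rank_search_results(tables, "track")])
```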

In addition to improving search ranking, they introduced new types of properties (schemas, fields, contacts, teams, etc.) to Lexikon to better represent their data landscape.

These properties open up new pathways for data discovery. For example, a data scientist searching for a “track_uri” can navigate to the “track_uri” schema field page and see the top tables containing this information. Since its addition, this feature has proven to be a critical pathway for data discovery, with 44% of Lexikon users visiting these types of pages.


Final thoughts on Lexikon

Since making these improvements, the use of Lexikon amongst data scientists has increased from 75% to 95%, putting it in the top 5 tools used by data scientists!

Data discovery is thus no longer a major pain point for Spotifiers.

Sources:

Spotify Usage and Revenue Statistics (2019): https://www.businessofapps.com/data/spotify-statistics/
How We Improved Data Discovery for Data Scientists at Spotify: https://labs.spotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/
75 amazing Spotify Statistics and Facts (2020): https://expandedramblings.com/index.php/spotify-statistics/

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


Metadata through the eyes of Web Giants


by Zeenea Software | Mar 17, 2020 | Data inspiration

Data life cycle analysis is an element in data management that enterprises are still struggling to implement.

Organizations at the forefront of data innovation such as Uber, LinkedIn, Netflix, Airbnb and Lyft have seen the value of metadata in tackling a challenge of this magnitude.

They have thus developed metadata management strategies built on dedicated platforms. Frequently developed in-house, these platforms facilitate data ingestion, indexing, search, annotation and discovery in order to maintain high-quality datasets.

The following examples highlight a shared constant: the difficulty, increased by volume and variety, of transforming business data into exploitable knowledge.

Let’s take a look at the analysis and context of these Web giants:

Uber

Every interaction on Uber’s platform, from their ride sharing services to their food deliveries, is data-driven. Through analysis, their data enables more reliable and relevant user experiences.

Uber’s key stats

  • thousands of billions of Kafka messages a day,
  • hundreds of petabytes of data in HDFS in data centers,
  • millions of analytical queries weekly.

However, the volume of data generated alone is not sufficient to leverage the information it represents; to be used effectively and efficiently, data requires more context to make optimal business decisions.

To provide additional information, Uber therefore developed “Databook”, the company’s internal platform that collects and manages metadata on internal datasets in order to transform data into knowledge.

Databook is designed to enable Uber employees to effectively explore, discover and use Uber’s data. Databook gives context to their data (its meaning, quality, etc) and ensures that it is maintained in its platform for the thousands of employees who want to analyze the data. In short, Databook’s metadata enables data leaders to move from viewing raw data to actionable knowledge.

In the article Databook: Turning Big Data into Knowledge with Metadata at Uber, Uber concludes that one of Databook’s biggest challenges was moving from manual metadata repository updates to automation.

Airbnb

At a conference in May 2017, John Bodley, Data Engineer at AirBnB, outlined new issues arising from the company’s growth: a confusing, non-unified landscape that was blocking access to increasingly important information.
What can we do with all this data collected on a daily basis? How do we turn it into an asset for all Airbnb employees?

A dedicated team set out to develop a tool that would democratize access to data within the company. Their work drew both on the knowledge of the analysts, who could pinpoint the critical issues, and on that of the engineers, who could offer a more technical vision. At the heart of the project, employees were interviewed about the problems they faced.

What emerged from this survey was a difficulty in finding the information employees needed to work, and a still too tribal approach to sharing and holding information.

To meet these challenges, AirBnB created Data Portal, a metadata management platform that centralizes this information and shares it in self-service.

Lyft

Lyft is a ride-sharing service and is Uber’s main competitor in the North American market.

The company found it was providing data access to its analytical profiles inefficiently, and focused on making data knowledge available in order to optimize its processes. Their goal of building, within just a few months, an interface for searching data raised two major challenges:

  • Productivity – Whether it’s to create a new model, instrument a new metric, or perform an ad hoc analysis, how can Lyft use this data in the most productive and efficient way possible?
  • Compliance – When collecting data about an organization’s users, how can Lyft comply with increasing regulatory requirements and maintain the trust of its users?

In their article Amundsen – Lyft’s data discovery & metadata engine, Lyft states that the key does not lie in the data, but in the metadata!

Netflix

As the world leader in video streaming, data exploitation at Netflix is, of course, a major strategic focus.

Given the diversity of their data sources, the video platform wanted to offer a way to federate and interact with these assets from a single tool. This search for a solution led to Metacat.

This tool acts as a layer of access to data and metadata from Netflix data sources. It allows its users to access data from any storage system through three different features:

  1. Adding business metadata: free-form, user-defined business metadata can be added via Metacat.
  2. Data discovery: The tool publishes schema and business metadata defined by its users in Elasticsearch, facilitating full-text search of information in data sources.
  3. Data Change Notification and Auditing: Metacat records and notifies all changes to metadata from storage systems.

In their blog article “Metacat: Making Big Data Discoverable and Meaningful at Netflix”, the firm confirms that they are far from finished working on their solution!

There are a few more features they have yet to work on to improve the data warehousing experience:

  • Schema and metadata versioning to provide table history.
  • Provide contextual information on tables for better data lineage.
  • Add support for datastores like Elasticsearch and Kafka.

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


Amundsen: How Lyft is able to easily discover their data


by Zeenea Software | Feb 27, 2020 | Data inspiration

In our last article, we spoke of Uber’s Databook, an in-house platform designed by their very own engineers with the aim of turning data into contextualized assets. In this article, we will focus on Lyft’s very own data discovery and metadata platform: Amundsen.

Following Uber’s success, the ride-sharing market saw a major wave of competitors arrive, and among them is Lyft.

Lyft key figures & statistics

Founded in 2012 in San Francisco, Lyft operates in more than 300 cities across the United States and Canada. With over 29% of the US ride-sharing market, Lyft has secured second position for itself, running neck and neck with Uber. Some key statistics on Lyft include:

  • 23 million Lyft users as of January 2018,
  • More than a billion Lyft rides,
  • 1.4 million drivers (Dec. 2017).

And of course, those numbers have turned into colossal amounts of data to manage! In a modern data-driven company such as Lyft, it is evident that the platform is powered by its data. With the rapid growth of the data landscape, it becomes increasingly difficult to know what data exists, how to access it, and what information is available.

This problem led to the creation of Amundsen, Lyft’s open source data discovery solution and metadata platform.

Let’s get to know Amundsen

Named after the Norwegian explorer Roald Amundsen, the platform improves Lyft’s data users’ productivity by providing an intuitive search interface for data.

While Lyft’s data scientists wanted to spend the majority of their time on model development and production, they realized that most of it was being spent on data discovery. They would find themselves asking questions such as:

  • Does this data exist? If it does, where can I find it? Can I access it?
  • Who / which team is the owner? Who are the common users?
  • Can I trust this data?

To answer these questions, Lyft was inspired by search engines like Google.

The entry point is a simple search box where users can type any keyword such as “customers”, “employees” or “price”. However, if the data user does not know what they are looking for, the platform presents a list of the most popular tables, so they can browse through them freely.

 

Some key features:

The search results are shown as a list, with each table’s description and the date it was last updated. The ranking is similar to Google’s PageRank: the most popular and relevant tables show up in the first results.

When a data user at Lyft finds what they’re looking for and selects a result, they are directed to a detail page showing the table’s name as well as its manually curated description. Users can also manually add tags, owners, and other descriptions. However, much of the metadata is automatically curated, such as the table’s popularity or even its frequent users.

When in a table, users are able to explore the associated columns to further discover the table’s metadata.

For example, selecting the column “distance_travelled” displays a short definition of the field and related statistics, such as the record count, minimum, maximum and average values, so data scientists can better understand the shape of their data.
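Lyft’s own profiling code isn’t public; as a simple illustration of the kind of per-column statistics described above, here is a minimal sketch in which the record layout and column name are assumptions.

```python
def column_stats(records, column):
    """Profile a numeric column the way a catalog page might: record count,
    minimum, maximum and average of the non-null values."""
    values = [row[column] for row in records if row.get(column) is not None]
    if not values:
        return {"count": 0}
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": sum(values) / len(values)}

rides = [{"distance_travelled": 3.2}, {"distance_travelled": 12.7}, {"distance_travelled": None}]
print(column_stats(rides, "distance_travelled"))
# {'count': 2, 'min': 3.2, 'max': 12.7, 'avg': 7.95}
```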

Lastly, users can preview the dataset’s data by pressing the preview button on the page. Of course, this is only possible if the user has access to the underlying data in the first place.

How Amundsen democratizes data discovery

Showing the relevant data

Amundsen now empowers all employees at Lyft, from new employees to the most experienced, to become autonomous in their data discovery for their daily tasks.

Now let’s get technical. Lyft’s data warehouse is on Hive, and all physical partitions are stored in S3. Their data users rely on Presto, an interactive query engine, for table discovery. In order for the search engine to surface the most important or relevant tables, Lyft uses its DataBuilder framework to build a query usage extractor that parses query logs for table usage data, which is then persisted as Elasticsearch table documents. That, in short, is how they retrieve the most relevant datasets for their data users.
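Lyft’s DataBuilder framework is not reproduced here; purely as a hedged illustration of the idea, this sketch counts table references in SQL query logs and shapes the counts as documents that could then be indexed for search. The regular expression and the document fields are assumptions.

```python
import re
from collections import Counter

TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def extract_table_usage(query_logs):
    """Count how often each table is referenced across SQL query logs."""
    usage = Counter()
    for query in query_logs:
        usage.update(t.lower() for t in TABLE_REF.findall(query))
    return usage

def to_documents(usage):
    """Shape the usage counts as documents ready to be indexed for search."""
    return [{"table": table, "query_count": count} for table, count in usage.most_common()]

logs = ["SELECT * FROM rides.trips JOIN rides.drivers ON trips.driver_id = drivers.id",
        "SELECT count(*) FROM rides.trips WHERE city = 'SF'"]
print(to_documents(extract_table_usage(logs)))
# [{'table': 'rides.trips', 'query_count': 2}, {'table': 'rides.drivers', 'query_count': 1}]
```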

Connecting data with people

As much as we like to claim how technical and digital we all are, the process of finding data consists mainly of interactions with people. And the notion of data ownership is often unclear; finding data is very time consuming unless you know exactly who to ask.

Amundsen addresses this issue by creating relationships between users and data; tribal knowledge is shared by exposing these relationships.

Lyft currently has three types of relationships between users and data: followed, owned and used. This information helps experienced employees become helpful resources for other employees with a similar job role. Amundsen also makes the tribal knowledge easier to find thanks to a link to each user profile on the internal employee directory.
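The three relationship types come from Lyft’s article; how they are stored internally is not documented, so the following is only a minimal sketch of such a user-to-table relationship model, with made-up users and tables.

```python
from enum import Enum

class Relationship(Enum):
    FOLLOWED = "followed"
    OWNED = "owned"
    USED = "used"

# (user, relationship, table) triples: tribal knowledge made explicit.
edges = [("alice", Relationship.OWNED, "rides.trips"),
         ("bob", Relationship.USED, "rides.trips"),
         ("bob", Relationship.FOLLOWED, "rides.drivers")]

def people_to_ask(table):
    """People worth contacting about a table: owners first, then known users."""
    owners = [u for u, rel, t in edges if t == table and rel is Relationship.OWNED]
    users = [u for u, rel, t in edges if t == table and rel is Relationship.USED]
    return owners + users

print(people_to_ask("rides.trips"))  # ['alice', 'bob']
```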

They’ve also been working on a notifications feature that would allow users to request more information from data owners, for example about a missing table description.

If you’d like more information on Amundsen, please visit their website here.

What’s next for Lyft

Lyft hopes to keep working with a growing community to enhance the data discovery experience and boost user productivity. Their roadmap currently includes an email notification system, data lineage, a UI/UX redesign, and more!

The ride-sharing company has not said its final word yet!

Sources:

Lyft – Statistics & Facts: https://www.statista.com/topics/4919/lyft/
Lyft And Its Drive Through To Success: https://www.startupstories.in/stories/lyft-and-its-drive-through-to-success
Lyft Revenue and Usage Statistics (2019): https://www.businessofapps.com/data/lyft-statistics/
Presto Infrastructure at Lyft: https://eng.lyft.com/presto-infrastructure-at-lyft-b10adb9db01?gi=f100fa852946
Open Sourcing Amundsen: A Data Discovery And Metadata Platform: https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234
Amundsen — Lyft’s data discovery & metadata engine: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


Databook: How Uber turns data into exploitable knowledge with metadata


by Zeenea Software | Feb 17, 2020 | Data inspiration

Uber is one of the most fascinating companies to emerge over the past decade. Founded in 2009, Uber grew to become one of the highest valued startup companies in the world! In fact, there is even a term for their success: “uberization” which refers to changing the market for a service by introducing a different way of buying or using it, especially using mobile technology.

From peer-to-peer ride services to restaurant orders, it is clear Uber’s platform is data-driven. Data is at the center of Uber’s global marketplace, creating better user experiences across their services for their customers, as well as empowering their employees to be more efficient at their jobs.

However, Big Data by itself wasn’t enough; the amount of data generated at Uber requires context to make business decisions. So, as other unicorn companies have done, such as Airbnb with Data Portal, Uber’s engineering team built Databook. This internal platform aims to scan, collect and aggregate metadata in order to clarify where data sits in Uber’s information systems and who is responsible for it. In short, it is a platform meant to transform raw data into contextualized data.

 

How Uber’s business (and data) grew

Since 2016, Uber has added new lines of businesses to its platform including Uber Eats and Jump Bikes. Some statistics on Uber include: 

  • 15 million trips a day
  • Over 75 million active riders
  • 18,000 employees since its creation in 2009

As the firm grew, so did its data and metadata. To ensure that their data & analytics could keep up with their rapid pace of growth, they needed a more powerful system for discovering their relevant datasets. This led to the creation of Databook and its metadata curation.

 

The coming of Databook

The Databook platform manages rich metadata about Uber’s datasets and enables employees across the company to explore, discover, and efficiently use their data. The platform also ensures their data’s context isn’t lost among the hundreds of thousands of people trying to analyse it. All in all, Databook’s metadata empowers all engineers, data scientists and IT teams to go from just visualizing their data to turning it into exploitable knowledge.

 

Databook automatically collects a wide variety of frequently refreshed metadata from Hive, MySQL, Cassandra and other internal storage systems. To make this metadata accessible and searchable, Databook offers its consumers a search-based user interface as well as a RESTful API.
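Uber has not published Databook’s API, so the snippet below is purely illustrative: a hedged sketch of how a service might query such a metadata API over REST, where the endpoint, parameters and response shape are all hypothetical.

```python
import requests

DATABOOK_URL = "https://databook.internal.example.com/api"  # hypothetical endpoint

def find_datasets(keyword):
    """Search a metadata service for datasets matching a keyword and return
    (name, owner) pairs from the assumed JSON response."""
    response = requests.get(f"{DATABOOK_URL}/datasets", params={"q": keyword}, timeout=10)
    response.raise_for_status()
    return [(d["name"], d.get("owner")) for d in response.json()]

# e.g. find_datasets("trips") -> [("trips_daily", "marketplace-team"), ...]
```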

 

Databook’s architecture

Databook’s architecture is broken down into three parts: how the metadata is collected, stored, and how its data is surfaced.

Conceptually, the Databook architecture was designed to enable four key capabilities:

  • Extensibility: New metadata, storage, and entities are easy to add.
  • Accessibility: Services can access all metadata programmatically.
  • Scalability: Support business user needs and technology novelty
  • Power & speed of execution

To go further on Databook’s architecture, please read their article https://eng.uber.com/databook/

What’s next for Databook?

With Databook, metadata at Uber is now more useful than ever!

But they still hope to develop other functionalities, such as the ability to generate data insights with machine learning models and to create advanced issue detection, prevention, and mitigation mechanisms.

Sources

  • Databook: Turning Big Data into Knowledge with Metadata at Uber: https://eng.uber.com/databook/
  • How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions: https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb
  • The Story of Uber https://www.investopedia.com/articles/personal-finance/111015/story-uber.asp
  • The definition of uberization, Cambridge dictionary: https://dictionary.cambridge.org/dictionary/english/uberization

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


What is Data Fingerprinting and similarity detection?


by Zeenea Software | Dec 3, 2019 | Data inspiration

With the emergence of Big Data, enterprises found themselves with a colossal amount of data. In order to understand and analyse their data, as well as meet the various regulatory requirements, it is vital for organizations to document their data assets. However, documenting and giving context to thousands of datasets is a very difficult, even impossible, task to do by hand.

Instead, you can use Data Fingerprinting!

What is Data Fingerprinting?

In the data domain, a fingerprint is a “signature” computed for a data column. The goal is to give context to these columns.

Via this technology, data fingerprinting can automatically detect similar datasets across your databases and document them more easily, making data stewards’ tasks less tedious and more efficient. For example, under a data steward’s supervision, fingerprinting technologies make it possible to recognize that a column containing the values “France”, “United States”, and “Australia” represents countries.
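Production fingerprinting relies on richer signals and machine learning, but as a minimal sketch of the general idea, a column can be reduced to the set of its normalized values and two columns compared with Jaccard similarity; the threshold at which a match would be suggested is an assumption.

```python
def fingerprint(column_values):
    """A crude column 'signature': the set of normalized distinct values."""
    return frozenset(str(v).strip().lower() for v in column_values if v is not None)

def similarity(fp_a, fp_b):
    """Jaccard similarity between two fingerprints (0 = disjoint, 1 = identical)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

countries_a = fingerprint(["France", "United States", "Australia"])
countries_b = fingerprint(["france", "Australia", "Spain"])
print(round(similarity(countries_a, countries_b), 2))  # 0.5 -> likely the same kind of column
```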

Data Fingerprinting at Zeenea

In Zeenea’s case, our metadata management platform’s objective is to give meaning and context to your catalogued datasets in the most automatic way possible. With our Machine Learning technologies, Zeenea identifies dataset schema columns, analyses them and gives each its own “signature”. If any of these fingerprints are similar, our Data Catalog suggests that the Data Steward apply the same documentation from one column to the other.

This technology also gives DPOs, among others, a means to flag personal or sensitive information that the organization holds in its databases.

Contact us

How does data visualization bring value to an organization?


by Zeenea Software | Oct 1, 2019 | Data inspiration

Data visualization definition

Data visualization is defined as a graphical representation of data. It is used to help people understand the context and significance of their information by showing patterns, trends and correlations that may be difficult to interpret in plain text form.

These visual representations can be in the form of graphs, pie charts, heat maps, sparklines, and much more.
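As a small illustration of the idea, here is a minimal matplotlib sketch that turns a table of (made-up) monthly figures into a trend line that is easier to read than the raw numbers.

```python
import matplotlib.pyplot as plt

# Illustrative, made-up monthly figures.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 160, 158, 210, 265]

plt.plot(months, signups, marker="o")
plt.title("Monthly sign-ups")
plt.xlabel("Month")
plt.ylabel("Sign-ups")
plt.show()
```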

What are the advantages of data visualization?

In BI, or Business Intelligence, data visualization is already a must-have feature. With the emergence of Big Data, data visualization is becoming even more critical in helping data citizens make sense of the masses of data generated every day. Not only does it help data citizens curate their data into an easy-to-understand visual representation, it also allows employees to save time and work more efficiently.

Data visualization also helps democratize data for everyone within an organization. Data leaders such as Chief Data Officers see in this discipline a way to replace intuition-based decision-making with data analysis, and thus a way to evangelize a data-driven culture within their enterprises.

How can you get more value from modern data visualization platforms?

Most organizations that adopt data visualization tools struggle to visually represent their data in a way that maximizes data value. However, modern data visualization tools are expanding to include new use cases. These tools enable enterprises to find and communicate opportunities on important data analysis. Their strengths are:

Better communication and understanding of data

Data visualization allows employees, even those unfamiliar with data, to understand, analyse and communicate about data through new, more interactive formats. The corporate drive to become data-driven leads companies to better inform and train their organizations on how to use data visualization tools and choose the relevant formats: heat maps, bubble charts, tree maps, waterfall charts, etc.


Source: https://www.maptive.com/17-impressive-data-visualization-examples-need-see/

More interactions on data analysis

Data reporting is becoming more collaborative in organizations, and presenting data is now a daily activity. Data visualization is therefore becoming more “responsive”, adapting to any device and any place where data is shared. These tools embrace web and mobile techniques to share data stories and explore data collaboratively. Large-format screens, for instance, create a more shared understanding of the data in management meetings.

Supporting data storytelling

Data storytelling is about communicating findings rather than monitoring or analyzing their progress. Companies such as Data Telling and Nugit specialize in this. With the use of infographics, data visualization platforms can support data storytelling techniques in communicating the meaning of the data to management teams. These kinds of representations grab people’s attention and help them better recall the information later.

An automatic data visualization

Data users increasingly expect their analytics software to do more for them. Augmented data visualization is particularly useful when people are not sure which visual format best suits the dataset they want to explore or analyze. These automatic features are best suited to citizen data scientists, who can then spend their time analyzing data and finding new use cases rather than building visualizations.

 

Gartner’s top Analytics & BI platforms

According to Gartner, the analytics and business intelligence platform leaders are:


  • Microsoft: Power BI by Microsoft is a customizable data visualization toolset that gives you a complete view of your business. It allows employees to collaborate and share reports inside and outside their organization and spot trends as they happen. Click for more information.


  • Tableau: Tableau helps people transform data into actionable insights. They allow users to explore with limitless visual analytics, build dashboards, perform ad hoc analyses, and more. Find out more about Tableau.


 

  • Qlik: With Qlik, users are able to create smart visualizations, and drag and drop to build rich analytics apps accelerated by suggestions and automation from AI. Read more about Qlik.


  • ThoughtSpot: ThoughtSpot allows users to get granular insights from billions of rows of data. With AI technology, they can uncover insights from questions they might not have thought to ask. Click for more information on ThoughtSpot.

In conclusion: why should enterprises use data visualization?

The main reasons that data visualization is important to enterprises, among others, are:

  • Data is easier to understand and remember
  • Visualizing data trends and relationships is quicker
  • Users are able to discover insights in their data that they couldn’t have seen before
  • Data leaders can make better, data-driven decisions

Artificial Intelligence Conferences at AI Paris 2019


by Zeenea Software | Jun 25, 2019 | Data inspiration

As a sponsor for this year’s AI Paris 2019, we were able to attend several conferences based on artificial intelligence. Among them, two very interesting keynotes.

Human resources & the future of AI

For the second year in a row, Malakoff Médéric Humanis published their survey on artificial intelligence and human resources. The private French health insurance company interviewed nearly 1,800 executives, managers, and employees about their vision of AI in their enterprises. David Giblas, Chief Innovation, Digital and Data Officer at Malakoff Médéric, explains that there is a general awareness of the importance of artificial intelligence. However, even in 2019, it is still not considered a strategic asset by enterprises. It is time for enterprises to change in order to transform. But one question remains: what are the impacts of artificial intelligence on human resources?

The Malakoff Médéric experts explain that today, ethics is at the heart of AI concerns in enterprises. In fact, it is the main subject that worries them! David Giblas reports that 78% of executives feel that it is up to the Human Resources department to fight against the ethical biases that could be introduced by artificial intelligence. One example is an algorithm used in the recruitment process: it could discriminate against certain people’s resumes based on their last name, address, family situation, etc.

“These organizational changes take time and their success will mainly depend on their managerial accompaniment.” He adds: “Employees and managers must adapt AI into their daily lives, and learn how it changes the way they [employees and managers] work and create value. This will demystify the power of AI among employees.”

43% of employees fear that their activities will be replaced by automated tasks. Executives and managers are more optimistic: they expect artificial intelligence to create new jobs and a hybridization of activities, combining artificial and human intelligence, over the next 5 years. This concept is very similar to J. Schumpeter’s “creative destruction”, an economic process in which the disappearance of economic activities coincides with the creation of new ones.

According to Malakoff Médéric’s study, it is the managerial functions’ responsibility, more specifically HR, to deploy and adopt artificial intelligence within their enterprises. In order for employees to trust in AI, it is up to the HR department to demystify and facilitate its arrival by helping to structure, train and reflect on ethical issues in their processes.

Does artificial intelligence really exist?

Luc Julia, VP Innovation at Samsung Electronics and founder of “Siri,” asks the question: “Can we actually talk about artificial intelligence today?”

We tend to imagine artificial intelligence as shown in the movies: big and scary intelligent machines that will take over the world. Except, according to Luc Julia, with our current methods: AI doesn’t exist!

Luc Julia begins by giving us some examples of “artificial intelligence”. In 1997, Deep Blue (a supercomputer specializing in chess) defeated the world chess champion Garry Kasparov. “Deep Blue’s victory was evident! The machine was programmed to know and anticipate all chess moves prior to the game. We modeled all of the possibilities in chess (10 to the 53rd power), and for one man, that’s way too much!” Julia claims.

Julia also talks about AlphaGo, program that defeated the Go world champion in 2014. “With the game of Go, it’s a bit different because it’s impossible to model all of the possibilities. Some of them were modeled and statistical models filled in the gaps of those they couldn’t model. So there is no intelligence, it’s simply an enormous volume of data and a bit of statistics.” He adds, “AlphaGo is more than 1500 CPUs (processors), 300 GPUs, and 440 kWh. A human is 20 kWh! Plus, a human can do way more than just play a game of Go.”

It is clear that, according to Luc Julia, artificial intelligence does not exist and is always mathematically explainable. “Today, what we call artificial intelligence is either expert systems based on rules, or systems that are based on data (machine learning). If we want to use ‘real’ artificial intelligence, we will have to resort to other methods than those used today.”


How to become Data Driven according to Charlotte Tilbury


by Zeenea Software | May 6, 2019 | Data inspiration

Zeenea’s participation in the AI & Big Data Global Expo in London on the 25th and 26th of April has officially opened the window in becoming the leading data catalog solution for data-driven enterprises. Zeenea is confident that the core of every company’s success is the ability to leverage its data assets, which can be achieved by being a truly data-driven enterprise.

During this expo, we attended some Big Data Business Solutions conferences that aimed to inform and educate on how data assets are the make-or-break of successful business decisions. A common theme across the board was how Data Science and Business Analytics are an integral component of adding value within enterprises. But how exactly can this be built into an existing company?

Dr. Andreas Gertsch Grover, the director of Data Science at Charlotte Tilbury shed light on this hot topic in his conference, How small steps get you to the promised land of a data-driven company, by showing us examples of what actually doesn’t work.

A make-up brand’s own sensational makeover

A UK beauty and makeup brand, Charlotte Tilbury is growing at a rapid rate, with a pre-money valuation at $561.22m. With revenues doubling every year, Charlotte Tilbury is headed towards becoming a unicorn company by the end of 2019 [1]. Aiming to be the best selling celebrity make-up brand, the company invested in building a Data Science team in an effort to use prediction models to vamp up their marketing and customer personalization.

With Dr. Andreas Gertsch Grover leading the way, he explains how Charlotte Tilbury has managed to build a data-driven culture to deliver successful data science projects.

The discrepancy between a company’s expectations and a data scientist’s role

“Know the roles you need in the company and not just hire a data scientist,” says Grover. Data Science projects are very complicated and need to involve all employees in the enterprise. To list a few issues data scientists can face when they join a company:

  • There is no Data Science infrastructure.
  • There are loads of data with only some identified areas in need of improvement.
  • Access to data is difficult with no documentation on these data. 

Thus, data scientists are forced to build their own environments and laboriously work on large Data Science projects virtually on their own. But when prediction models are created, they ultimately aren’t used, as the company doesn’t even know how to apply them to its particular systems!

So what are the steps that need to be taken to close the gap between a company’s expectations and a data scientist’s role?

The must dos

Grover explains that due to the complex nature of Data Science projects, they must start small and be treated iteratively. This way, everyone in the company can be involved in the learning process together. Within this collaborative framework, both employees and business stakeholders will be able to understand the business and ask the right questions, which will lead to the next small, successful project.

The must-haves

Grover stresses the necessity of using tools when researching and developing projects. As data acquisition and exploration can take up an enormous amount of time, investing in tools that expedite the process saves precious time and improves efficiency. Everyone should be able to independently find the data they need. Meeting this particular need is Zeenea’s main goal in providing a data catalog.

The promised land of a data-driven company

Understanding and managing a company’s expectations is never easy but if everybody in an enterprise works together, the Promised Land of becoming a data-driven company is attainable. By working in small steps, iteratively, employees can learn, collaborate, and deliver major business turnovers that are both tried and true.

Sources

[1] Armstrong, P. (2018, August 13). Here are the U.K. Companies That Will Be Unicorns In 2019 Retrieved from https://www.forbes.com/


Google Goods: The management and data democratization tool of Google


by Zeenea Software | Apr 10, 2019 | Data inspiration

When you’re called Google, the data issue is more than just central. A colossal amount of information is generated every day throughout the world, by all teams in this American empire. Google Goods, a centralized data catalog, was implemented to cross-reference, prioritize, and unify data.

This article is a part of a series dedicated to data-driven enterprises. We highlight successful examples of democratization and mastery of data within inspiring companies. You can find the Airbnb example here. These trailblazing enterprises demonstrate the ambition of Zeenea and its data catalog: to help organizations better understand and use their data assets.

Google in a few figures

The most-used search engine on the planet doesn’t need any introduction. But what is behind this familiar interface? What does Google represent in terms of market share, infrastructure, employees, and global presence?

In 2018, Google had [1]:

  • 90.6% market share worldwide
  • 30 million indexed sites
  • 500 million new requests every day

In terms of infrastructure and employment, Google represented in 2017 [2]:

  • 70,053 employees
  • 21 offices in 11 countries
  • 2 million computers in 60 datacenters
  • 850 terabytes to cache all indexed pages

Given such a large scale, the amount of data generated is inevitably huge. Faced with the constant redundancy of data and the need for precision for its usage, Google implemented Google Goods, a data catalog working behind the scenes to organize and facilitate data comprehension.

The insights that led to Google Goods

Google possesses more than 26 billion internal datasets [3]. And this includes only the data accessible to all the company’s employees.

Taking into account sensitive datasets behind restricted access, that number could double. This amount of data was bound to generate problems and questions, which Google listed as reasons for designing its tool:

An enormous data scale

Considering the figure previously mentioned, Google was faced with a problem that couldn’t be ignored. The sheer quantity and size of data made it impossible to process them all. It was therefore essential to determine which ones are useful and which ones aren’t.

The system already excludes certain information deemed unnecessary and is successful in identifying some redundancies. It is therefore possible to create unique access paths to data without storing it in different places within the catalog.

Data variety

Datasets are stored in a number of formats and in very different storage systems. This makes it difficult to unify data. For Goods, it is a real challenge with a crucial objective: to provide a consistent way to query and access information without exposing the infrastructure’s complexity.

Data relevance

Google estimates that around 1 million datasets are both created and deleted on a daily basis. This emphasizes the need to prioritize data and establish their relevance. Some are crucial in processing chains but only have value for a few days; others have a scheduled end of life ranging from a few hours to several weeks.

The uncertain nature of metadata

Many of the cataloged datasets come from different protocols, making metadata certification complex. Goods therefore proceeds by trial and error, forming hypotheses, because it operates on a post hoc basis. In other words, collaborators don’t have to change the way they work: they are not asked to attach metadata to datasets when they create them. It is up to Goods to collect and analyze the data, bring it together, and clarify it for future use.

A priority scale

After working on discovery and cataloging, the question of prioritization arises. The challenge is being able to answer the question: “What makes a dataset important?” Providing an answer is much less simple for an enterprise’s data than for ranking web searches, for example. In an attempt to establish a relevant ranking, Goods relies on the interactions between data, metadata, and other criteria. For instance, the tool considers a dataset more important if its author has attached a description to it, or if several teams consult, use or annotate it.
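The Goods paper does not give an exact formula, so the following is only a hedged sketch of how such interaction signals could be combined into a single importance score; the signal names and weights are assumptions.

```python
def importance_score(dataset):
    """Combine a few catalog signals into one ranking score: datasets with a
    description, many consuming teams and many annotations rank higher."""
    score = 0.0
    score += 2.0 if dataset.get("has_description") else 0.0
    score += 1.0 * len(dataset.get("consuming_teams", []))
    score += 0.5 * dataset.get("annotation_count", 0)
    score += 0.1 * dataset.get("pipeline_references", 0)
    return score

datasets = [{"name": "ads_clicks", "has_description": True, "consuming_teams": ["ads", "search"]},
            {"name": "tmp_export_42", "consuming_teams": []}]
print(max(datasets, key=importance_score)["name"])  # ads_clicks
```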

Semantic data analysis

Carrying out this analysis makes it possible, in particular, to better classify and describe the data in the search tool, so that the catalog can return the right information for a given request. An example is given in the Google Goods reference article [3]: suppose the schema of a dataset is known and certain fields of the schema take integer values. Through inference on the dataset’s content, Goods can identify that these integer values are IDs of known geographical landmarks, and then use this content semantics to improve searches for geographical data in the tool.

Google Goods features

Google Goods catalogs and analyzes the data to present it in a unified manner. The tool collects the basic metadata and tries to enrich them by analyzing a number of parameters. By repeatedly revisiting data and metadata, Goods is able to enrich itself and evolve.

The main functions offered to users are:

A search engine

Like the Google we know, Goods offers a keyword search engine to query datasets. This is where the challenge of data prioritization comes into play. The search engine ranks datasets according to different criteria, such as the number of processing chains involved, the presence or absence of a description, etc.

Data presentation page

Each dataset has a page containing as much information as possible. Since certain datasets can be linked to thousands of others, Google compresses the information upstream, keeping what is recognized as most crucial so that it remains comprehensible on a presentation page. If the compressed version is still too large, only the most recent entries are presented.

Team boards

Goods provides boards that group all the datasets generated by a team. This makes it possible, for example, to obtain different metrics and to link to other boards. A board is updated each time Goods adds metadata, and it can easily be embedded in different documents so that teams can share it.

In addition, it is also possible to implement monitoring actions and alerts on certain data. Goods is in charge of the verifications and can notify the teams in case of an alert.

Goods usage by Google employees

Over time, Google’s teams came to realize that the tool’s usage, as well as its scope, was not necessarily what the company expected.

Google was thus able to determine that employees’ principal uses and favorite features of Goods were:

Audit protocol buffers

Protocol Buffers is a serialization format with an interface description language developed by Google. It is widely used at Google for storing and exchanging all kinds of structured information.

Certain processes contain personal information and are a part of specific privacy policies. The audit of these protocols makes it possible to alert the owners of these data in the event of a breach of confidentiality.

Data retrieval

Engineers generate a lot of data as part of their tests and often forget where it is stored when they need to access it again. Thanks to the search engine, they can easily find it.

Understanding legacy code

It isn’t easy to find up-to-date information on code or datasets. Goods maintains graphs that engineers can use to trace previous code executions as well as dataset inputs and outputs, and to find the logic that links them.

Use of bookmarks

The bookmark system for dataset pages is fully integrated, making it possible to find important information quickly and share it easily.

Use of annotations

It’s possible to annotate datasets and assign them different degrees of confidentiality, so that others at Google can better understand the data in front of them.

With Goods, Google succeeds in prioritizing data and unifying access to it for all their teams. The system is meant to be non-intrusive: it operates continuously and invisibly for users in order to provide them with organized and explicit data. Thanks to this, the company improves team performance and avoids redundancy, saving resources and accelerating access to the data essential to the company’s growth and development.

[1] Moderator’s blog: https://www.blogdumoderateur.com/chiffres-google/
[2] Web Rank Info: https://www.webrankinfo.com/dossiers/google/chiffres-cles
[3] https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/45390.pdf

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


Metacat: Netflix makes their Big Data accessible and useful


by Zeenea Software | Mar 29, 2019 | Data inspiration

Like many companies, Netflix has a colossal amount of data coming from many different sources in various formats. As the leading streaming video on demand (SVOD) company, data exploitation is, of course, a major strategic asset. Given the diversity of its data sources, the streaming platform wanted a way to federate and interact with these assets using a single tool. This led to the creation of Metacat.

This article explains the motivations behind the creation of Metacat, a metadata solution intended to facilitate the discovery, treatment, and management of Netflix’s data.

Read our previous articles on Google and AirBnB.

 

Netflix’s key figures

Netflix has come a long way since its beginnings as a DVD rental company in the 1990s. Video consumption on Netflix accounts for 15% of global internet traffic. But Netflix today is also:

 

  • 130 million paying subscribers worldwide (400% increase since 2011)
  • $10 billion turnover, including $403 million in profits
  • $100 billion market capitalization, or the sum of all the leading television groups in Europe
  • $6 billion investment in original creations (TV shows and movies).

Netflix also runs a 60-petabyte data warehouse (60 million billion bytes), and exploiting and federating this data is a real challenge for the firm.

 

Netflix’s Big Data platform architecture

 


Its basic architecture includes three key services. These are the Execution Service (Genie), the Metadata Service (Metacat), and the Event Service (Microbot).

 


Metacat was born out of the need to operate across different languages and data sources that are not very compatible with one another. The tool acts as a data and metadata access layer over Netflix’s data sources: a centralized service, accessible to any data user, that facilitates data discovery, processing, and management.

 

Metacat & its features

Netflix uses query engines such as Hive, Pig, and Spark that are not interoperable. By introducing a common abstraction layer, Netflix can provide data access to its users regardless of the storage systems involved.

In addition, Metacat even simplifies moving a dataset from one datastore to another.
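Netflix’s implementation is written in Java and is far richer; purely as an illustration of the abstraction-layer idea described above, here is a minimal Python sketch in which one service routes requests to several underlying metastores. The store names, methods and returned fields are assumptions.

```python
from abc import ABC, abstractmethod

class MetadataStore(ABC):
    """Common interface a Metacat-style layer exposes over heterogeneous stores."""
    @abstractmethod
    def get_table(self, name: str) -> dict: ...

class HiveStore(MetadataStore):
    def get_table(self, name):
        return {"name": name, "source": "hive", "columns": ["user_id", "title"]}

class S3Store(MetadataStore):
    def get_table(self, name):
        return {"name": name, "source": "s3", "columns": ["event", "ts"]}

class MetadataService:
    """Routes a qualified name ('hive/plays') to the right backend, so callers
    never deal with storage-specific clients."""
    def __init__(self, stores):
        self.stores = stores

    def get_table(self, qualified_name):
        store, _, table = qualified_name.partition("/")
        return self.stores[store].get_table(table)

service = MetadataService({"hive": HiveStore(), "s3": S3Store()})
print(service.get_table("hive/plays"))
```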

 

Business metadata

Hand-written, user-defined, business-oriented metadata in free format can be added via Metacat. This information typically includes the connections, configurations, metrics, and life cycle of each dataset.

Data discovery

By creating Metacat, Netflix makes it easy for consumers to find business datasets. The tool publishes the schema and the business metadata defined by its users to Elasticsearch, enabling full-text search across its data sources.
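Netflix does not publish the indexing code; as a hedged sketch only, the idea can be illustrated with Elasticsearch’s plain REST document API, where the endpoint URL, index name and document shape are assumptions.

```python
import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "catalog-tables"           # assumed index name

def publish_table_metadata(table_name, schema, business_metadata):
    """Index one table's schema and free-form business metadata so that it
    becomes searchable as full text."""
    document = {"table": table_name, "columns": schema, "business_metadata": business_metadata}
    response = requests.put(f"{ES_URL}/{INDEX}/_doc/{table_name}", json=document, timeout=10)
    response.raise_for_status()
    return response.json()

# e.g. publish_table_metadata("plays", ["user_id", "title", "ts"],
#                             {"owner": "content-analytics", "lifecycle": "daily"})
```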

Data modification and audit

As a cross-functional tool for all data stores, Metacat registers and notifies all changes made to the metadata and the data itself from its storage systems.

 

Metacat and the future of Netflix

According to Netflix, the current version of Metacat is a step towards the new features they are working on. They still want to add versioning of their metadata, which would be very useful for providing table history and for restoration purposes.

According to Netflix, Metacat should also have a plug-in architecture so that the tool can validate and maintain all of its metadata. Because users define this metadata in free form, Netflix needs to put in place a validation process that runs before the metadata is stored.

As a centralizing tool for multi-source and multi-format data, Netflix’s Metacat has clearly made progress.

This in-house service has been adapted to all the tools used by the company, helping Netflix become data-driven.

 

Sources

  • Metacat: Making Big Data Discoverable and Meaningful at Netflix https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
  • La folie Netflix en cinq chiffres https://www.lesechos.fr/tech-medias/medias/la-folie-netflix-en-cinq-chiffres-1132022

 

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.


Metadata management: a trending topic in the data community


by Zeenea Software | Mar 28, 2019 | Data governance, Data inspiration

On the 4th, 5th and 6th of March, Zeenea had the opportunity to attend the famous Data & Analytics Summit in London organized by Gartner. This is an indispensable and inspiring event for Chief Data Officers and their teams in the implementation of their data strategy.

This article outlines many concepts from the conference: “Metadata Management is a Must-Have Discipline” by Alan Dayley, Gartner Analyst. This subject has attracted the attention of many C-Levels, confirming that metadata management is a top priority for the years, even months, to come.

 

The concept of metadata applied to our daily lives

To introduce the concept of metadata, the speaker made an analogy to a situation that is known to all of us and that is becoming more and more important in our daily lives: to identify and select what we eat.

Take the example of a meal composed of many different ingredients that have been significantly modified. It’s thanks to the different labels, pricing schemes, and descriptions on a product’s packaging that consumers are able to identify what they have on their plates.

This information is what we call, metadata!

How do metadata bring value to an enterprise?

Applying metadata to data allows the enterprise to contextualize its data assets. Metadata addresses different subjects, gathered into four categories: Data Trust, Regulations & Privacy, Data Security, and Data Quality.

The implementation of a metadata management strategy depends on finding the balance between the identified business needs within the company and the regulations associated with data risks.

In other words, where should you invest your time and money? Should you democratize data access for your data teams (data scientists, data engineers, data analysts and other data experts) to increase productivity, or concentrate on the demands of regulations such as the GDPR to avoid a hefty fine?

The answer to these questions is specific to each enterprise. Nevertheless, Alan Dayley highlights four use cases, identified as top priority cases by CDOs, where metadata management should be the key:

 

1. Data governance

In this particular use case, the speaker confirms that data governance can no longer be thought of in a “top-down” manner. Data crosses different teams and profiles with distinct roles and responsibilities. In light of this, everyone must work together to document and complete the information about their data (its uses, its origin, its processing, etc.). Contextualizing data is fundamental to establishing effective and easy data governance!

 

2. Risk management and compliance

The requirements below have been enforced since the arrival of the GDPR. Enterprises and their CDOs must:

  • Define the responsibilities linked to their data sets.
  • Map their data sets.
  • Understand and identify the processing operations on the data and associated risks.
  • Have a processing and/or a data lineage register.

3. Data analysis

By addressing data governance in a more collaborative way and by favoring interactions between data users, the enterprise benefits from collective intelligence and continuous improvement in the understanding and analysis of a dataset. In other words, it means capturing previous discoveries and experiments as pertinent information for the next data users.

 

4. Data value

In the quest for data monetization, data will have no value, so to speak, unless the information around it is:

  • measured: by its quality, its economic characteristics, etc.
  • managed: the persons in charge, documentation provided, its updates, etc.

 

How to establish metadata management?

No matter your enterprise’s objectives, you can not reach them without metadata management. Therefore, the answer to those questions is indeed metadata!

Our recommendations to be able to undertake this exercise would be to:

  • Hire the right sponsor that values a metadata-centric approach in the enterprise.
  • Identify the main use case that you want to treat first (as defined above).
  • Check that the efforts made in terms of metadata are not isolated but are centralized and unified.
  • Select a key metadata management solution on the market, such as a data catalog.
  • Define where, who, and how you will start.

To conclude this article, not having metadata management is like driving on a road with no signs. Be careful not to get lost!

Start metadata management

How Big Data & Machine Learning contributed to Zalando’s success


by Zeenea Software | Mar 21, 2019 | Data governance, Data inspiration

For the second year in a row, Zeenea participated in Big Data Paris as a sponsor this past 11th and 12th of March to present its data catalog.

During the event, we were able to attend many different conferences presented by professionals in the data field: chief data officers, business analysts, data science managers, etc.

Among those conferences, we had the opportunity to attend the Zalando conference, presented by Kshitij Kumar, VP Data Infrastructure.

 

Zalando: the biggest eCommerce platform in Europe

With more than 2,000 different brands and 300,000 items available, the German online fashion platform has won over 24 million active users in 17 European countries since its creation in 2008 [1].

In 2018, Zalando earned about €5.4 billion in revenue: a 20% increase over 2017 [2]!

With these positive results, Zalando has high hopes for the future. Their objective is to become the reference in fashion:

“We want to become an essential element to the lives of our customers. Only a handful of apps make it to being part of a customer’s life such as Netflix for television or Spotify for music. We aim to be this one fashion destination where the customer can fulfil all of their fashion needs. [3]”

explains David Schneider, co-CEO of Zalando.

But how was Zalando able to become so successful in such little time? According to Kshitij Kumar, it is a question of data.

Zalando on the importance of being a data-driven enterprise

“Everything is based on data,” stated Kshitij Kumar during his conference at Big Data Paris this past March. For 20 minutes, he explained that everything must revolve around data: business intelligence and machine learning are built on the company’s data.

With more than 2,000 technical employees, Zalando has built its Big Data infrastructure around several areas:

 

Data Governance

In response to the GDPR, the VP Data Infrastructure explains the importance of establishing data governance with the help of a data catalog: “It is essential to an organization in order to have safe and secure data.”

 

A machine learning platform

A machine learning platform can only be efficient if data is explored, worked on, curated, and observed.

 

Business intelligence

Business intelligence becomes proactive by putting in place visual KPIs and trusted datasets.

 

Zalando’s Machine Learning evolution

Kshitij Kumar reminds us that with Machine Learning, it is possible to collect data in real time.

In the online fashion industry, there are many use cases: size recommendation, search experience, discounts, delivery time, etc.

Interesting questions were then brought up: how can you know exactly what a customer's taste is? How can you know exactly what they might want?

Kumar answers that it comes down to repeatedly testing your data:

“Data needs to be first explored, then trained, deployed and monitored in order for it to be qualified. The most important step is the monitoring process. If it is not successful, then you must start the machine learning process again until it is.”
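
As a rough illustration of the explore, train, deploy, and monitor loop Kumar describes, here is a minimal Python sketch. Every function here is a hypothetical placeholder, not Zalando's actual pipeline.

import random

# Hypothetical stand-ins for the real pipeline stages; all names are illustrative only.
def explore(dataset):
    # Exploration: keep only the usable records
    return [row for row in dataset if row is not None]

def train_model(features):
    # Training: here, a trivial placeholder "model"
    return {"n_samples": len(features)}

def deploy(model):
    # Deployment: expose the trained model
    return model

def monitor(deployed_model):
    # Monitoring: return a quality score between 0 and 1
    return random.random()

def qualify(dataset, quality_threshold=0.8):
    """Repeat explore -> train -> deploy -> monitor until monitoring succeeds."""
    while True:
        model = train_model(explore(dataset))
        score = monitor(deploy(model))
        if score >= quality_threshold:
            return model  # the data (and model) can now be considered qualified
        # otherwise, start the machine learning process again

print(qualify([1, 2, None, 3]))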

Another asset in Zalando's data strategy is its return policy: customers have 100 days to send their items back. Thanks to these returns, Zalando can gather data and therefore better target its clients.

 

Zalando’s future

Kshitij Kumar tells us that by 2020, he hopes to have a more evolved data infrastructure:

“In 2020, I envision Zalando having a software or program that allows any user to search, identify and understand data. The first step in being able to centralize your data is having a data catalog, for example. With this, our data community can grow through internal and external (vendor) communication.”

 

Sources

[1] “L’allemand Zalando veut habiller l’Europe” – JDD, Oct. 18, 2018, https://www.lejdd.fr/Economie/lallemand-zalando-veuthabiller-leurope-3779498.

[2] “Zalando veut devenir la référence dans le domaine de la mode” – Gondola, Mar. 1, 2019, http://www.gondola.be/fr/news/non-food/zalando-veut-devenir-la-reference-dans-le-domaine-de-la-mode.

[3] “Zalando Back in Style as It Bids to Be Netflix of Fashion” – The New York Times, Feb. 28, 2019, https://www.nytimes.com/reuters/2019/02/28/business/28reuters-zalando-results.html.


McDonald’s France: How Big Data changed the firm

McDonald’s France: How Big Data changed the firm

by Zeenea Software | Mar 18, 2019 | Data inspiration

As a sponsor of the 2019 “Big Data Paris” trade show, Zeenea had the opportunity to attend many conferences, such as those by BCG, Zalando, the French Ministry of Armed Forces, etc.

Among these conferences, we were also able to attend the one about the fast-food giant McDonald’s France. Presented by Romain Girard, Business Insight Director at McDonald’s France, and Thibault Labarre, Senior Manager at Ekimetrics, the talk enlightened us on how McDonald’s France uses big data to better know its consumers.

 

McDonald’s objectives and the challenges they face

In France, McDonald’s has over 1,450 restaurants and about 2 million customers a day! Clearly, handling data is a very complex task for a company of this size. Romain Girard explains:

“With that many clients a day, it is important for us to be able to distinguish the different customer profiles. To do this we use Big Data.”

The food industry is a very competitive environment, with new players appearing every day. In France, the arrival of fast-food chains such as Burger King or O’Tacos, or even the establishment of eating areas in supermarkets like Franprix or Carrefour Market, gives consumers a lot of power by offering a wide range of restaurants to choose from.

McDonald’s objective is to be number one in the fast-food industry. However, with new eating habits (vegetarianism, veganism), new food delivery channels (platforms such as Uber Eats, Deliveroo, etc.) as well as digitalization (websites such as La Fourchette), it is more and more difficult to face the competition. McDonald’s France’s answer is found in Big Data, using customer segmentation to deliver more innovative and personalized offers.

How does McDonald’s France distinguish their different consumer profiles?

In order to distinguish their different consumer profiles, McDonald’s France uses their receipts! Receipts hold a lot of valuable information to better know and understand customers: the time of day of the order, the number of items bought, whether the order was consumed on site or to go, etc.

Thibault Labarre explains:

“In order for us to distinguish these different profiles, we exploit the data to create a data ecosystem and then cross-reference them: how many of our clients order takeout? How many clients come alone? And at what time of day? Etc.”
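
To make the idea concrete, here is a minimal Python sketch of this kind of cross-referencing of receipt attributes into coarse profiles. The fields, time slots, and thresholds are purely illustrative assumptions, not McDonald’s actual data model.

from collections import Counter
from dataclasses import dataclass

# Hypothetical, simplified receipt record; the real data model is of course richer.
@dataclass
class Receipt:
    hour: int       # time of day of the order
    items: int      # number of items bought
    takeout: bool   # consumed to go (True) or on site (False)

def segment(receipts):
    """Cross-reference receipt attributes into coarse customer profiles."""
    profiles = Counter()
    for r in receipts:
        time_slot = "lunch" if 11 <= r.hour <= 14 else "dinner" if 18 <= r.hour <= 22 else "off-peak"
        party = "solo" if r.items <= 2 else "group"
        channel = "takeout" if r.takeout else "on-site"
        profiles[(time_slot, party, channel)] += 1
    return profiles

# Example: how many solo lunch takeout orders are in this tiny sample?
counts = segment([Receipt(12, 1, True), Receipt(12, 2, True), Receipt(20, 6, False)])
print(counts[("lunch", "solo", "takeout")])  # -> 2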

Romain Girard adds that it is “important to create an acculturation between the data teams and the business teams in order to establish a common pedagogy.”

To establish this common pedagogy, McDonald’s France uses a simple and easy-to-use dashboard so that any user can understand the company’s data.

“It’s by working in an agile manner that our teams can communicate around our data efficiently. Therefore, our data strategy was put into place in only three months! It’s strange to say, but we set up a “start-up” way of working in order to quickly test and learn about our data.” explains Thibault Labarre.

 

What’s next for McDonald’s?

McDonald’s is far from finished with Big Data; the fast-food giant confirms that in just a few months, their communication strategy will be very different.

“Of course we cannot reveal too much information, but just know that our communication will be more centered around the client than the actual product itself,” states Romain Girard. “‘Come as you are’ (McDonald’s France’s slogan, ‘Venez comme vous êtes’) speaks directly to the customer, and that is exactly what we want to do.”


Data portal, the data centric tool of AirBnB

Data portal, the data centric tool of AirBnB

by Zeenea Software | Feb 18, 2019 | Data inspiration

AirBnB is a burgeoning enterprise. To keep pace with their rapid expansion, AirBnB needed to think seriously about data and about scaling its operations. The Data Portal was born from this momentum: a fully data-centric tool at the disposal of employees.

This article is the first of a series dedicated to data-centric enterprises. We will shed light on successful examples of the democratization and mastery of data within inspiring organizations. These pioneering enterprises demonstrate the ambition of Zeenea’s data catalog: to help each organization better understand and use its data assets.

AirBnB today

In just a few years, AirBnB has secured its position as a worldwide leader of the collaborative economy. Today it ranks among the top hoteliers on the planet. In numbers [1], AirBnB represents:

  • 3 million recorded homes,
  • 65,000 registered cities,
  • 190 countries with AirBnB offers,
  • 150 million users.

France is its second largest market behind the United States, accounting on its own for more than 300,000 homes.

The reflections that led to the Data Portal

During a conference held in May 2017, John Bodley, a data engineer at AirBnB, outlined new issues arising from the rapid growth in headcount (more than 3,500 employees) and the massive increase in the amount of data, coming from both users and employees (more than 200,000 tables in their data warehouse). The result is a confusing and fragmented landscape that does not always provide access to increasingly important information.

How do you reconcile this success with a very real data management problem? What should be done with all this information collected daily, and with this knowledge held at both the user and employee level? How can it be turned into a strength for all AirBnB employees?

These are the questions that led to the creation of the Data Portal.

Beyond these challenges, the company also faced a problem of overall vision.

Since its creation in 2008, AirBnB has always paid great attention to its data and how it is used. This is why a dedicated team set out to develop a tool that democratizes data access within the enterprise. Their work builds both on analysts’ knowledge and ability to identify the critical points, and on engineers who bring a more concrete vision of the whole. At the heart of the project, an in-depth survey of employees and their problems was conducted.

From this survey, one constant emerged: the difficulty of finding the information employees need in order to work. The presence of tribal knowledge, held by a small group of people, is both counter-productive and unreliable.

The result: employees had to ask colleagues for answers, trust in the information was low (uncertain validity, no way to know whether the data was up to date) and, consequently, new but duplicate data was created, dramatically inflating the already existing volume.

To respond to these challenges, AirBnB created the Data Portal and presented it publicly in 2017.

Data Portal, Airbnb’s data catalog

To give you a clear picture, the Data Portal could be defined as a cross between a search engine and a social network.

It was designed to centralize absolutely all the data the enterprise collects, whether it comes from employees or from users. The goal of the Data Portal is to return this information, in graphical form, to whichever employee needs it.

This self-service system allows employees to access the information they need for their projects on their own. Beyond the data itself, the Data Portal provides contextualized metadata: the information comes with a background that helps users get more value out of the data and understand it as a whole.

The Data Portal was designed with a collaborative approach.

It therefore helps visualize the interactions between the enterprise’s different employees and its data: it is possible to know who is connected to which data.

The Data Portal and a few of its features

The Data Portal offers different features to access data in a simple and engaging way, giving the user an optimal experience. Each data set has a dedicated page, along with a significant amount of metadata linked to it.

 

  • Search: Chris Williams, an engineer and member of the team in charge of developing the tool, speaks of a “Google-esque” feature. The search page allows you to quickly access data, charts, and also the people, groups, or teams behind the data.
  • Collaboration: In an all-in-one sharing approach, and to make the tool truly collaborative, data can be added to a user’s favorites, pinned on a team’s board, or shared via an external link. Just like a social network, each employee also has a profile page. As the tool is accessible to all employees and intended to be completely transparent, it includes every member of the hierarchy; former employees keep a profile listing all the data they created and used, always with the aim of de-siloing information and doing away with tribal knowledge.

  • Lineage: It is also possible to explore a data set’s hierarchy by viewing both its parent and child data (a minimal sketch of this idea follows the list below).

  • Groups: Teams spend a lot of time discussing the same data. To let everyone share information more quickly and easily, the ability to create working groups was added to the Data Portal. Thanks to these pages, a team’s members can organize their data, access it easily, and encourage sharing.
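
To illustrate the lineage feature in particular, here is a minimal Python sketch of parent/child traversal over a toy lineage graph. The table names and structure are purely illustrative, not AirBnB’s actual implementation (which, per the GraphConnect talk cited in the sources, relies on a graph database).

# Toy lineage graph: each table maps to the tables derived from it (its children).
# Purely illustrative data; not AirBnB's actual model.
LINEAGE = {
    "raw.bookings": ["core.bookings_cleaned"],
    "core.bookings_cleaned": ["reporting.bookings_daily", "ml.booking_features"],
    "reporting.bookings_daily": [],
    "ml.booking_features": [],
}

def children(table, graph=LINEAGE):
    """All downstream (child) tables, recursively."""
    out = []
    for child in graph.get(table, []):
        out.append(child)
        out.extend(children(child, graph))
    return out

def parents(table, graph=LINEAGE):
    """All upstream (parent) tables, recursively."""
    out = []
    for parent, kids in graph.items():
        if table in kids:
            out.append(parent)
            out.extend(parents(parent, graph))
    return out

print(children("raw.bookings"))             # downstream of the raw table
print(parents("reporting.bookings_daily"))  # upstream of a reporting table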

Within the tool

 

Democratizing data has several virtues. First, it avoids creating dependence on a few holders of information. Such a gatekeeping setup weakens the enterprise’s equilibrium: if the information and the understanding of data are held by only one group of people, the level of dependency becomes too high.

In addition, it is important to simplify the understanding of data so that employees can make better use of it.

More broadly, the challenge for AirBnB is also to improve trust in data for all employees, so that everyone can be sure they are working with correct, up-to-date information.

AirBnB is under no illusions, and the team behind the Data Portal knows that adopting this tool and using it wisely will take time. Chris Williams put it this way: “Even if asking a colleague for information is easy, it is totally counterproductive on a larger scale.”

Changing these habits and taking the first step of consulting the portal rather than asking colleagues directly will require a little effort from employees.

The vision of the Data Portal over time

To promote trust in the supplied data, the team wants to create a system of data certification. It would make it possible to certify both the data and the person who initiated the certification. Certified content will be highlighted in the search results.

Over time, AirBnB hopes to develop this tool at different levels:

  • Network analysis, in order to identify obsolete data.

  • Alerts and recommendations. Still with an exploratory approach, the tool could become more intuitive, suggesting new content or updates on data a user has already accessed.

  • Making data enjoyable, by creating an appealing experience for employees: presenting, for example, the most viewed chart of the month.

With the Data Portal, AirBnB pushes the use of data to the highest level.

Democratizing data for all employees makes them more autonomous and efficient in their work and also reshapes the enterprise’s hierarchy. With greater transparency, the company also becomes less dependent on a handful of experts: collaboration takes precedence over the notion of dedicated services, and the use of data reinforces the enterprise’s strategy for its future development. It is a logical approach, one that AirBnB also promotes among its own customers.

Sources

[1] https://www.usine-digitale.fr/article/le-succes-insolent-d-airbnb-en-5-chiffres-cles.N512814

[2] Slides from the “Democratizing Data at AirBnB” conference (GraphConnect Europe, May 11, 2017): https://www.slideshare.net/neo4j/graphconnect-europe-2017-democratizing-data-at-airbnb

https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770

https://searchcio.techtarget.com/feature/Airbnb-capitalizes-on-nearly-decade-long-push-to-democratize-data

Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”

Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how these solutions helped them become data-driven.

download our white paper
