Data life cycle analysis is an element in data management that enterprises are still struggling to implement.
Organizations at the forefront of data innovation, such as Uber, LinkedIn, Netflix, Airbnb and Lyft, have recognized the value of metadata in meeting a challenge of this magnitude.
They have therefore built their metadata management strategies around dedicated platforms. Frequently custom-built, these platforms facilitate data ingestion, indexing, search, annotation and discovery in order to maintain high-quality datasets.
The following examples highlight a shared constant: the difficulty, increased by volume and variety, of transforming business data into exploitable knowledge.
Let’s take a look at the analysis and context of these Web giants:
Every interaction on Uber’s platform, from its ride-sharing services to its food deliveries, is data-driven. Analyzing this data enables more reliable and relevant user experiences.
Uber’s key stats
- trillions of Kafka messages a day,
- hundreds of petabytes of data in HDFS in data centers,
- millions of analytical queries weekly.
However, volume alone is not enough to leverage the information the data represents; to be used effectively and efficiently in business decisions, data requires more context.
To provide additional information, Uber therefore developed “Databook”, the company’s internal platform that collects and manages metadata on internal datasets in order to transform data into knowledge.
Databook is designed to enable Uber employees to effectively explore, discover and use Uber’s data. It gives context to the data (its meaning, quality, etc.) and keeps that context maintained in one platform for the thousands of employees who want to analyze it. In short, Databook’s metadata enables data leaders to move from viewing raw data to actionable knowledge.
The article “Databook: Turning Big Data into Knowledge with Metadata at Uber” concludes that one of the biggest challenges for Databook was moving from manual metadata repository updates to automation.
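To make the move from manual updates to automation more concrete, here is a minimal sketch of automated metadata harvesting. It is not Uber’s implementation: it assumes a small SQLite database as the data source, and the function name `harvest_table_metadata` and the record fields are hypothetical, chosen only for illustration. The idea is simply that table names, columns and row counts are read from the source itself rather than typed in by hand.

```python
import sqlite3
from datetime import datetime, timezone


def harvest_table_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Collect table names, columns and row counts directly from the database."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        records.append(
            {
                "table": table,
                "columns": columns,
                "row_count": row_count,
                "harvested_at": datetime.now(timezone.utc).isoformat(),
            }
        )
    return records


# Hypothetical demo database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, city TEXT, fare REAL)")
conn.execute("INSERT INTO trips (city, fare) VALUES ('Paris', 12.5), ('Lyon', 8.0)")

catalog = harvest_table_metadata(conn)
for record in catalog:
    print(record["table"], [c["name"] for c in record["columns"]], record["row_count"])
```

Run on a schedule, a harvester of this kind keeps the metadata repository in step with the source without anyone editing entries manually.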
At a conference in May 2017, John Bodley, Data Engineer at Airbnb, outlined new issues arising from the company’s growth: a confusing, fragmented data landscape that prevented access to increasingly important information.
What can we do with all this data collected on a daily basis? How do we turn it into an asset for all Airbnb employees?
A dedicated team set out to develop a tool that would democratize access to data within the company. Their work drew both on the knowledge of the analysts and their ability to understand the critical points, and on that of the engineers, who were able to offer a more technical vision. At the heart of the project, the team interviewed employees about the issues they faced.
What emerged from these interviews was that employees struggled to find the information they needed to work, and that knowledge was still shared and held in too tribal a fashion.
To meet these challenges, Airbnb created Data Portal, a self-service metadata management platform that centralizes and shares this information.
Lyft is a ride-sharing service and is Uber’s main competitor in the North American market.
The company found that it was providing data access to its analysts inefficiently. Its thinking focused on making knowledge about its data available in order to optimize its processes. Within just a few months, its goal of creating an interface for searching data raised two major challenges:
- Productivity – Whether it’s to create a new model, instrument a new metric, or perform an ad hoc analysis, how can Lyft use this data in the most productive and efficient way possible?
- Compliance – When collecting data about an organization’s users, how can Lyft comply with increasing regulatory requirements and maintain the trust of its users?
In their article Amundsen – Lyft’s data discovery & metadata engine, Lyft states that the key does not lie in the data, but in the metadata!
For Netflix, the world leader in video streaming, data exploitation is, of course, a major strategic focus.
Given the diversity of their data sources, the video platform wanted to offer a way to federate and interact with these assets from a single tool. This search for a solution led to Metacat.
This tool acts as a layer of access to data and metadata from Netflix data sources. It allows its users to access data from any storage system through three different features:
- Adding business metadata: Free-form business metadata, whether entered by hand or defined by users, can be added via Metacat.
- Data discovery: The tool publishes schema and business metadata defined by its users in Elasticsearch, facilitating full-text search of information in data sources.
- Data Change Notification and Auditing: Metacat records and notifies all changes to metadata from storage systems.
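The three features above can be sketched with a small in-memory catalog. This is an illustration under stated assumptions, not Metacat’s API: the class `MiniCatalog` and its method names are hypothetical, and a plain substring match stands in for the full-text search that Metacat delegates to Elasticsearch.

```python
from collections import defaultdict
from typing import Callable


class MiniCatalog:
    """Hypothetical in-memory stand-in for a metadata access layer."""

    def __init__(self) -> None:
        self._metadata: dict[str, dict[str, str]] = defaultdict(dict)
        self._listeners: list[Callable[[dict], None]] = []
        self.audit_log: list[dict] = []

    def subscribe(self, listener: Callable[[dict], None]) -> None:
        """Register a callback that is invoked on every metadata change."""
        self._listeners.append(listener)

    def set_business_metadata(self, table: str, key: str, value: str) -> None:
        """Feature 1: attach free-form business metadata to a table."""
        old = self._metadata[table].get(key)
        self._metadata[table][key] = value
        # Feature 3: record the change for auditing, then notify subscribers.
        event = {"table": table, "key": key, "old": old, "new": value}
        self.audit_log.append(event)
        for listener in self._listeners:
            listener(event)

    def search(self, text: str) -> list[str]:
        """Feature 2: discovery; substring match over table names and values."""
        needle = text.lower()
        return sorted(
            table
            for table, meta in self._metadata.items()
            if needle in table.lower()
            or any(needle in value.lower() for value in meta.values())
        )


mini = MiniCatalog()
received: list[dict] = []
mini.subscribe(received.append)

mini.set_business_metadata(
    "prod.viewing_history", "description", "One row per playback session"
)
mini.set_business_metadata("prod.billing", "owner", "finance team")
mini.set_business_metadata("prod.billing", "owner", "payments team")

matches = mini.search("playback")
print(matches)
print(received[-1])
```

Keeping the audit log and the notifications inside the single write path means every metadata change is both recorded and announced, whichever part of the organization makes it.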
In its blog article, “Metacat: Making Big Data Discoverable and Meaningful at Netflix”, the firm confirms that it is far from finished working on its solution!
There are a few more features they have yet to work on to improve the data warehousing experience:
- Schema and metadata versioning to provide table history.
- Provide contextual information on tables for better data lineage.
- Add support for datastores like Elasticsearch and Kafka.
Learn more about data discovery solutions in our white paper: “Data Discovery through the eyes of Tech Giants”
Discover the various data discovery solutions developed by large Tech companies, some belonging to the famous “Big Five” or “GAFAM”, and how they helped them become data-driven.