The use of massive data by the internet giants in the 2000s was a wake-up call for enterprises: Big Data is a lever for growth and competitiveness that encourages innovation. Today, enterprises are reorganizing themselves around their data in order to adopt a “data-driven” approach. It is a story with several twists and turns that finally seems to be reaching a resolution.
This article traces the successive data revolutions that enterprises have undertaken over recent years in an attempt to maximize the business value of their data.
Siloed architectures
In the 80s, Information Systems developed immensely. Business applications were created, advanced programming languages emerged, and relational databases appeared. All of these applications stayed on their owners’ platforms, isolated from the rest of the IT ecosystem.
For these historical and technological reasons, an enterprise’s internal data was spread across various technologies and heterogeneous formats. On top of these organizational problems, we speak of a tribal effect: each IT department has its own tools and, implicitly, manages its own data for its own uses. A kind of data hoarding takes hold within organizations. To explain this, Conway’s law is frequently invoked: “All architecture reflects the organization that created it.” This siloed organization makes cross-referencing data from two different systems complex and costly.
The search for a centralized, comprehensive view of an enterprise’s data would lead Information Systems to a new revolution.
The concept of a Data Warehouse
By the end of the 90s, Business Intelligence was in full swing. For analytical purposes, and with the goal of answering strategic questions, the concept of the data warehouse appeared.
To build one, data is recovered from mainframes or relational databases and fed through an ETL (Extract, Transform, Load) pipeline. Projected into a so-called pivot format, the data is collected and formatted so that analysts and decision-makers can answer pre-established questions and specific lines of inquiry. From the question, we derive the data model!
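To make the flow concrete, here is a minimal ETL sketch in Python using an in-memory SQLite database. The “orders” table, its columns, and the revenue-per-region question are hypothetical, chosen only to illustrate how the question drives the model:

```python
import sqlite3

def extract(source: sqlite3.Connection) -> list[tuple]:
    # Extract: pull raw rows out of the operational source system.
    return source.execute("SELECT region, amount FROM orders").fetchall()

def transform(rows: list[tuple]) -> list[tuple]:
    # Transform: shape the raw rows to answer the pre-established
    # question ("what is the revenue per region?").
    totals: dict[str, float] = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return sorted(totals.items())

def load(warehouse: sqlite3.Connection, rows: list[tuple]) -> None:
    # Load: write the formatted result where analysts can query it.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_region (region TEXT, total REAL)"
    )
    warehouse.executemany("INSERT INTO revenue_by_region VALUES (?, ?)", rows)
    warehouse.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)",
                       [("EU", 120.0), ("US", 80.0), ("EU", 40.0)])
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    print(warehouse.execute("SELECT * FROM revenue_by_region").fetchall())
    # [('EU', 160.0), ('US', 80.0)]
```

Note how the data model (revenue_by_region) exists only because the question was asked first; a new question means a new pipeline.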
This revolution comes with its share of problems… ETL tools have a significant cost, not to mention the hardware that comes with them. And the delay between formalizing a need and receiving the corresponding report is long. It is a costly revolution for an efficiency that leaves room for improvement.
The new revolution of a data lake…
The arrival of data lakes reverses the previous reasoning. A data lake enables organizations to centralize all useful data, regardless of its source or format, at very low cost. An enterprise’s data is stored without presuming how a future use case will process it. Only when a specific use arises is the raw data selected and transformed into strategic information.
We are moving from an “a priori” to an “a posteriori” logic. This data lake revolution relies on new skills: data scientists and data engineers are able to process the data and produce results far faster than was possible with data warehouses.
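As an illustration of this “a posteriori” logic, here is a minimal schema-on-read sketch in Python: records are ingested raw, and a shape is imposed only when a specific use case selects them. The directory layout, field names, and the account-closure use case are assumptions for the example:

```python
import json
from pathlib import Path

LAKE = Path("lake/raw/events")

def ingest(event: dict) -> None:
    # Ingestion presumes nothing about future usage: store the record raw.
    LAKE.mkdir(parents=True, exist_ok=True)
    (LAKE / f"{event['id']}.json").write_text(json.dumps(event))

def account_closures() -> list[dict]:
    # A posteriori: only now, for a specific use case, do we select and
    # shape the raw data into strategic information.
    events = (json.loads(p.read_text()) for p in LAKE.glob("*.json"))
    return [e for e in events if e.get("type") == "account_closed"]

if __name__ == "__main__":
    ingest({"id": "1", "type": "login", "user": "alice"})
    ingest({"id": "2", "type": "account_closed", "user": "bob"})
    print(account_closures())  # the schema appears only at read time
```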
Another advantage of this Promised Land is its price. Often available as open source, data lake technologies are cheap, as is the hardware they run on: we often speak of commodity hardware.
… or rather a data swamp
The data lake revolution does bring real advantages, but it comes with new challenges. The expertise needed to set up and maintain these data lakes is rare, and therefore costly for enterprises. Moreover, pouring data into a data lake day after day without efficient management or organization carries a serious risk of rendering the infrastructure unusable: data inevitably gets lost in the mass.
This data management is accompanied by new issues related to data regulation (GDPR, CNIL, etc.) and data security, topics that already existed in the data warehouse world. Finding the right data for the right use is still not an easy thing to do.
The resolution: constructing Data Governance
The internet giants understood that centralizing data is the first step, but an insufficient one. The last brick needed to move towards a “data-driven” approach is constructing data governance. Innovating through data requires a deeper knowledge of that data. Where is my data stored? Who uses it? With which goal in mind? How is it being used?
To help data professionals chart and visualize the data life cycle, new tools have appeared: “Data Catalogs.” Sitting above the data infrastructure, they allow you to build a searchable metadata directory. By centralizing all the collected information, they make it possible to acquire both a business and a technical view of the data. In the same way that Google doesn’t store web pages but rather their metadata in order to reference them, companies must store their data’s metadata in order to facilitate its discovery and exploitation. Gartner confirms this in their study, “Data Catalog is the New Black”: a data lake whose data lacks metadata management and governance will be considered inefficient.
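To give an idea of what a data catalog does at its core, here is a minimal sketch in Python of a searchable metadata directory. The entry fields (location, owner, purpose, tags) are illustrative, not the schema of any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    location: str               # where is my data stored?
    owner: str                  # who uses it / is responsible for it?
    purpose: str                # with which goal in mind?
    tags: set[str] = field(default_factory=set)

class DataCatalog:
    """Indexes metadata about datasets, never the datasets themselves."""

    def __init__(self) -> None:
        self._entries: list[DatasetEntry] = []

    def register(self, entry: DatasetEntry) -> None:
        # Only metadata lands here; the data itself stays in the lake.
        self._entries.append(entry)

    def search(self, term: str) -> list[DatasetEntry]:
        # A searchable directory: find datasets by name, purpose, or tag.
        term = term.lower()
        return [e for e in self._entries
                if term in e.name.lower()
                or term in e.purpose.lower()
                or term in {t.lower() for t in e.tags}]

if __name__ == "__main__":
    catalog = DataCatalog()
    catalog.register(DatasetEntry(
        name="orders_raw", location="lake/raw/orders",
        owner="sales-IT", purpose="revenue reporting", tags={"finance"}))
    print(catalog.search("revenue"))  # found via metadata, not the data
```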
Thanks to these new tools, data becomes an asset for all employees. Their easy-to-use interfaces require no technical skills and offer a simple way to know, organize, and manage the data. The data catalog becomes the enterprise’s collaborative tool of reference.
Acquiring an all-round view of one’s data and starting a data governance effort to drive ideation thus become possible.