Data-as-a-Service and data quality supporting innovation: the new Data Management platforms
The new generation of data platforms is being deployed under the motto "self-service data, for everyone, in real time". Behind the many initiatives we are seeing in the data market lies a common ambition: making business users more autonomous in their use of data. This does not mean that technology and IT are fading away. On the contrary, they have never been more present, with one major objective: data quality. A new distribution of roles and a new governance system, designed to facilitate the use of artificial intelligence, are also being put in place to ensure this quality.
The organization of professional data
The Big Data moment was characterized by its Vs (three, five, or even seven, depending on who was counting): Volume, Velocity, Variety, and so on.
If we accept that this moment is still with us and evolves over time, we can say that the Vs have not all carried the same weight at every stage of Big Data's deployment. Volume was clearly the overriding concern at first: the goal was to build platforms capable of handling volumes never seen before.
Velocity and Variety are now genuinely being taken into account, in the sense that they are finally being organized:
● Velocity thanks to Streaming platforms,
● Variety thanks to the metadata modeling possible at the Data Catalog level.
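The streaming idea behind the first point can be sketched in a few lines of Python: events are processed as they arrive, and an aggregate stays continuously up to date rather than being recomputed in batch afterwards. The event shape and amounts are invented for illustration.

```python
# Minimal sketch of stream processing: consume events one by one and
# keep a running aggregate current at all times.
# The event structure ({"amount": ...}) is an invented example.
def running_total(events):
    """Yield the cumulative amount after each incoming event."""
    total = 0.0
    for event in events:
        total += event["amount"]
        yield total

stream = [{"amount": 10.0}, {"amount": 5.0}, {"amount": 2.5}]
print(list(running_total(stream)))  # [10.0, 15.0, 17.5]
```

Real streaming platforms add partitioning, delivery guarantees and windowing on top of this basic consume-and-update loop.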
The question now is how to better organize these assets, better exploit them, and raise data quality in a more industrial context, with better access to tools for practitioners.
These two topics are particularly hot. The first is technical; the second is focused on describing information, and directly addresses the major challenge of Data-as-a-Service.
Data Catalogs are metadata directories. They describe information, identify it and locate it; through governance features, some also make it possible to trace the transformations data undergoes across its many processing steps, thus contributing to the objective of Data Quality.
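As a product-neutral illustration of these two functions (describing/locating datasets, and tracing lineage), a catalog entry and a naive lineage walk might look like this in Python; every dataset name and location below is invented:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """One catalog record: what the dataset is, where it lives, where it came from."""
    name: str
    description: str
    location: str
    upstream: list = field(default_factory=list)  # names of the source datasets

class DataCatalog:
    """Minimal metadata directory with upstream lineage tracing."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        self._entries[entry.name] = entry

    def lineage(self, name: str) -> list:
        """Walk the upstream links to list every ancestor of a dataset."""
        ancestors = []
        to_visit = list(self._entries[name].upstream)
        while to_visit:
            current = to_visit.pop()
            if current not in ancestors:
                ancestors.append(current)
                to_visit.extend(self._entries[current].upstream)
        return ancestors

catalog = DataCatalog()
catalog.register(DatasetEntry("raw_orders", "Raw order events", "s3://raw/orders"))
catalog.register(DatasetEntry("clean_orders", "Deduplicated orders", "s3://clean/orders",
                              upstream=["raw_orders"]))
catalog.register(DatasetEntry("revenue_by_region", "Aggregated revenue", "warehouse.revenue",
                              upstream=["clean_orders"]))

print(catalog.lineage("revenue_by_region"))  # ['clean_orders', 'raw_orders']
```

Commercial catalogs add search, business glossaries and automated harvesting, but the core data structure is this kind of annotated dependency graph.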
Business combinations and information system mergers are one of the drivers for adopting these catalogs, in a market that remains fragmented because of the very heterogeneous origins of its players.
Reference data (master data) also remains essential: data quality depends on clearly constructed reference data, properly used to feed the entire information system. This field, which has existed for some fifteen years, remains at the center of reflection and work, and demonstrates its full importance in a cross-functional information system strategy.
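A minimal sketch of how reference data feeds quality, assuming an invented country-code reference list and record shape: operational records are validated against the agreed reference values before they propagate through the information system.

```python
# Sketch: validating operational records against reference (master) data.
# The reference set, field names and sample records are illustrative assumptions.
REFERENCE_COUNTRIES = {"FR", "DE", "US"}  # the agreed reference code list

def validate_record(record: dict) -> list:
    """Return the list of quality issues found against the reference data."""
    issues = []
    if record.get("country") not in REFERENCE_COUNTRIES:
        issues.append(f"unknown country code: {record.get('country')!r}")
    return issues

clean = {"customer": "A1", "country": "FR"}
dirty = {"customer": "B2", "country": "France"}  # free text instead of the reference code
print(validate_record(clean))   # []
print(validate_record(dirty))   # ["unknown country code: 'France'"]
```

The point is that the check is only as good as the reference set it relies on, which is why its construction and governance matter so much.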
Data Analysis and Data Science remain leading areas of Data and are being modernized.
It is not always easy to be creative when technology constrains the implementation of our use cases too much. What analysts now want is the means to apply statistical algorithms to their data quickly and easily, to build models fast and assess their reliability immediately.
Although knowledge of Python and R remains central, the use of Data-Science-in-a-box platforms such as Tibco Data Science, Dataiku and Alteryx helps accelerate analysis. At stake are the reliability of the models and the rapid detection of correlations between variables.
If the data is readily available (notably thanks to data virtualization technologies), and if the question is no longer how to build a model (Python or Alteryx does it for you), then it becomes simpler to multiply tests to find usable models (for prediction or prescription purposes).
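This "multiply the tests" loop can be sketched with scikit-learn, one possible open-source stand-in for the platforms mentioned above; the synthetic dataset and the three candidate models are arbitrary choices for illustration:

```python
# Sketch: score several candidate models on the same data and keep the most
# reliable one, so analysts can iterate on ideas rather than on plumbing.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for readily available business data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Cross-validated accuracy as an immediate reliability estimate for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Adding a new idea to the experiment is one line in `candidates`, which is exactly the kind of cheap iteration the text describes.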
Reducing the time spent building models helps maintain teams' creativity, and the technology can support ideation sessions where it becomes easy to test every idea that comes to mind.
This also frees up time for DataViz, which remains a field in its own right and now covers more methodological and didactic ground than before. Storytelling has become a topic of its own, and it is important to know how to build a narrative around your visualizations, making your hypotheses explicit.
The data value chain remains simple, but the tools that make it up have multiplied and become more complex. In spite of the emergence of de facto standards, it is important to maintain a thorough knowledge of the tools, platforms and their positioning, in order to build efficient data acquisition and exploitation chains that are likely to cover all use cases.
In this context, several bricks currently stand out:
● Storage, where technologies are diversifying. Our recent studies of VoltDB, MongoDB and DataStax show the dynamism of this sector and the need to monitor its evolution closely,
● Streaming, for real-time feeding of data consumers,
● Data virtualization, which creates secure and up-to-date business views without creating new databases, and exposes these views as APIs, reinforcing the information system’s APIzation strategy.
These platforms are being consolidated and enriched to provide turnkey development workshops, designed to improve access performance, data governance and data quality.
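A toy illustration of the virtualization principle, with invented sources and fields: the business view is a function evaluated at query time, so no new database is materialized and the view always reflects the current state of the underlying sources.

```python
# Two live sources (e.g. a CRM and an order system); contents are invented.
customers = {"C1": {"name": "Acme", "region": "EU"}}
orders = [{"customer": "C1", "amount": 120.0}]

def customer_orders_view():
    """Business view computed on demand by joining the sources at query time."""
    return [
        {"customer": customers[o["customer"]]["name"],
         "region": customers[o["customer"]]["region"],
         "amount": o["amount"]}
        for o in orders
    ]

orders.append({"customer": "C1", "amount": 80.0})  # a source changes...
print(customer_orders_view())  # ...and the view reflects it immediately (two rows)
```

A real virtualization layer adds query push-down, caching and security on top, and would expose `customer_orders_view` behind an API rather than as a local function.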
This would be incomplete without mentioning the industrialization of data engineering chains, which requires a high degree of specialization and in-depth reflection on roles and processes, and therefore on data management governance. We addressed this subject in a white paper six years ago; it now needs a complete overhaul and will be updated by the end of the year.
To help our customers build and deploy a data strategy that serves all users, it is important to consider all of these aspects and the construction of these new-generation platforms from the perspective of a comprehensive, iterative program, federated around data quality as a transverse discipline.