Data engineering and MLOps: the data-first approach to machine learning
You’ve probably heard this one before: data science is 80% data preparation and cleansing, and 20% generating actual insights. Curiously, most data science teams still focus on developing better algorithms and improving models rather than on improving data quality. If you’re looking to operationalize and industrialize AI, however, shifting your focus to data engineering is a must: that’s where the real accuracy gains are to be found.
The idea of shifting from a model-centric approach to a data-centric approach comes from Andrew Ng, co-founder of Google Brain and one of the world’s most influential computer scientists. He calls this approach ‘data-centric AI’ and places it at the heart of MLOps: building and deploying machine learning models more systematically by improving not just the code, model or algorithm, but the data as well. In other words: it places a stronger emphasis on data engineering.
What is data engineering?
In many ways, data engineering is the so-called ‘boring part’ of data science. Its main goal is to ensure that the input data adheres to certain quality standards and is presented in a structured way. The data engineer and data scientist are sometimes two different people, but they don’t have to be. At delaware, for example, data engineering and data science tasks are often handled by the same people, as this allows them to keep a holistic overview of the entire process.
Many organizations underestimate the time and resources they should invest in data engineering. By now, most teams can deploy a new machine-learning model on an existing data set in a matter of hours. But properly assessing, cleaning, structuring and preparing the data itself can take days, weeks or even months – and it’s a process that’s never really finished.
The growing importance of data quality
As mentioned above, the primary focus of machine learning development today is improving the model while keeping the data set fixed. As a result, competition mainly centers on who has the most advanced model, since that determines who makes the most accurate predictions. Once machine-learning models move past the proof-of-concept phase and are operationalized, however, improving the algorithm yields only incremental accuracy gains. At that stage, improving the data set itself makes a lot more sense.
As AI advances, the importance of data quality only increases. This is especially true in business processes and in continuous processes. In the real world, data is never entirely clean. Something is always going on: there are gaps, a sensor isn’t calibrated correctly, and so on. A data engineer’s work is therefore never quite finished.
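To make that concrete, here’s a minimal sketch of the kind of routine clean-up this involves, assuming time-series sensor data in a pandas DataFrame. The column names, the 5-minute sampling interval and the valid temperature range are illustrative assumptions, not part of any specific framework:

```python
import pandas as pd

def clean_sensor_readings(df: pd.DataFrame) -> pd.DataFrame:
    """Surface gaps and implausible readings in raw sensor data."""
    df = df.set_index("timestamp").sort_index()
    # Resample to the expected interval so missing readings become explicit gaps
    df = df.resample("5min").mean()
    # Flag physically implausible values, e.g. from a miscalibrated sensor
    implausible = ~df["temperature"].between(-40.0, 125.0)
    df.loc[implausible, "temperature"] = float("nan")
    # Interpolate short gaps only; longer outages deserve investigation, not silent imputation
    df["temperature"] = df["temperature"].interpolate(limit=3)
    return df
```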
This shift of focus towards data quality not only benefits existing models but also changes the way data-driven innovation works. Reducing the dependency on ‘big’ data lowers the barrier to entry and thus increases innovation agility. In other words, it provides more flexibility in exploring new ways forward. As an added bonus, it makes your models easier to manage, explain and maintain: they become more resilient and more robust, which in turn drives user trust and thus adoption.
Achieving data quality and consistency
If all of that sounds like too much effort, we’ve got some good news for you: there are various solutions that can help make data engineering easier and prevent you from having to reinvent the wheel every time. At delaware, we use frameworks that include data quality screening rules. These scan the data to see if it adheres to certain pre-defined quality standards, including those pertaining to format, acceptable value ranges, business logic, and much more.
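We can’t reproduce the framework itself here, but a minimal sketch gives a feel for how such screening rules work: each rule is a named check over the data set, and the screen simply reports which rules fail. The rule names and columns below are illustrative assumptions, not an actual rule catalogue:

```python
import pandas as pd

# Each rule is a named predicate over the data set (illustrative examples only)
RULES = {
    "order_id is unique": lambda df: df["order_id"].is_unique,
    "amount in valid range": lambda df: df["amount"].between(0, 100_000).all(),
    "currency is a 3-letter code": lambda df: df["currency"].str.fullmatch(r"[A-Z]{3}").all(),
    # Business logic: shipped orders must carry a ship date
    "shipped orders have a ship date": lambda df: df.loc[df["status"] == "shipped", "ship_date"].notna().all(),
}

def screen(df: pd.DataFrame) -> list[str]:
    """Return the names of all screening rules the data set violates."""
    return [name for name, check in RULES.items() if not check(df)]
```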
What’s more, these frameworks aren’t just useful for traditional, structured data: they work for unstructured data too. In quality control, for example, computer vision can help you identify defective products.
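As a rough illustration only: the sketch below assumes a binary defect classifier that was trained elsewhere and saved to a hypothetical file called defect_classifier.pt, with standard ImageNet-style preprocessing. None of these names come from an actual project:

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet-style preprocessing; adjust to whatever the classifier was trained with
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical model file: a binary classifier with a single output logit
model = torch.load("defect_classifier.pt")
model.eval()

def is_defective(image_path: str, threshold: float = 0.5) -> bool:
    batch = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        prob = torch.sigmoid(model(batch)).item()
    return prob > threshold
```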
How to deploy machine learning efficiently: 6 rules of thumb
To help companies deploy machine learning efficiently – even in a context where data points are scarce – Andrew Ng has condensed his broader vision of MLOps into six rules of thumb:
- The main objective of MLOps is making high-quality data available.
- Consistency is key when labeling data (see the sketch after this list).
- Systematic improvement of data quality trumps the use of state-of-the-art models.
- Go for a data-centric approach when errors occur during training.
- A data-centric view leaves room for improvement in small datasets.
- The smaller the data set, the more important the tools and services that promote data quality become.
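On the second rule: one concrete way to check labeling consistency is to have two annotators label the same sample and measure their agreement, for instance with Cohen’s kappa from scikit-learn. The labels below are made up for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same six products (made-up example data)
annotator_a = ["defect", "ok", "ok", "defect", "ok", "defect"]
annotator_b = ["defect", "ok", "defect", "defect", "ok", "ok"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
# Low agreement points to ambiguous labeling guidelines: fix those first,
# before touching the model.
```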
AI isn’t just about developing the fanciest technology and letting it do the work for you. At delaware, we handle the entire process, from preparing the data to operationalizing the technology and embedding it in your long-term strategy.