Why your data science needs more Ops

Investigating a challenge: how to effortlessly migrate machine learning models to production?

The last five years have seen a tremendous increase in data science efforts. Not surprisingly: not only has the field advanced considerably in machine learning and artificial intelligence, but more and more companies also realize the value of embedding advanced data science in their organizations. As many future products and applications will have some form of data science, machine learning, or deep learning in the background, it is difficult to imagine a world without it.

However, we are not there yet. The current trend in data-driven products has mainly been driven by successful experiments and promising proofs of concept. But taking the next step and bringing these PoCs to production exposes a new challenge: how do you effortlessly migrate models and machine learning algorithms to a reliable production environment?

Unsurprisingly, roughly 20% of the time usually goes into developing the algorithm and 80% into getting it to run stably and reliably in a production environment.

Fortunately, it does not have to be this way. Let’s see what we could learn from some proven DevOps best practices that will make deployment and management of data science and Artificial Intelligence applications a lot easier in the future.

Data science vs. Software development

As a data scientist, you want to solve problems, design experiments, and create the best model for the job. You don’t want to wrap your head around things like managing cloud infrastructure, packaging, deployment procedures, and monitoring for your model. (Unless you are really into these things of course).

The data science lifecycle: Moving from experiments to a robust application, by Dutch Analytics.

Many data scientists now hand over their algorithm to a data engineering or software department to host it, or to integrate it with big data sources and orchestration frameworks. Since there are often no existing frameworks the algorithms fit into, the engineers first have to learn this new system. And even when they already know how to deploy it, there is still a handover: software engineers will look at software reliability, not at data reliability.

First, let’s get to the core of the issue. Data science is different from traditional software development for a few fundamental reasons:

  • The performance of data science models is not just dependent on their source code but, of course, also on the data. Because data science models are optimized on the data used for training them, performance issues can occur when the underlying distribution of the data changes, or when its characteristics diverge from what the model was optimized for. Once a model is operational on live data, fluctuating performance is much more prevalent, as the live data will almost certainly diverge from the training data.
  • The need for continuous optimization. Usually the development of a data science model does not stop after deployment. Both on the software level and on algorithm level improvements are very likely to happen. Models might require retraining or parameter updates as they need to move along with the changing world around us.
  • Data science code is built on experiments, not for code performance. The experimental nature of data science work results in code that is iterated upon many times to get the right outcome, not necessarily to be the most robust or performant at the code level. The code is quite often a script written to process some data and produce some results, but not made to run behind a customer-facing website or critical system.
  • The current stack used for data science and AI development is very diverse. Moreover, every data scientist has a preference for certain training frameworks or libraries.
  • Spread out model artifacts. A data science model usually consists of more than a few lines of Python code. There are often artifacts like files with trained models or parameter settings that are stored in different locations than the code itself.
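To make the artifact problem concrete, here is a minimal sketch of keeping a trained model, its parameters, and a fingerprint of its training data together in one versioned bundle. This is a hypothetical illustration, not a standard: the file names, metadata fields, and `save_bundle`/`load_bundle` helpers are assumptions for the example.

```python
import hashlib
import json
import pickle
from pathlib import Path


def save_bundle(model, params, training_data, out_dir, version):
    """Store the model, its parameters, and a training-data fingerprint side by side."""
    bundle = Path(out_dir) / f"model-v{version}"
    bundle.mkdir(parents=True, exist_ok=True)
    # Serialized model object (weights, fitted estimator, ...).
    (bundle / "model.pkl").write_bytes(pickle.dumps(model))
    # A hash that ties this model version to the exact data it was trained on.
    data_hash = hashlib.sha256(repr(training_data).encode()).hexdigest()
    meta = {"version": version, "params": params, "training_data_sha256": data_hash}
    (bundle / "metadata.json").write_text(json.dumps(meta, indent=2))
    return bundle


def load_bundle(bundle_dir):
    """Load a bundle back: model plus the metadata describing its origin."""
    bundle = Path(bundle_dir)
    model = pickle.loads((bundle / "model.pkl").read_bytes())
    meta = json.loads((bundle / "metadata.json").read_text())
    return model, meta
```

Keeping the data fingerprint next to the model is what later makes it possible to answer "which data produced this model?" in production.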

These problems seem manageable when you only have a single model running and real-time performance is not the first priority. Many proofs-of-concept start with getting value out of historical data, but need to make a transition to low latency inference on newly arriving streaming data. You can Dockerize your model and dependencies, run it on a cloud VM and keep an eye on the quality of the results. With a single application, you might get away with this, but when you scale up or when your data science model drives more business-critical processes and decisions, you will sleep a lot better when the infrastructure for serving and managing these models is automated and scalable.
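Before any Dockerizing, the model typically gets wrapped in a small serving layer. The sketch below shows the minimal shape of such a wrapper using only Python's standard library; it is an illustration under stated assumptions (the `predict` function is a stand-in for a real trained model, and a production deployment would use a proper serving framework rather than the built-in `http.server`).

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


# Hypothetical stand-in for a trained model: "predicts" double the input.
def predict(x):
    return 2.0 * x


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run it through the model.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def serve(port=0):
    """Start the model server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A wrapper like this is exactly what ends up inside the Docker image; everything around it (scaling, monitoring, rollbacks) is the part that grows painful by hand.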

Have no fear, MLOps is here!

So, what can we learn from DevOps best practices to make all the mentioned challenges disappear?

  • Automation, first of all. Developing, deploying, and monitoring data science models needs to be a reproducible process, just as it is for traditional software development. That is what DevOps is all about.
  • Version management for both models and data. As mentioned, models are trained, retrained, and optimized frequently, often on different big data sets, which evolve as well. It is therefore necessary to keep track of the different versions of models and the data sets they originate from. This also matters in a production environment, where a rollback to an earlier version may be necessary when performance is not as expected.
  • Continuous testing & monitoring. Because the performance of models depends on the quality and characteristics of (live) data, real-time monitoring is necessary to detect when the quality of the predictions starts to deviate, also called ‘drift’. Detecting this early can prevent your system from losing touch with reality. The tight interaction between model and data also requires traceability from code to output, so it can easily be discovered which data input was converted by which model version into the output in front of you.
  • Because of the large and diversified set of tools, frameworks, and libraries available to develop data science models, consistent packaging and dependency management is an issue in production environments. Docker is always a good starting point, but more and more protocols and conventions are also emerging to make data science portable between operational environments.
  • Model discovery needs to be part of the solution: a central location to keep track of all available models and their artifacts. This usually goes beyond a git repository, as not only code but also parameter files and files with trained models need to be combined. It also provides a good starting point for exchanging models or base models between teams or with third parties.
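As an illustration of the monitoring point above, a minimal drift check can compare summary statistics of live feature values against those recorded at training time. This is a sketch, not a production monitor: the choice of mean/standard deviation and the threshold of three standard deviations are assumptions, and real systems use richer tests (e.g. on full distributions).

```python
import math


def summarize(values):
    """Record baseline statistics for a feature at training time."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "std": math.sqrt(var)}


def drift_score(baseline, live_values):
    """Standardized shift of the live mean relative to the training baseline."""
    live_mean = sum(live_values) / len(live_values)
    std = baseline["std"] or 1e-12  # guard against zero variance
    return abs(live_mean - baseline["mean"]) / std


def has_drifted(baseline, live_values, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` stds from baseline."""
    return drift_score(baseline, live_values) > threshold
```

Run on every batch of incoming data, even a check this simple catches the gross distribution shifts that silently degrade model output.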

How well you can manage these points also depends on the type of solution, your team, and how quickly you want to move your solution to market, or how fast you need it in production.

The landscape of tools and services available to cover the mentioned points for migrating data science and AI to production is evolving very rapidly these days. New standards in this space will emerge to make the transition from the experimental data science phase to operations a lot smoother. More on this later…

Do you want to stay up to date with similar topics? Follow our news page!