A guide to MLOps for data scientists: Part 2
Originally posted on Kaskada’s “Machine Learning Insights” blog here.
In part 1 of this series we talked about the continuous ML lifecycle and what it means for data scientists to adopt MLOps. You’ll adopt new tools, enjoy increased transparency, and implement new processes and potentially new team structures. Looking at the entire lifecycle across multiple teams and introducing new tooling to instrument it sounds like a massive undertaking, and likely someone else’s job. However, all it takes to start down the path towards an MLOps future is you, an individual data scientist, centering the pain of data scientists and of the people who are subject to our ML algorithms.
For me, data science is about solving complex problems with machine learning algorithms that make a positive impact on the subjects of our predictions. A salient example: when using AI on COVID-19 data, it’s not enough to build a model that is supposed to minimize the cost to our healthcare system. I would also need to show how the model will affect vulnerable communities and continue to protect them.
The problem is that DevOps engineers and software engineers often incorrectly assume that the end goal is to “make machine learning act more like computer science,” as noted in Google’s recent blog on tooling for MLOps. We might assume they already know the primary needs for governance, automation and monitoring of services because DevOps has been a thing for 10+ years. However, the lifecycle of an ML model is different from that of other services or applications your team has deployed before. Oftentimes, this means requirements are overlooked upstream and downstream of feature engineering and model building. You’ll need the ability to audit, monitor and maintain both the data that feeds your features and the impact your models are making on the subjects of their predictions in production.
Areas for MLOps tooling
You know the stat that is often quoted: 80% of a data scientist’s time is spent on highly manual tasks, with very little time left for the data science part of the job. As your team begins to instrument the ML lifecycle, it’s an opportunity not only to automate, but also to improve your ability to solve complex problems.
Let’s take a look at the many different areas that could use tooling to support the ML lifecycle end-to-end:
- Ingesting and cleaning raw data sources
- Implementing governance and auditing
- Providing an environment to develop, share and collaborate on features
- Exporting training, test and validation data sets
- Tracking experiments, runs, hyperparameters, features, artifacts, etc.
- Testing ML models and features for performance, accuracy, and impact
- Releasing models and feature vectors as services
- Tracking lineage, model versions and performance
- Deploying batch scoring, real time serving, containers and cloud inference services
- Updating models in production as they inevitably go stale
As the data scientist who is engineering features and ML models, you’ll obviously be responsible for articulating the tool requirements for feature engineering and model iteration. It’s tempting to split up the remaining areas among various roles across your company with the goal of getting existing models to production faster. However, this approach loses an important set of data science requirements. Data scientists have unique training that allows us to articulate requirements for how to complete the feedback loop: measuring impact over time on the subjects of your models.
This may sound like a massive undertaking. Instead of writing granular requirements for every area and surveying all the tools on the market, write down your high-level needs.
How to center data science
You are a data scientist. What do you need? Frame things like a product manager would and articulate scenarios that enable selecting the right tools. Going back to my COVID-19 healthcare example, one thing I need is the ability to test and validate the impact of my models across demographics, as well as to continuously audit them to ensure positive impact.
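To make a need like "validate impact across demographics" concrete enough to hand to a tooling team, it can help to sketch the report you would want to see. The snippet below is a minimal illustration only: the group labels, the records and the helper names (`rates_by_group`, `max_gap`) are all hypothetical, and accuracy and positive-prediction rate are just two of many metrics you might choose to audit.

```python
from collections import defaultdict


def rates_by_group(records):
    """Compute accuracy and positive-prediction rate per demographic group.

    Each record is a (group, y_true, y_pred) tuple with binary labels.
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "predicted_positive": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(y_true == y_pred)
        c["predicted_positive"] += int(y_pred == 1)
    return {
        group: {
            "n": c["n"],
            "accuracy": c["correct"] / c["n"],
            "positive_rate": c["predicted_positive"] / c["n"],
        }
        for group, c in counts.items()
    }


def max_gap(report, metric):
    """Largest difference in a metric between any two groups."""
    values = [row[metric] for row in report.values()]
    return max(values) - min(values)


# Made-up records purely for illustration: (group, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 1, 1),
]

report = rates_by_group(records)
for group, row in sorted(report.items()):
    print(group, row)
print("accuracy gap:", max_gap(report, "accuracy"))
print("positive-rate gap:", max_gap(report, "positive_rate"))
```

A report like this is a requirement in disguise: it tells whoever owns the pipeline that predictions, outcomes and demographic attributes need to be joined and retained, not just logged as response times.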
This makes you a stakeholder in the integrations of platforms outside of your day-to-day work. You’ll need to negotiate the priority of these needs, provide feedback and evaluate tools that provide visibility throughout the pipeline. And, because traditional DevOps platforms do not provide the ability to measure more than memory usage and response times, you’ll need to work side by side with the teams that run them to define what is monitored and how it’s surfaced across the company.
The landscape of tools that address each of the above areas is overwhelming, and it’s not your job to keep up with the latest data stores, distributed systems or methods of continuous deployment or integration. Instead, as a data scientist, try writing down what you need in order to take on new responsibilities, and what the possible intended and unintended impacts could be on the people subject to the ML models you are building.
If you are now responsible for engineering features that can seamlessly transition from your experimental environment without a code rewrite, what does that mean for the data that feeds your environment for feature engineering? Here’s one perspective from Max Boyd, a data science lead here at Kaskada, about how data treated as facts degrades model accuracy and how event streams could solve this problem.
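One way to picture the difference between data stored as current facts and data kept as an event stream is a feature computed "as of" a point in time. The sketch below is my own illustration, not code from Max’s post: the event schema, the `purchase_count_as_of` helper and the sample events are all hypothetical. The point is only that with raw events you can recompute what a feature looked like at training time, whereas a table holding only the latest value cannot tell you that.

```python
from datetime import datetime


def purchase_count_as_of(events, user_id, as_of):
    """Count a user's purchase events that happened strictly before `as_of`."""
    return sum(
        1
        for e in events
        if e["user_id"] == user_id
        and e["type"] == "purchase"
        and e["time"] < as_of
    )


# Hypothetical event stream: each event keeps its own timestamp.
events = [
    {"user_id": "u1", "type": "purchase", "time": datetime(2020, 1, 15)},
    {"user_id": "u1", "type": "purchase", "time": datetime(2020, 2, 20)},
    {"user_id": "u1", "type": "purchase", "time": datetime(2020, 3, 5)},
    {"user_id": "u2", "type": "purchase", "time": datetime(2020, 1, 10)},
]

# The feature value the model should have seen on 1 February ...
count_feb = purchase_count_as_of(events, "u1", datetime(2020, 2, 1))
# ... versus the value a "current state" table would report on 1 April.
count_apr = purchase_count_as_of(events, "u1", datetime(2020, 4, 1))
print(count_feb, count_apr)
```

If training examples labeled in February are joined against the April value, the model learns from information it will not have at prediction time, which is one way accuracy can quietly degrade between experiment and production.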
If models are going stale, automatically retraining them might not be the answer for your business, especially if the people impacted by the models could be subject to systematic discrimination based on protected classes. But if you are taking on responsibility for updating models in production on a specific cadence or even continuously, what information would you need monitored, and what alerts do you need to call your attention to problems?
For just one data scientist to have an impact and begin adopting MLOps, the thing you can do today is begin writing about your needs and pain points. Then map those needs to every stage of the ML lifecycle and see how they can be addressed with tooling. Not everything can be solved with tooling alone, however. Next time, in part 3, we’ll talk about processes you can propose to enable MLOps at your company.
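As one possible answer to "what would I want an alert on," here is a rough sketch of a drift check that compares a feature’s distribution in production against the data the model was trained on, using the population stability index (PSI). Everything here is an assumption for illustration: the function names, the sample data, the bin count, and the 0.2 threshold, which is a commonly quoted rule of thumb rather than a standard and would need tuning for your own features.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a new sample of one numeric feature.

    Bins are equal-width over the range of the baseline; values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at a tiny value so the logarithm is defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def should_alert(psi, threshold=0.2):
    """Flag the feature for human review; the threshold is an assumption."""
    return psi > threshold


# Made-up samples: training-time values vs. shifted production values.
baseline = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, production)
print(psi_same, should_alert(psi_same))
print(psi_shifted, should_alert(psi_shifted))
```

Note that the alert here asks for a person’s attention rather than kicking off a retrain, which fits the concern above: when a model affects people, a shift in the data is a reason to look, not necessarily a reason to automate.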