A guide to MLOps for data scientists: Part 4
Originally posted on Kaskada’s “Machine Learning Insights” blog here.
In parts 1, 2 and 3 of this series, we covered the ML lifecycle, discussed how to select tools to instrument it, and walked through an example of updating your processes so that people at your company can adopt MLOps. While your tools should bring data scientists into the production loop for shipping and maintaining ML, the MLOps mindset involves bringing data scientists in early to share responsibility with engineering and to maintain accurate online features. This brings us to the last part: a culture shift that will enable the full MLOps lifecycle.
Culture, generally, is made up of not just what you do, but also how and why you do it. It encompasses the values, beliefs, underlying assumptions, attitudes and behaviors that are shared by your company. Changing your culture takes more than writing out what you’re going to change; it’s the resulting behavior that comes from the written and unwritten changes.
An MLOps culture is all about a shared understanding between data scientists and ops/data engineers — and shared responsibility for the machine learning models they build. That means increasing transparency, communication and collaboration across development, IT/Operations, and “the business.” You’ll know you’ve succeeded when your teams have started treating ML models as products, with owners and stakeholders feeling empowered along the way, and more of your models are making it to production — faster.
Productizing ML Models
Your entire org should treat each model you deploy like a product. Decide what you are *jointly* working toward with your engineering, product, and revenue counterparts. To do this, you’ll have to write out goals and requirements as well as the business or customer-facing outcomes you intend to create. Instead of generic goals, like better accuracy and precision or arbitrarily faster deployment, tie your goals to what makes your features, model and code into a product.
Building products includes things like finding product-market fit, gathering customer feedback, and measuring the intended and unintended outcomes. For an ML model you often have more than one outcome to measure: the first is the accuracy of your predictions with respect to the business outcome, and the second is the disparate impact on privileged and underprivileged groups.
Set goals by answering questions specific to what you’re building for customers. Let’s say you are trying to predict credit card fraud. Answers to some common questions might look like:
- What product promises have you made? Typical product promises you might make, like these from Chase, would be “We monitor your account 24/7 using sophisticated real time fraud monitoring and can text, email or call you if there’s any unusual activity on your account.” and “With our Zero Liability Protection you won’t be held responsible for unauthorized charges made with your card or account information.”
- How fast is fast for your application? Typically, fast looks like predicting fraud as a purchase is being made and denying it before the charge is approved, then contacting the customer with enough time for them to approve it before they walk away from the register.
- How accurate do the models need to be? In this case a bad impact of a false positive might be that I can’t buy necessary medications, food, or gas. But a false negative means that the bank could lose money by approving a transaction that is later marked as fraud by the customer or by the bank after using a series of transactions to flag potential fraud.
- How often do the behavioral patterns you are trying to measure change?
  - Credit card fraud patterns change seasonally, annually with promotions and cost-of-living increases, and in response to events. Events such as the pandemic, predicted hurricanes, or the outcome of an election are much less predictable.
  - The evolution of technology can also change the likelihood of fraud. When chip and PIN was introduced, the theory was that in most cases fraud wasn’t possible when using a chip, but this is no longer the case.
- What groups benefit from your model and what groups are disadvantaged? Depending on the choices I make in feature engineering for these predictions, different groups may be disadvantaged more than others. We would need to measure disparate impact to know for sure.
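To make the last question concrete, here is a minimal sketch of one common way to measure disparate impact: the ratio of favorable-outcome rates between two groups. It assumes the favorable outcome is a transaction being approved, and the group names, record layout and numbers are all made up for illustration. A ratio below roughly 0.8 is a commonly cited rule of thumb for concern (borrowed from US employment guidelines), not a threshold this series prescribes.

```python
from collections import defaultdict


def approval_rates(records):
    """Return the share of approved transactions for each group.

    Each record is a hypothetical (group, was_approved) pair.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in records:
        totals[group] += 1
        if was_approved:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact(records, unprivileged, privileged):
    """Ratio of the unprivileged group's approval rate to the privileged group's."""
    rates = approval_rates(records)
    if rates[privileged] == 0:
        raise ValueError("privileged group has no approvals; ratio is undefined")
    return rates[unprivileged] / rates[privileged]


# Illustrative, made-up outcomes: 90% of group_a and 72% of group_b approved.
records = (
    [("group_a", True)] * 90 + [("group_a", False)] * 10
    + [("group_b", True)] * 72 + [("group_b", False)] * 28
)

ratio = disparate_impact(records, unprivileged="group_b", privileged="group_a")
print(f"disparate impact ratio: {ratio:.2f}")
```

A ratio close to 1.0 means both groups are approved at similar rates. Which groups you compare, and what counts as the favorable outcome, are product decisions your teams should make together.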
As you can tell from the example above, this will be an iterative process. For any product, start by asking what impact false positive and false negative predictions have on the subjects of the prediction and on the business. Then decide specifically how you will measure these metrics and surface them directly to all teams, where they can find them in their workflows. Finally, enable those teams to form hypotheses about how to improve on the goals, and to test them iteratively.
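One way to put numbers on the false positive and false negative question is to count both kinds of error at a given decision threshold and attach a rough cost to each. The sketch below is only an illustration: the scores, labels and per-error costs are invented, and real costs (a blocked grocery purchase, a fraudulent charge the bank absorbs) would have to come from your product and revenue counterparts.

```python
def confusion_counts(labels, scores, threshold):
    """Count outcomes when every score at or above the threshold is declined."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for is_fraud, score in zip(labels, scores):
        flagged = score >= threshold
        if flagged and is_fraud:
            counts["tp"] += 1  # fraud correctly declined
        elif flagged:
            counts["fp"] += 1  # legitimate purchase declined
        elif is_fraud:
            counts["fn"] += 1  # fraud approved
        else:
            counts["tn"] += 1  # legitimate purchase approved
    return counts


def error_summary(labels, scores, threshold, cost_fp, cost_fn):
    """Summarize error rates and an estimated total cost at one threshold."""
    c = confusion_counts(labels, scores, threshold)
    legit = c["fp"] + c["tn"]
    fraud = c["tp"] + c["fn"]
    return {
        "false_positive_rate": c["fp"] / legit if legit else 0.0,
        "false_negative_rate": c["fn"] / fraud if fraud else 0.0,
        "estimated_cost": c["fp"] * cost_fp + c["fn"] * cost_fn,
    }


# Made-up example: True means the transaction was later confirmed as fraud.
labels = [False, False, False, False, True, True]
scores = [0.05, 0.20, 0.55, 0.70, 0.65, 0.95]

# Hypothetical costs: 5 per wrongly declined purchase, 120 per missed fraud.
for threshold in (0.5, 0.6, 0.9):
    summary = error_summary(labels, scores, threshold, cost_fp=5.0, cost_fn=120.0)
    print(threshold, summary)
```

Comparing a few thresholds side by side like this gives every team the same view of the trade-off, which makes it easier to form and test hypotheses about where the threshold should sit.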
For more reading on enabling product driven hypotheses, one of my favorite product books offers a framework for creating products and services grounded in behavioral science. It includes a section on how to do an ethical check and measure if your impact matches your intended outcome — and what to do when it doesn’t.
This subtle culture shift will allow you and your data science team to row in the same direction as the rest of your company. It lets you make a business case for adding new features to your models, iteratively improve deployed models, and prioritize ML pipeline debt with other teams. At the end of the day (or cycle) you’ll be able to see the impact of your work and validate that the models you’re shipping have met the business goals set in your initial hypothesis.