Spark AI Summit 2020 recap for data scientists
A guide to the three major themes from this year’s Summit: MLOps, feature stores, and deep learning.
Originally posted on Kaskada’s “Machine Learning Insights” blog here
The era of the data scientist has arrived! It seems like a dream come true. For years, we data scientists haven’t been able to spend our time focused on what we’re trained for: feature engineering and training models.
In 2019 I remember looking at the Spark Summit sessions to make the business case for a team of data scientists to attend, but we ended up sending our data engineers instead. (If you’re interested in a 2019 recap, this one is good).
This year, more than 30% of sessions addressed data science topics relevant to our readership, compared to only 4% in 2019! In sheer volume, that’s 64 sessions compared to 7 in 2019. So, I’ve decided to write a recap to help you navigate the 32 hours of video content that you may be tempted to start binge-watching.
There were 3 major themes for data scientists: applying MLOps everywhere (also referred to as CI/CD, productionizing [insert model here], orchestrating pipelines, or ML lifecycle acceleration), introducing feature stores, and experimenting with deep learning in production at scale.
Applying MLOps everywhere
An impressive line-up of sessions talked about *many* different tools folks were using to accelerate various parts of their machine learning operations (MLOps). The number one set of tools mentioned, of course, was Spark (this is, after all, the Spark + AI Summit), followed by how folks were using tools like MLflow in their pipelines.
MLOps is more than just technology, however; it also addresses the processes and culture of an organization. If you’re new to the concept, check out part one of my blog series for data scientists on MLOps.
One of the most interesting talks on the culture side of MLOps was Zalando’s session, Data Mesh In Practice: How Europe’s Leading Online Platform For Fashion goes beyond the Data Lake. The session was presented by Zalando’s lead data engineer and a lead developer at ThoughtWorks, a global software consultancy. They touched on governance, data mesh tools, adoption, data formatting, cataloging, security, dependencies, resource management, ownership, and culture change. It’s a great example of how cross-functional teams take responsibility for data products.
Other sessions worth watching on MLOps:
- Productionalizing Models through CI/CD Design with MLflow
- Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
- Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS Sagemaker for Enterprise AI Scenarios
Introducing feature stores
While MLOps promises to bring CI/CD to ML model deployment, up until recently, MLOps consisted of separating a data scientist’s workflow (using offline data and models) from production environments.
Enter the concept of feature stores and shared data pipeline serving to reach parity between training and production models. Just a year ago at Spark 2019, no one was talking about the term “feature store” explicitly, but a few large tech companies were talking about the architectures they used to address some of the challenges of productionizing ML applications. Now, four sessions were dedicated to feature stores, encompassing different technical approaches and philosophies for unifying training and production environments. Still, the role of the feature store is ambiguous, leaving companies to wonder what exactly they might gain.
In each session, presenters took a different approach to feature stores, ranging from simply creating a registry of features in production to automating feature data delivery for ML pipelines, automating ML pipeline construction, or automating ETL/feature engineering. The consensus: feature stores address many stages (and challenges) of productionizing ML applications, and it would be great if there were something to buy instead of build, or a set of standards to follow. Data scientists want to focus on their job of feature engineering, not on the infrastructure.
It’s worth watching the session on the Killer Feature Store: Orchestrating Spark ML pipelines and MLflow for production. Slides 5–8 do a great job of highlighting how today’s feature stores differ in approach: some automate feature data delivery to ML pipelines, others automate ETL/feature engineering, and some only do feature orchestration. In the Q&A, folks wanted to know: should they build their own feature store or buy one? It’s a great question, and the presenter answered, “build what accelerates you, buy what differentiates you”.
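To make the "parity between training and production" idea concrete, here is a minimal sketch of the simplest approach mentioned above, a registry of feature definitions. It is not taken from any of the sessions: `FeatureRegistry`, its methods, and the two example features are hypothetical names invented for illustration. The point is only that the offline (training) path and the online (serving) path call the same definitions, so they cannot drift apart.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch: none of these names come from a real feature store.
Row = Dict[str, float]


@dataclass
class FeatureRegistry:
    """Holds one definition per feature, shared by training and serving."""

    _features: Dict[str, Callable[[Row], float]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[Row], float]) -> None:
        # Refuse duplicates so a feature name always maps to one definition.
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = fn

    def compute(self, row: Row) -> Dict[str, float]:
        """Online path: compute every registered feature for one raw record."""
        return {name: fn(row) for name, fn in self._features.items()}

    def compute_batch(self, rows: List[Row]) -> List[Dict[str, float]]:
        """Offline path: build training rows by reusing the online logic."""
        return [self.compute(row) for row in rows]


registry = FeatureRegistry()
registry.register("avg_order_value", lambda r: r["total_spend"] / max(r["orders"], 1))
registry.register("is_repeat_customer", lambda r: float(r["orders"] > 1))

# Made-up historical records standing in for an offline data source.
history = [
    {"total_spend": 120.0, "orders": 3},
    {"total_spend": 40.0, "orders": 1},
]

training_rows = registry.compute_batch(history)  # used to train a model
serving_row = registry.compute(history[0])       # used at prediction time
```

A production feature store adds storage, backfills, and low-latency lookups on top of this; the sketch covers only the shared-definition part.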
Other sessions worth watching on feature stores:
- Building a Feature Store around Dataframes and Apache Spark
- Building a Real-Time Feature Store at iFood
- Zipline — A Declarative Feature Engineering Framework
Deep learning at scale
There were nine sessions using deep learning across many domains: price action, clinical language understanding, wood log inventory, text extraction, and scalable data prep.
The most application-agnostic session was by Nick Pentreath, a principal engineer in IBM’s Center for Open-source Data & AI Technology (CODAIT). He spoke methodically about scaling up by scaling down — touching on batching; quantization and other methods for trading off computational cost at training vs. inference performance; and architecture optimization and graph manipulation approaches. If you’re new to deep learning challenges and best practices, start here.
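For readers who have not met quantization before, the sketch below shows one common variant, symmetric linear quantization to 8-bit integers, in plain Python. It is an illustration under my own assumptions rather than the method presented in the talk: the function names and the sample weights are made up, and real frameworks apply this per tensor or per channel with calibrated ranges.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up example weights.
weights = [0.82, -1.27, 0.003, 0.5, -0.31]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, and the price is a rounding error of at most half of `scale` per weight. Small values such as `0.003` collapse to zero, which is the kind of accuracy cost that has to be weighed against the cheaper inference.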
Other interesting sessions worth watching on deep learning:
- How (Not) To Scale Deep Learning in 6 Easy Steps
- Productionizing Deep Reinforcement Learning with Spark and MLflow
- End-to-End Deep Learning with Horovod on Apache Spark
What was missing?
One topic I was surprised to see missing from the data science content lineup was accountability and ethics. One exception was Dr. Lazzeri’s excellent session on the importance of model fairness and interpretability in AI systems. She used an open-source interpretability toolkit, InterpretML, to show how to unpack machine learning models, gain insights into how and why they produce specific results, assess your AI system’s fairness, and mitigate any observed fairness issues.
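As a rough idea of what "assessing fairness" can mean in practice, the sketch below computes one widely used metric, the demographic parity difference: the gap between the highest and lowest rate of positive predictions across groups. It is plain Python and does not use or reproduce InterpretML's API; the function names, predictions, and group labels are all invented for illustration.

```python
from typing import Dict, List


def selection_rates(predictions: List[int], groups: List[str]) -> Dict[str, float]:
    """Fraction of positive predictions (1s) within each group."""
    totals: Dict[str, int] = {}
    positives: Dict[str, int] = {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions: List[int], groups: List[str]) -> float:
    """Gap between the most and least favoured group; 0.0 means equal rates."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Made-up binary predictions for two hypothetical groups, "a" and "b".
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(predictions, groups)
gap = demographic_parity_difference(predictions, groups)
```

Here group "a" receives a positive prediction 75% of the time and group "b" 25% of the time, so the gap is 0.5. A single number like this is only a starting point; which metric is appropriate depends on the application.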
It’s worth mentioning that last year there was a focus on fairness, transparency, accountability and ethics in AI that was almost missing from this year’s content. Check out the recorded sessions from 2019 for more conversations in this arena.
I hope this overview helps you navigate some of the more valuable content for data scientists! There are many more sessions, so if you run out of things to learn, head on over to the main page to browse the 11 other tracks.