Operational Machine Learning on Google Cloud

Introduction

In modern business, nearly every company is either already leveraging Machine Learning (ML) or has ML or AI on its roadmap. With cloud computing available to minimize the time, money, and resources needed to get started, we are seeing increased adoption and democratization of ML. Practitioners with the skills to develop ML algorithms are entering the workforce in growing numbers, and corporations are feeling internal pressure to leverage these “life-altering” technological advancements.

As pressure increases, corporations around the world are investing heavily in advanced analytics solutions to solve their most pressing business problems, personalize customer experiences, or augment their products with intelligence. Yet with all this hype, VentureBeat reports that “More than 87% of data science projects never make it into production.” That is a startling figure, and it shows how many corporations are heading down the path of ML without the wherewithal to fully execute.

Challenges

We have identified the following challenges that are impeding a company’s ability to operationalize Machine Learning:

  • Deployment Challenges: In traditional IT, the behavior of a system is defined by code. With ML, the behavior is defined by data. This adds extra risks that must be mitigated.
  • Poor Data Management: As data profiles evolve over time, ML models are susceptible to degradation.
  • Lack of Automation: Without automated monitoring and retraining, a model can be deployed once and never updated, eventually suffering from model degradation.
  • Trusting ML Results: Black box results introduce barriers for executives and decision makers to trust the outputs of the ML model.
  • Broken Education System: Undergraduate, graduate, and bootcamp programs do not teach students how to deploy models into production. Enterprises typically assign a team of data scientists or ML researchers to develop their ML solutions, but because of what is taught in higher education, the majority of ML practitioners never hone the software engineering skills required to build production-class services.

As a result of the above challenges, most organizations live on the left-hand side of the ML Adoption Curve, where companies either use machine learning for ad hoc analysis and deliver results via manual reports, or dedicate a team of data scientists to manually triggering their ML pipelines to return predictions.

ML Adoption Curve

Solution: MLOps

MLOps is the intersection of machine learning (data processing, model deployment, metric evaluation), development (CI/CD, model integration, testing), and operations (retraining pipelines, continuous monitoring, model delivery). MLOps is more than creating sophisticated algorithms, which is what the majority of ML practitioners are accustomed to; it is about implementing a supporting cast of technologies to control, manage, and monitor your Machine Learning workloads. In addition to the typical CI/CD process found in enterprise DevOps practices, Machine Learning Operations adds CT and CM to the equation:

  • Continuous Training (CT): Deploying packages to automatically retrain and serve models as production model performance degrades or new data is collected (see the sketch following this list).
  • Continuous Monitoring (CM): Systems and reporting that continuously monitor data inputs and ongoing model performance against test versions, along with explainable AI to reduce bias and ensure fairness.
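
To make CT concrete, the following is a minimal sketch of the decision a scheduled check might make before kicking off retraining. The thresholds and the trigger call are illustrative assumptions, not a prescribed API:

    # Illustrative continuous-training check: retrain when production performance
    # degrades or enough new data has been collected. Values are placeholders.
    ACCURACY_THRESHOLD = 0.85
    NEW_ROWS_THRESHOLD = 10_000


    def should_retrain(current_accuracy: float, new_rows_ingested: int) -> bool:
        """Decide whether the packaged retraining pipeline should run this cycle."""
        performance_degraded = current_accuracy < ACCURACY_THRESHOLD
        fresh_data_available = new_rows_ingested >= NEW_ROWS_THRESHOLD
        return performance_degraded or fresh_data_available


    if should_retrain(current_accuracy=0.81, new_rows_ingested=0):
        # In production this branch would submit the packaged retraining pipeline
        # (see the CI/CD sketch later in this document) rather than print.
        print("Triggering retraining pipeline")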

Fortunately, Google Cloud Platform provides a suite of services that help reduce the complexity of managing Machine Learning systems at scale. Below is a diagram showing which tools can be used across all facets of a production-grade ML system.

ML System

How to Implement

For companies that want to progress from manual model serving and infrequent model updates, ML Pipeline Automation is a natural next step. It requires a separation of training and serving environments, analogous to development and production. The training environment (above the red line) contains the model development workflows and training pipeline packages. In this environment, a team of data scientists collaborates to develop ML models and validate model performance before training at scale and releasing to production. All their code can be shared and reused in a source code repository, such as GitLab, and packaged for retraining as new data is ingested, as sketched below.
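
The following is a minimal sketch of what such a packaged training pipeline could look like, assuming the Kubeflow Pipelines SDK (kfp v2) as the orchestration framework; the component names, base images, and parameters are illustrative, and the step bodies are intentionally left as placeholders:

    # A packaged training pipeline: reusable components a data science team can
    # keep in source control and rerun whenever new data is ingested.
    from kfp import dsl


    @dsl.component(base_image="python:3.10")
    def train_model(training_data_uri: str, model_dir: str) -> str:
        # Placeholder: fit the model on the ingested data and write the
        # artifact to model_dir (implementation omitted in this sketch).
        ...
        return model_dir


    @dsl.component(base_image="python:3.10")
    def validate_model(model_dir: str, eval_data_uri: str) -> float:
        # Placeholder: score the candidate model on hold-out data and return
        # the metric used to gate release to production.
        ...
        return 0.0


    @dsl.pipeline(name="training-pipeline")
    def training_pipeline(training_data_uri: str, eval_data_uri: str, model_dir: str):
        trained = train_model(training_data_uri=training_data_uri, model_dir=model_dir)
        validate_model(model_dir=trained.output, eval_data_uri=eval_data_uri)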

In the serving environment, application data is ingested and stored in our feature store, then run through the model serving pipeline for real-time inference. The model predictions are delivered back to the business application and also written to a datastore (BigQuery). Connecting a reporting tool to our datastores allows us to surface business-facing reports as well as ML monitoring dashboards. Leveraging an alerting protocol, we can notify the data science team whenever the underlying data fundamentally changes or the model begins to drift.
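
As a rough illustration of the serving-side handoff described above, the sketch below appends each prediction to BigQuery so the reporting and monitoring dashboards can read it; the table and field names are assumptions, not part of the reference architecture:

    # Log each real-time prediction to BigQuery for downstream reporting,
    # monitoring, and human-in-the-loop feedback joins.
    import uuid
    from datetime import datetime, timezone

    from google.cloud import bigquery

    PREDICTIONS_TABLE = "my-project.ml_serving.predictions"  # hypothetical table


    def log_prediction(bq_client: bigquery.Client, features: dict,
                       predicted_label: str, model_version: str) -> str:
        """Append one prediction record to the datastore behind the dashboards."""
        prediction_id = str(uuid.uuid4())
        row = {
            "prediction_id": prediction_id,
            "predicted_at": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "features": str(features),  # flattened here purely for illustration
            "predicted_label": predicted_label,
        }
        errors = bq_client.insert_rows_json(PREDICTIONS_TABLE, [row])
        if errors:
            # Surface streaming-insert failures so the alerting protocol notices them.
            raise RuntimeError(f"BigQuery insert failed: {errors}")
        return prediction_id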

More advanced companies, or companies with a product that must update constantly, may want to undertake a fully automated pipeline with CI/CD integrations. This implementation, as seen below, is similar to ML Pipeline Automation but adds a robust, automated CI/CD system. In this architecture, data scientists can rapidly explore data and experiment with new models. As they make changes to their ML pipelines, they can automatically build, test, and deploy the new pipeline components to the target environment.
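
A CI/CD job's build-and-deploy step might look roughly like the sketch below, assuming the pipeline from the earlier training sketch, kfp for compilation, and Vertex AI Pipelines as the target environment; the document does not prescribe a specific orchestrator, so these choices and the module name are assumptions:

    # Automated "build and deploy" of a changed ML pipeline, typically invoked by
    # the CI/CD system after tests pass on a merge to the main branch.
    from google.cloud import aiplatform
    from kfp import compiler

    from training_pipeline import training_pipeline  # hypothetical module in the source repo


    def build_and_deploy(project: str, region: str, pipeline_root: str) -> None:
        # Build: compile the pipeline definition into a deployable artifact.
        compiler.Compiler().compile(
            pipeline_func=training_pipeline,
            package_path="training_pipeline.json",
        )

        # Deploy: submit the compiled pipeline to the target environment.
        aiplatform.init(project=project, location=region)
        job = aiplatform.PipelineJob(
            display_name="training-pipeline",
            template_path="training_pipeline.json",
            pipeline_root=pipeline_root,
        )
        job.submit()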

The following diagram shows the implementation of the ML pipeline using CI/CD, which has the characteristics of the automated ML pipeline setup plus the automated CI/CD routines.

CI/CD Pipeline Automation on GCP

MLOps Accelerators

The right Operational ML architecture, as part of an App Ecosystem, is as important as the kernel is to an operating system. It enables limitless architecture, perpetual innovation, and adaptable components that can be added or released as the business changes. Pandera offers a catalog of templated reporting tools to supplement ML Operations on Google Cloud Platform. The dashboards are built using Looker and cover the following topics:

  • Data Monitoring: By monitoring the data, organizations can more accurately assess their models and know when to retrain them. Using Looker’s built-in alerting capabilities, data science teams can automatically notify stakeholders when significant deviations from the norm are detected. Monitoring data also allows teams to do retrospective analysis on outliers and potentially identify bugs in upstream processes.
  • Model Monitoring: Storing all model prediction results in BigQuery and implementing a human-in-the-loop feedback system allows data science teams to report on continuous model performance (a query sketch follows this list). Continuous evaluations highlight the model’s health over time as it drifts with changing data and conditions. When performance degrades past a designated threshold, alerts are sent out to bring attention to the issue. A data scientist can then investigate to discover the cause and remediate the problem. Mature data science teams can install a trigger for automated model retraining rather than relying on alerts alone.
  • AI Explanation: GCP provides the services and framework to extract explanations from your Machine Learning models, and we provide a clean interface to view those explanations. The Explainable AI dashboards give data scientists a means to interpret predictions made by their machine learning models for debugging and model enhancement, and help business stakeholders understand their models’ behavior. Within this dashboard you can look at feature attributions against aggregated predictions or individual predictions, or use a what-if simulation to investigate model behavior based on variable inputs.
  • Smart Analytics: We have created a series of Smart Apps (dashboards) targeting specific horizontal and vertical industry intersections. The goal of the Smart Apps is to rapidly turn your data into Smart Metrics that activate Smart Signals and lead to Smart Actions. The Smart Analytics Playbook enables organizations to achieve faster time to insight, delivering ML-driven prescriptive actions that anticipate what you need to know next and provide the most relevant answers for your business. Leveraging enterprise data and ML outputs in GCP, we can augment intelligence into an easy-to-deploy Smart Analytics stack. Eliminate the barrier to scalable, actionable intelligence, and start taking smart actions that drive business impact.
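
As an example of the continuous-evaluation logic behind the model monitoring dashboard, the sketch below joins logged predictions with human-in-the-loop feedback in BigQuery to compute a daily accuracy that Looker can chart and alert on; the table and column names are illustrative assumptions:

    # Rolling model accuracy from logged predictions joined with feedback labels.
    from google.cloud import bigquery

    DAILY_ACCURACY_SQL = """
    SELECT
      DATE(p.predicted_at)                                    AS prediction_date,
      p.model_version,
      AVG(CAST(p.predicted_label = f.actual_label AS INT64))  AS daily_accuracy
    FROM `my-project.ml_serving.predictions` AS p
    JOIN `my-project.ml_serving.feedback`    AS f
      ON p.prediction_id = f.prediction_id
    WHERE p.predicted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY prediction_date, p.model_version
    ORDER BY prediction_date
    """


    def daily_accuracy(client: bigquery.Client):
        """Return the last 30 days of accuracy, one row per day and model version."""
        return list(client.query(DAILY_ACCURACY_SQL).result())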

Pandera MLOps

You can view a video presentation on this topic here: https://cloudonair.withgoogle.com/events/operational-machine-learning

Pandera Systems is a highly specialized analytics and technology consulting firm with a core focus on developing data-driven solutions. As a Certified Google Partner, we bring together the most advanced software engineers, data scientists, and technologists to help transform the world’s leading brands.

Learn more about our Data Science Services.