Guidelines for developing high-quality, predictive ML solutions

This document collates some guidelines to help you assess, ensure, and control quality in building predictive machine learning (ML) solutions. It provides suggestions for every step of the process, from developing your ML models to deploying your training systems and serving systems to production. The document extends the information that's discussed in Practitioners Guide to MLOps by highlighting and distilling the quality aspects in each process of the MLOps lifecycle.

This document is intended for anyone who is involved in building, deploying, and operating ML solutions. The document assumes that you're familiar with MLOps in general. It does not assume that you have knowledge of any specific ML platform.

Overview of machine learning solution quality

In software engineering, many standards, processes, tools, and practices have been developed to ensure software quality. The goal is to make sure that the software works as intended in production, and that it meets both functional and non-functional requirements. These practices cover topics like software testing, software verification and validation, and software logging and monitoring. In DevOps, these practices are typically integrated and automated in CI/CD processes.

MLOps is a set of standardized processes and capabilities for building, deploying, and operating ML systems rapidly and reliably. As with other software solutions, ML software solutions require you to integrate these software quality practices and apply them throughout the MLOps lifecycle. By applying these practices, you help ensure that your models are trustworthy and predictable, and that they conform to your requirements.

However, the tasks of building, deploying, and operating ML systems present additional challenges that require certain quality practices that might not be relevant to other software systems. In addition to the characteristics of most other software systems, ML systems have the following characteristics:

  • Data-dependent systems. The quality of the trained models and of their predictions depends on the validity of the data that's used for training and that's submitted for prediction requests. Any software system depends on valid data, but ML systems deduce the logic for decision-making from the data automatically, so they are particularly dependent on the quality of the data.

  • Dual training-serving systems. ML workloads typically consist of two distinct but related production systems: the training system and the serving system. A continuous training pipeline produces newly trained models that are then deployed for prediction serving. Each system requires a different set of quality practices that balance effectiveness and efficiency in order to produce and maintain a performant model in production. In addition, inconsistencies between these two systems result in errors and poor predictive performance.

  • Prone to staleness. Models often degrade after they're deployed in production because the models fail to adapt to changes in the environment that they represent, such as seasonal changes in purchase behavior. The models can also fail to adapt to changes in data, such as new products and locations. Thus, keeping track of the effectiveness of the model in production is an additional challenge for ML systems.

  • Automated decision-making systems. Unlike other software systems, where actions are carefully hand-coded for a set of requirements and business rules, ML models learn rules from data to make a decision. Implicit bias in the data can lead models to produce unfair outcomes.

When a deployed ML model produces bad predictions, the poor ML quality can be the result of a wide range of problems. Some of these problems can arise from the typical bugs that are in any program. But ML-specific problems can also include data skews and anomalies, along with the absence of proper model evaluation and validation procedures as a part of the training process. Another potential issue is inconsistent data format between the model's built-in interface and the serving API. In addition, model performance degrades over time even without these problems, and it can fail silently if it's not properly monitored. Therefore, you should include different kinds of testing and monitoring for ML models and systems during development, during deployment, and in production.

Quality guidelines for model development

When you develop an ML model during the experimentation phase, you have the following two sets of target metrics that you can use to assess the model's performance:

  • The model's optimizing metrics. These metrics reflect the model's predictive effectiveness. Examples include accuracy and f-measure in classification tasks, mean absolute percentage error in regression and forecasting tasks, discounted cumulative gain in ranking tasks, and perplexity and BLEU scores in language models. The better the metric value, the better the model performs for a given task. In some use cases, to ensure fairness, it's important to achieve similar predictive effectiveness on different slices of the data—for example, on different customer demographics.
  • The model's satisficing metrics. These metrics reflect an operational constraint that the model needs to satisfy, such as prediction latency. You set a latency threshold to a particular value, such as 200 milliseconds, and any model that doesn't meet the threshold is not accepted. Another example of a satisficing metric is the size of the model, which is important when you want to deploy your model to low-powered hardware like mobile and embedded devices.

During experimentation, you develop, train, evaluate, and debug your model to improve its effectiveness with respect to the optimizing metrics, without violating the satisficing metric thresholds.
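
For example, a minimal sketch of such a check might look like the following. It assumes a scikit-learn-style classifier; the 0.85 accuracy floor and the 200 ms latency budget are illustrative placeholders, not recommended values.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical thresholds: adjust to your own optimizing and satisficing targets.
MIN_ACCURACY = 0.85          # optimizing-metric floor
MAX_LATENCY_MS = 200.0       # satisficing-metric budget per prediction batch

def check_candidate(model, x_eval, y_eval):
    """Return True if the model meets both the optimizing and satisficing thresholds."""
    start = time.perf_counter()
    predictions = model.predict(x_eval)
    latency_ms = (time.perf_counter() - start) * 1000

    accuracy = accuracy_score(y_eval, predictions)
    return accuracy >= MIN_ACCURACY and latency_ms <= MAX_LATENCY_MS

# Illustrative usage with synthetic data.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=(500, 8))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
model = LogisticRegression().fit(x[:400], y[:400])
print(check_candidate(model, x[400:], y[400:]))
```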

Guidelines for experimentation

  • Have predefined and fixed thresholds for optimizing metrics and for satisficing metrics.
  • Implement a streamlined evaluation routine that takes a model and data and produces a set of evaluation metrics. Implement the routine so it works regardless of the type of the model (for example, decision trees or neural networks) or the model's framework (for example, TensorFlow or Scikit-learn). A sketch of such a routine follows this list.
  • Make sure that you have a baseline model to compare with. This baseline can consist of hardcoded heuristics, or it can be a simple model that predicts the mean or the mode target value. Use the baseline model to check the performance of the ML model. If the ML model isn't better than the baseline model, there is a fundamental problem in the ML model.
  • Track every experiment that has been done to help you with reproducibility and incremental improvement. For each experiment, store hyperparameter values, feature selection, and random seeds.
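
The following is a minimal sketch of a framework-agnostic evaluation routine, assuming a classification task. It accepts any callable that maps inputs to predicted labels, so the same routine works whether the underlying model is a scikit-learn estimator, a TensorFlow model, or a hardcoded baseline; the metric set shown is illustrative.

```python
from typing import Callable, Dict
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(predict_fn: Callable[[np.ndarray], np.ndarray],
             x_eval: np.ndarray,
             y_eval: np.ndarray) -> Dict[str, float]:
    """Compute evaluation metrics for any model exposed as a predict function."""
    predictions = np.asarray(predict_fn(x_eval))
    return {
        "accuracy": accuracy_score(y_eval, predictions),
        "f1": f1_score(y_eval, predictions, average="macro"),
    }

def majority_baseline(y_train: np.ndarray) -> Callable[[np.ndarray], np.ndarray]:
    """A majority-class baseline expressed as the same kind of callable."""
    mode = np.bincount(y_train).argmax()
    return lambda x: np.full(len(x), mode)

# Usage: evaluate(model.predict, x_test, y_test) for a scikit-learn model, or
# evaluate(lambda x: tf_model.predict(x).argmax(axis=1), x_test, y_test) for Keras.
```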

Guidelines for data quality

  • Address any imbalanced classes early in your experiments by choosing the right evaluation metric. In addition, apply techniques like upweighting minority class instances or downsampling majority class instances.
  • Make sure that you understand the data source at hand, and perform the relevant data preprocessing and feature engineering to prepare the training dataset. This type of process needs to be repeatable and automatable.
  • Make sure that you have a separate testing data split (holdout) for the final evaluation of the model. The test split should not be seen during training, and don't use it for hyperparameter tuning.
  • Make sure that training, validation, and test splits are equally representative of your input data. Sampling such a test split depends on the nature of the data and of the ML task at hand. For example, stratified splitting is relevant to classification tasks, while chronological splitting is relevant to time-series tasks.
  • Make sure that the validation and test splits are preprocessed separately from the training data split. If the splits are preprocessed in a mixture, it leads to data leakage. For example, when you use statistics to transform data for normalization or for bucketizing numerical features, compute the statistics from the training data and apply them to normalize the validation and test splits, as shown in the sketch after this list.
  • Generate a dataset schema that includes the data types and some statistical properties of the features. You can use this schema to find anomalous or invalid data during experimentation and training.
  • Make sure that your training data is properly shuffled in batches, but that it also still meets the model training requirements. For example, this task can apply to positive and negative instance distributions.
  • Have a separate validation dataset for hyperparameter tuning and model selection. You can also use the validation dataset to perform early stopping. Otherwise, you can let the model train for the entirety of the given set of maximum iterations. However, only save a new snapshot of the model if its performance on the validation dataset improves relative to the previous snapshot.
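
As a minimal sketch of leakage-free preprocessing, the following fits a normalization transform on the training split only and then applies it to the validation and test splits. It assumes scikit-learn and numeric features; the split proportions are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=7)
features = rng.normal(loc=10.0, scale=3.0, size=(1000, 5))
labels = rng.integers(0, 2, size=1000)

# Hold out a test split first, then carve a validation split from the remainder.
x_train_val, x_test, y_train_val, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)
x_train, x_val, y_train, y_val = train_test_split(
    x_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val)

# Compute normalization statistics from the training split only...
scaler = StandardScaler().fit(x_train)

# ...and apply those statistics, unchanged, to the other splits.
x_train_norm = scaler.transform(x_train)
x_val_norm = scaler.transform(x_val)
x_test_norm = scaler.transform(x_test)
```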

Guidelines for model quality

  • Make sure that your models don't have any fundamental problems that prevent them from learning any relationship between the inputs and the outputs. You can achieve this goal by training the model with very few examples. If the model doesn't achieve high accuracy for these examples, there might be a bug in your model implementation or training routine. A sketch of this check follows this list.
  • When you're training neural networks, monitor for NaN values in your loss and for the percentage of weights that have zero values throughout your model training. These NaN or zero values can be indications of erroneous arithmetic calculations, or of vanishing or exploding gradients. Visualizing changes in weight-values distribution over time can help you detect the internal covariate shifts that slow down the training. You can apply batch normalization to alleviate this reduction in speed.
  • Compare your model performance on the training data and on the test data to understand if your model is overfitting or underfitting. If you see either of these issues, perform the relevant improvements. For example, if there is underfitting, you might increase the model's learning capacity. If there is overfitting, you might apply regularization.
  • Analyze misclassified instances, especially the instances that have high prediction confidence, and the most-confused classes in the multi-class confusion matrix. These errors can be an indication of mislabeled training examples. The errors can also identify an opportunity for data preprocessing, such as removing outliers, or for creating new features to help discriminate between such classes.
  • Analyze feature importance scores and clean up features that don't add enough improvement to the model's quality. Parsimonious models are preferred over complex ones.
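
The following is a minimal sketch of the learn-a-few-examples check described in the first item of this list, assuming a TensorFlow/Keras classifier; the architecture, training budget, and the 0.9 accuracy floor are placeholders for your own model and expectations.

```python
import numpy as np
import tensorflow as tf

# A handful of examples: a healthy model should be able to memorize them.
rng = np.random.default_rng(seed=1)
x_tiny = rng.normal(size=(16, 10)).astype("float32")
y_tiny = rng.integers(0, 2, size=(16,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_tiny, y_tiny, batch_size=16, epochs=500, verbose=0)

_, accuracy = model.evaluate(x_tiny, y_tiny, verbose=0)
assert accuracy >= 0.9, (
    "The model cannot fit a handful of examples; check the model "
    "implementation and the training routine for bugs.")
```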

Quality guidelines for training pipeline deployment

As you implement your model and model training pipeline, you need to create a set of tests in a CI/CD routine. These tests run automatically as you push new code changes, or they run before you deploy your training pipeline to the target environment.

Guidelines

  • Unit-test the feature engineering functionality.
  • Unit-test the encoding of the inputs to the model.
  • Unit-test user-implemented (custom) modules of the models independently—for example, unit-test custom graph convolution and pooling layers, or custom attention layers.
  • Unit-test any custom loss or evaluation functions.
  • Unit-test the output types and shapes of your model against expected inputs.
  • Unit-test that the fit function of the model works without any errors on a couple of small batches of data. The tests should make sure that the loss decreases and that the execution time of the training step is as expected. You make these checks because changes in model code can introduce bugs that slow down the training process. A sketch of such a test follows this list.
  • Unit-test the model's save and load functionality.
  • Unit-test the exported model-serving interfaces against raw inputs and against expected outputs.
  • Test the components of the pipeline steps with mock inputs and with output artifacts.
  • Deploy the pipeline to a test environment and perform integration testing of the end-to-end pipeline. For this process, use some testing data to make sure that the workflow executes properly throughout and that it produces the expected artifacts.
  • Use shadow deployment when you deploy a new version of the training pipeline to the production environment. A shadow deployment helps you make sure that the newly deployed pipeline version is executed on live data in parallel to the previous pipeline version.
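
For example, a fit-function test might look like the following minimal sketch. It uses pytest conventions and a throwaway Keras model; in practice you would substitute your own model factory and representative batches, and the training budget shown is illustrative.

```python
import numpy as np
import tensorflow as tf

def build_model() -> tf.keras.Model:
    # Placeholder for your own model factory.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model

def test_fit_decreases_loss():
    # A couple of small, learnable batches of synthetic data.
    rng = np.random.default_rng(seed=42)
    x = rng.normal(size=(32, 4)).astype("float32")
    y = (x.sum(axis=1, keepdims=True)
         + rng.normal(scale=0.1, size=(32, 1))).astype("float32")

    model = build_model()
    history = model.fit(x, y, batch_size=8, epochs=10, verbose=0)

    losses = history.history["loss"]
    assert losses[-1] < losses[0], "Loss did not decrease over a few small batches."
```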

Quality guidelines for continuous training

The continuous training process is about orchestrating and automating the execution of training pipelines. Typical training workflows include steps like data ingestion and splitting, data transformation, model training, model evaluation, and model registration. Some training pipelines consist of more complex workflows. Additional tasks can include performing self-supervised model training that uses unlabeled data, or building an approximate nearest neighbor index for embeddings. The main input of any training pipeline is new training data, and the main output is a new candidate model to deploy in production.

The training pipeline runs in production automatically, based on a schedule (for example, daily or weekly) or based on a trigger (for example, when new labeled data is available). Therefore, you need to add quality-control steps to the training workflow, specifically data-validation steps and model-validation steps. These steps validate the inputs and the outputs of the pipelines.

You add the data-validation step after the data-ingestion step in the training workflow. The data-validation step profiles the new input training data that's ingested into the pipeline. During profiling, the pipeline uses a predefined data schema, which was created during the ML development process, to detect anomalies. Depending on the use case, you can ignore or just remove some invalid records from the dataset. However, other issues in the newly ingested data might halt the execution of the training pipeline, so you must identify and address those issues.

Guidelines for data validation

  • Verify that the features of the extracted training data are complete and that they match the expected schema—that is, there are no missing features and no added ones. Also verify that features match the projected volumes. A sketch of such checks follows this list.
  • Validate the data types and the shapes of the features in the dataset that are ingested into the training pipeline.
  • Verify that the formats of particular features (for example, dates, times, URLs, postcodes, and IP addresses) match the expected regular expressions. Also verify that features fall within valid ranges.
  • Validate the maximum fraction of the missing values for each feature. A large fraction of missing values in a particular feature can affect the model training. Missing values usually indicate an unreliable feature source.
  • Validate the domains of the input features. For example, check if there are changes in a vocabulary of categorical features or changes in the range of numerical features, and adjust data preprocessing accordingly. As another example, ranges for numerical features might change if an update in the upstream system that populates the features uses different units of measure. For example, the upstream system might change currency from dollars to yen, or it might change distances from kilometers to meters.
  • Verify that the distributions of each feature match your expectations. For example, you might test that the most common value of a feature for payment type is cash and that this payment type accounts for 50% of all values. However, this test can fail if there's a change in the most common payment type to credit_card. An external change like this might require changes in your model.
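
The following is a minimal sketch of schema-style checks over a newly ingested batch, assuming pandas DataFrames; the expected schema, the valid range for trip_distance_km, and the 10% missing-value budget are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical schema captured during ML development.
EXPECTED_SCHEMA = {
    "payment_type": "object",
    "trip_distance_km": "float64",
    "passenger_count": "int64",
}
MAX_MISSING_FRACTION = 0.10
VALID_RANGES = {"trip_distance_km": (0.0, 500.0)}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable anomalies found in the ingested batch."""
    anomalies = []

    # No missing or unexpected features, and the expected data types.
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    added = set(df.columns) - set(EXPECTED_SCHEMA)
    if missing:
        anomalies.append(f"Missing features: {sorted(missing)}")
    if added:
        anomalies.append(f"Unexpected features: {sorted(added)}")
    for column, dtype in EXPECTED_SCHEMA.items():
        if column in df.columns and str(df[column].dtype) != dtype:
            anomalies.append(f"{column}: expected {dtype}, got {df[column].dtype}")

    # Missing-value fractions and numeric ranges.
    for column in set(EXPECTED_SCHEMA) & set(df.columns):
        fraction = df[column].isna().mean()
        if fraction > MAX_MISSING_FRACTION:
            anomalies.append(f"{column}: {fraction:.0%} missing values")
    for column, (low, high) in VALID_RANGES.items():
        if column in df.columns and not df[column].dropna().between(low, high).all():
            anomalies.append(f"{column}: values outside [{low}, {high}]")

    return anomalies
```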

You add a model validation step before the model registration step to make sure that only models that pass the validation criteria are registered for production deployment.

Guidelines for model validation

  • For the final model evaluation, use a separate test split that hasn't been used for model training or for hyperparameter tuning.
  • Score the candidate model against the test data split, compute the relevant evaluation metrics, and verify that the candidate model surpasses predefined quality thresholds.
  • Make sure that the test data split is representative of the data as a whole to account for varying data patterns. For time-series data, make sure that the test split contains more recent data than the training split.
  • Test model quality on important data slices like users by country or movies by genre. By testing on sliced data, you avoid a problem where fine-grained performance issues are masked by a global summary metric.
  • Evaluate the current (champion) model against the test data split, and compare it to the candidate (challenger) model that the training pipeline produces. A sketch of this comparison follows this list.
  • Validate the model against fairness indicators to detect implicit bias—for example, implicit bias might be induced by insufficient diversity in the training data. Fairness indicators can reveal root-cause issues that you must address before you deploy the model to production.
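
The following is a minimal sketch of a champion-challenger gate, assuming both models expose a scikit-learn-style predict method and accuracy is the relevant metric; the absolute quality floor and the blessing logic are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical absolute quality floor, in addition to beating the champion.
MIN_TEST_ACCURACY = 0.80

def validate_challenger(champion, challenger, x_test, y_test) -> bool:
    """Return True if the challenger should be registered for deployment."""
    champion_accuracy = accuracy_score(y_test, champion.predict(x_test))
    challenger_accuracy = accuracy_score(y_test, challenger.predict(x_test))

    beats_champion = challenger_accuracy > champion_accuracy
    meets_floor = challenger_accuracy >= MIN_TEST_ACCURACY
    return beats_champion and meets_floor

# In a continuous training pipeline, this gate would run before the model
# registration step, so that only blessed challengers reach the registry.
```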

During continuous training, you can validate the model against both optimizing metrics and satisficing metrics. Alternatively, you might validate the model only against the optimizing metrics and defer validating against the satisficing metric until the model deployment phase. If you plan to deploy variations of the same model to different serving environments or workloads, it can be more suitable to defer validation against the satisficing metric. Different serving environments or workloads (such as cloud environments versus on-device environments, or real-time environments versus batch serving environments) might require different satisficing metric thresholds. If you're deploying to multiple environments, your continuous training pipeline might train two or more models, where each model is optimized for its target deployment environment. For more information and an example, see Dual deployments on Vertex AI.

As you put more continuous-training pipelines with complex workflows into production, you must track the metadata and the artifacts that the pipeline runs produce. Tracking this information helps you trace and debug any issue that might arise in production. Tracking the information also helps you reproduce the outputs of the pipelines so that you can improve their implementation in subsequent ML development iterations.

Guidelines for tracking ML metadata and artifacts

  • Track lineage of the source code, deployed pipelines, components of the pipelines, pipeline runs, the dataset in use, and the produced artifacts.
  • Track the hyperparameters and the configurations of the pipeline runs.
  • Track key inputs and output artifacts of the pipeline steps, like dataset statistics, dataset anomalies (if any), transformed data and schemas, model checkpoints, and model evaluation results. A sketch of a minimal run record follows this list.
  • Track that conditional pipeline steps run in response to the conditions, and ensure observability by adding alerting mechanisms in case key steps don't run or if they fail.
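
As a minimal sketch, the following writes one JSON record per pipeline run; the field names and the local file destination are hypothetical stand-ins for whatever metadata store or ML metadata service your platform provides.

```python
import hashlib
import json
import time
from pathlib import Path

def record_run(run_id: str, params: dict, dataset_path: str,
               metrics: dict, artifact_uris: dict,
               store: Path = Path("run_metadata")) -> Path:
    """Append a reproducibility record for one training pipeline run."""
    # Fingerprint the dataset file so the run can be traced back to its exact input.
    dataset_fingerprint = hashlib.sha256(
        Path(dataset_path).read_bytes()).hexdigest()

    record = {
        "run_id": run_id,
        "finished_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "hyperparameters": params,          # for example, learning rate, batch size
        "dataset_fingerprint": dataset_fingerprint,
        "evaluation_metrics": metrics,      # for example, {"accuracy": 0.91}
        "artifacts": artifact_uris,         # for example, {"model": "gs://.../model"}
    }

    store.mkdir(exist_ok=True)
    out_path = store / f"{run_id}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```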

Quality guidelines for model deployment

Assume that you have a trained model that's been validated from an optimizing metrics perspective, and that the model is approved from a model governance perspective (as described later in the model governance section). The model is stored in the model registry and is ready to be deployed to production. At this point, you need to implement a set of tests to verify that the model is fit to serve in its target environment. You also need to automate these tests in a model CI/CD routine.

Guidelines

  • Verify that the model artifact can be loaded and invoked successfully with its runtime dependencies. You can perform this verification by staging the model in a sandboxed version of the serving environment. This verification helps you make sure that the operations and binaries that are used by the model are present in the environment.
  • Validate satisficing metrics of the model (if any) in a staging environment, like model size and latency.
  • Unit-test the model-artifact-serving interfaces in a staging environment against raw inputs and against expected outputs.
  • Unit-test the model artifact in a staging environment for a set of typical and edge cases of prediction requests. For example, unit-test for a request instance where all features are set to None.
  • Smoke-test the model service API after it's been deployed to its target environment. To perform this test, send a single instance or a batch of instances to the model service and validate the service response. A sketch of such a smoke test follows this list.
  • Canary-test the newly deployed model version on a small stream of live serving data. This test makes sure that the new model service doesn't produce errors before the model is exposed to a large number of users.
  • Test in a staging environment that you can roll back to a previous serving model version quickly and safely.
  • Perform online experimentation to test the newly trained model using a small subset of the serving population. This test measures the performance of the new model compared to the current one. After you compare the new model's performance to the performance of the current model, you might decide to fully release the new model to serve all of your live prediction requests. Online experimentation techniques include A/B testing and Multi-Armed Bandit (MAB).
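
A smoke test can be a short script that sends one instance to the deployed endpoint and asserts the shape of the response, as in the following sketch. The endpoint URL, the instance payload, and the response key are hypothetical; adapt them to your serving API's actual contract.

```python
import requests

# Hypothetical endpoint and payload format for an already deployed model service.
ENDPOINT = "https://ml-serving.example.com/v1/models/recommender:predict"
SMOKE_INSTANCE = {"instances": [{"last_viewed_product_code": "SKU-12345", "country": "DE"}]}

def smoke_test(timeout_seconds: float = 5.0) -> None:
    response = requests.post(ENDPOINT, json=SMOKE_INSTANCE, timeout=timeout_seconds)

    # The service must answer successfully and return one prediction per instance.
    response.raise_for_status()
    body = response.json()
    assert "predictions" in body, f"Unexpected response body: {body}"
    assert len(body["predictions"]) == len(SMOKE_INSTANCE["instances"])

if __name__ == "__main__":
    smoke_test()
    print("Smoke test passed.")
```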

Quality guidelines for model serving

The predictive performance of the ML models that are deployed and are serving in production usually degrades over time. This degradation can be due to inconsistencies that have been introduced between the serving features and the features that are expected by the model. These inconsistencies are called training-serving skew. For example, a recommendation model might be expecting an alphanumeric input value for a feature like a most-recently-viewed product code. But instead, the product name rather than the product code is passed during serving, due to an update to the application that's consuming the model service.

In addition, the model can go stale as the statistical properties of the serving data drift over time, and the patterns that were learned by the currently deployed model are no longer accurate. In both cases, the model can no longer provide accurate predictions.

To avoid this degradation of the model's predictive performance, you must perform continuous monitoring of the model's effectiveness. Monitoring lets you regularly and proactively verify that the model's performance doesn't degrade.

Guidelines

  • Log a sample of the serving request-response payloads in a data store for regular analysis. The request is the input instance, and the response is the prediction that's produced by the model for that data instance.
  • Implement an automated process that profiles the stored request-response data by computing descriptive statistics. Compute and store these serving statistics at regular intervals.
  • Identify training-serving skew that's caused by data shift and drift by comparing the serving data statistics to the baseline statistics of the training data. In addition, analyze how the serving data statistics change over time. A sketch of one such comparison follows this list.
  • Identify concept drift by analyzing how feature attributions for the predictions change over time.
  • Identify serving data instances that are considered outliers with respect to the training data. To find these outliers, use novelty detection techniques and track how the percentage of outliers in the serving data changes over time.
  • Set alerts for when the model reaches skew-score thresholds on the key predictive features in your dataset.
  • If labels are available (that is, ground truth), join the true labels with the predicted labels of the serving instances to perform continuous evaluation. This approach is similar to the evaluation system that you implement as A/B testing during online experimentation. Continuous evaluation identifies not only the predictive power of your model in production, but also the types of requests that it handles well and the types that it handles poorly.
  • Set objectives for system metrics that are important to you, and measure the performance of the models according to those objectives.
  • Monitor service efficiency to make sure that your model can serve in production at scale. This monitoring also helps you predict and manage capacity planning, and it helps you estimate the cost of your serving infrastructure. Monitor efficiency metrics, including CPU utilization, GPU utilization, memory utilization, service latency, throughput, and error rate.
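
One common way to compare serving statistics to a training baseline is the population stability index (PSI) over a numerical feature, as in the minimal sketch below; the bin count and the roughly 0.2 alerting threshold are conventional but hypothetical choices, not fixed rules.

```python
import numpy as np

def population_stability_index(baseline: np.ndarray,
                               serving: np.ndarray,
                               bins: int = 10) -> float:
    """Measure how much the serving distribution has shifted from the baseline."""
    # Bin edges are derived from the training (baseline) distribution.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline_counts, _ = np.histogram(baseline, bins=edges)
    serving_counts, _ = np.histogram(serving, bins=edges)

    # Convert counts to proportions; a small epsilon avoids division by zero.
    eps = 1e-6
    baseline_fraction = baseline_counts / max(baseline_counts.sum(), 1) + eps
    serving_fraction = serving_counts / max(serving_counts.sum(), 1) + eps

    return float(np.sum(
        (serving_fraction - baseline_fraction)
        * np.log(serving_fraction / baseline_fraction)))

# Illustrative usage: values above ~0.2 are often treated as significant drift.
rng = np.random.default_rng(seed=3)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
serving_feature = rng.normal(loc=0.5, scale=1.2, size=2_000)
print(population_stability_index(training_feature, serving_feature))
```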

Model governance

Model governance is a core function in companies that provides guidelines and processes to help employees implement the company's AI principles. These principles can include avoiding models that create or enforce bias, and being able to justify AI-made decisions. The model governance function makes sure that there is a human in the loop. Having human review is particularly important for sensitive and high-impact workloads (often user-facing ones). Workloads like this can include scoring credit risk, ranking job candidates, approving insurance policies, and propagating information on social media.

Guidelines

  • Have a responsibility assignment matrix for each model by task. The matrix should consider cross-functional teams (lines of business, data engineering, data science, ML engineering, risk and compliance, and so on) along the entire organization hierarchy.
  • Maintain model documentation and reporting in the model registry that's linked to a model's version—for example, by using model cards. Such metadata includes information about the data that was used to train the model, about model performance, and about any known limitations. A sketch of a minimal model card record follows this list.
  • Implement a review process for the model before you approve it for deployment in production. In this type of process, you keep versions of the model's checklist, supplementary documentation, and any additional information that stakeholders might request.
  • Evaluate the model on benchmark datasets (also known as golden datasets), which cover both standard cases and edge cases. In addition, validate the model against fairness indicators to help detect implicit bias.
  • Explain the model's predictive behavior to the model's users, both as a whole and for specific sample input instances. Providing this information helps users understand important features and possible undesirable behavior of the model.
  • Analyze the model's predictive behavior using what-if analysis tools to understand the importance of different data features. This analysis can also help you visualize model behavior across multiple models and subsets of input data.
  • Test the model against adversarial attacks to help make sure that the model is robust against exploitation in production.
  • Track alerts on the predictive performance of models that are in production, on dataset shifts, and on drift. Configure the alerts to notify model stakeholders.
  • Manage online experimentation, rollout, and rollback of the models.
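
As a minimal sketch of the kind of model card metadata you might attach to a registered model version, consider the following; the fields are a small, hypothetical subset of what tooling such as the Model Card Toolkit captures, and all of the values are placeholders.

```python
import json

# Hypothetical model card record stored alongside a model version in the registry.
model_card = {
    "model_name": "churn-classifier",
    "model_version": "1.4.0",
    "overview": "Predicts the probability that a customer cancels within 30 days.",
    "training_data": {
        "source": "warehouse.customer_events",      # placeholder dataset reference
        "date_range": "2023-01-01/2023-12-31",
        "known_limitations": ["Underrepresents customers acquired in the last month."],
    },
    "evaluation": {
        "test_split": "holdout_2024_q1",
        "metrics": {"roc_auc": 0.87, "accuracy": 0.81},        # illustrative values
        "sliced_metrics": {"country=BR": {"roc_auc": 0.83}},
    },
    "ethical_considerations": "Reviewed for demographic bias; see fairness report v3.",
    "owners": ["ml-platform-team@example.com"],
}

print(json.dumps(model_card, indent=2))
```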


Contributors

Author: Mike Styer | Generative AI Solution Architect

Other contributor: Amanda Brinhosa | Customer Engineer
