Deploy and operate generative AI applications
Generative AI has introduced a new way to build and operate AI applications that is different from predictive AI. To build a generative AI application, you must choose from a diverse range of architectures and sizes, curate data, engineer optimal prompts, tune models for specific tasks, and ground model outputs in real-world data.
This document describes how you can adapt DevOps and MLOps processes to develop, deploy, and operate generative AI applications on existing foundation models. For information on deploying predictive AI, see MLOps: Continuous delivery and automation pipelines in machine learning.
What are DevOps and MLOps?
DevOps is a software engineering methodology that connects development and operations. DevOps promotes collaboration, automation, and continuous improvement to streamline the software development lifecycle, using practices such as continuous integration and continuous delivery (CI/CD).
MLOps builds on DevOps principles to address the challenges of building and operating machine learning (ML) systems. Machine learning systems typically use predictive AI to identify patterns and make predictions. The MLOps workflow includes the following:
- Data validation
- Model training
- Model evaluation and iteration
- Model deployment and serving
- Model monitoring
What are foundation models?
Foundation models are the core component in a generative AI application. These models are large programs that use datasets to learn and make decisions without human intervention. Foundation models are trained on many types of data, including text, images, audio, and video. Foundation models include large language models (LLMs) such as Llama 3.1 and multimodal models such as Gemini.
Unlike predictive AI models, which are trained for specific tasks on focused datasets, foundation models are trained on massive and diverse datasets. This training lets you use foundation models to develop applications for many different use cases. Foundation models have emergent properties (PDF), which let them provide responses to specific inputs without explicit training. Because of these emergent properties, foundation models are challenging to create and operate and require you to adapt your DevOps and MLOps processes.
Developing a foundation model requires substantial data resources, specialized hardware, significant investment, and specialized expertise. Therefore, many businesses prefer to use existing foundation models to simplify the development and deployment of their generative AI applications.
Lifecycle of a generative AI application
The lifecycle for a generative AI application includes the following phases:
- Discovery: Developers and AI engineers identify which foundation model is most suitable for their use case. They consider each model's strengths, weaknesses, and costs to make an informed decision.
- Development and experimentation: Developers use prompt engineering to create and refine input prompts to get the required output. When available, few-shot learning, parameter-efficient fine-tuning (PEFT), and model chaining help guide model behavior. Model chaining refers to orchestrating calls to multiple models in a specific sequence to create a workflow.
- Deployment: Developers must manage many artifacts in the deployment process, including prompt templates, chain definitions, embedded models, retrieval data stores, and fine-tuned model adapters. These artifacts have their own governance requirements and require careful management throughout development and deployment. Generative AI application deployment also must account for the technical capabilities of the target infrastructure, ensuring that application hardware requirements are met.
- Continuous monitoring in production: Administrators improve application performance and maintain safety standards through responsible AI techniques, such as ensuring fairness, transparency, and accountability in the model's outputs.
- Continuous improvement: Developers constantly adjust foundation models through prompting techniques, swapping the models out for newer versions, or even combining multiple models for enhanced performance, cost efficiency, or reduced latency. Conventional continuous training still holds relevance for scenarios that require recurrent fine-tuning or human feedback loops.
Data engineering practices have a critical role across all development stages. To create reliable outputs, you must have factual grounding (which ensures that the model's outputs are based on accurate and up-to-date information) and recent data from internal and enterprise systems. Tuning data helps adapt models to specific tasks and styles, and rectifies persistent errors.
Find the foundation model for your use case
Because building foundation models is resource-intensive, most businesses prefer to use an existing foundation model that is optimal for their use case. Finding the right foundation model is difficult because there are many foundation models. Each model has different architectures, sizes, training datasets, and licenses. In addition, each use case presents unique requirements, demanding that you analyze available models across multiple dimensions.
Consider the following factors when you assess models:
- Quality: Run test prompts to gauge output quality.
- Latency and throughput: Determine the correct latency and throughput that your use case requires, as these factors directly impact user experience. For example, a chatbot requires lower latency than batch-processed summarization tasks.
- Development and maintenance time: Consider the time investment for initial development and ongoing maintenance. Managed models often require less effort than openly available models that you deploy yourself.
- Usage cost: Consider the infrastructure and consumption costs that are associated with the model.
- Compliance: Assess the model's ability to adhere to relevant regulations and licensing terms.
Develop and experiment
When building generative AI applications, development and experimentation are iterative and orchestrated. Each experimental iteration involves refining data, adapting the foundation model, and evaluating results. Evaluation provides feedback that guides subsequent iterations in a continuous feedback loop. If performance doesn't match expectations, you can gather more data, augment the data, or further curate the data. In addition, you might need to optimize prompts, apply fine-tuning techniques, or change to another foundation model. This iterative refinement cycle, driven by evaluation insights, is just as important for optimizing generative AI applications as it is for machine learning and predictive AI.
The foundation model paradigm
Foundation models differ from predictive models because they are multi-purpose models. Instead of being trained for a single purpose on data specific to that task, foundation models are trained on broad datasets, which lets you apply a foundation model to many different use cases.
Foundation models are also highly sensitive to changes in their input. The output of the model and the task that it performs are determined by the input to the model. A foundation model can translate text, generate videos, or classify data simply by changing the input. Even insignificant changes to the input can affect the model's ability to correctly perform that task.
These properties of foundation models require different development and operational practices. Although models in the predictive AI context are self-sufficient and task-specific, foundation models are multi-purpose and need an additional element beyond the user input. Generative AI models require a prompt, and more specifically, a prompt template. A prompt template is a set of instructions and examples along with placeholders to accommodate user input. The application can combine the prompt template and the dynamic data (such as the user input) to create a complete prompt, which is the text that is passed as input to the foundation model.
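For example, a prompt template can be stored as a plain string with placeholders that the application fills in at request time. The following minimal sketch assumes a hypothetical translation use case; the template text, variable names, and helper function are illustrative and not part of any specific SDK.

```python
# A prompt template: fixed instructions and examples plus placeholders.
TRANSLATION_TEMPLATE = """You are a translation assistant.
Translate the following sentence from English to French.

Example:
English: Good morning.
French: Bonjour.

English: {user_sentence}
French:"""

def build_prompt(user_sentence: str) -> str:
    """Combine the static template with dynamic user input to form the complete prompt."""
    return TRANSLATION_TEMPLATE.format(user_sentence=user_sentence)

# The resulting string is what the application passes to the foundation model.
prompt = build_prompt("The weather is nice today.")
print(prompt)
```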
The prompted model component
The presence of the prompt is a distinguishing feature of generative AI applications. Neither the model nor the prompt alone is sufficient to generate content; generative AI needs both. The combination of the model and the prompt is known as the prompted model component. The prompted model component is the smallest independent component that is sufficient to create a generative AI application. The prompt doesn't need to be complicated. For example, it can be a simple instruction, such as "translate the following sentence from English to French", followed by the sentence to be translated. However, without that preliminary instruction, a foundation model won't perform the required translation task. So a prompt, even just a basic instruction, is necessary along with the input to get the foundation model to do the task required by the application.
The prompted model component creates an important distinction for MLOps practices when developing generative AI applications. In the development of a generative AI application, experimentation and iteration must be done in the context of a prompted model component. The generative AI experimentation cycle typically begins with testing variations of the prompt (changing the wording of the instructions, providing additional context, or including relevant examples) and evaluating the impact of those changes. This practice is commonly referred to as prompt engineering.
Prompt engineering involves the following iterative steps:
- Prompting: Craft and refine prompts to elicit desired behaviors from a foundation model for a specific use case.
- Evaluation: Assess the model's outputs, ideally programmatically, to gauge its understanding and success in fulfilling the prompt's instructions.
To track evaluation results, you can optionally register the results of an experiment. Because the prompt itself is a core element of the prompt engineering process, it becomes the most important artifact among the artifacts that are part of the experiment.
However, to experiment with a generative AI application, you must identify the artifact types. In predictive AI, data, pipelines, and code are distinct, well-defined artifacts. But with the prompt paradigm in generative AI, prompts can include context, instructions, examples, guardrails, and actual internal or external data pulled from other sources.
To determine the artifact type, you must recognize that a prompt has different components and requires different management strategies. Consider the following:
- Prompt as data: Some parts of the prompt act just like data. Elements like few-shot examples, knowledge bases, and user queries are essentially data points. These components require data-centric MLOps practices such as data validation, drift detection, and lifecycle management.
- Prompt as code: Other components such as context, prompt templates, and guardrails are similar to code. These components define the structure and rules of the prompt itself and require more code-centric practices such as approval processes, code versioning, and testing.
As a result, when you apply MLOps practices to generative AI, you must have processes that give developers an easy way to store, retrieve, track, and modify prompts. These processes allow for fast iteration and principled experimentation. Often one version of a prompt can work well with a specific version of the model and not as well with a different version. When you track the results of an experiment, you must record the prompt, the components' versions, the model version, metrics, and output data.
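To make this concrete, an experiment record can be captured as structured data alongside each run. The following sketch shows one possible shape for such a record in plain Python; the field names and the append-to-file storage are illustrative assumptions, not part of a specific experiment-tracking product.

```python
import json
import time
import uuid

def log_experiment(prompt_template_version: str, chain_version: str,
                   model_name: str, model_version: str,
                   prompt_text: str, output_text: str, metrics: dict) -> dict:
    """Record everything needed to reproduce and compare a prompt experiment."""
    record = {
        "experiment_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        # Prompt-as-code artifacts: versioned like source code.
        "prompt_template_version": prompt_template_version,
        "chain_version": chain_version,
        # The model that the prompt was evaluated against.
        "model": {"name": model_name, "version": model_version},
        # Prompt-as-data artifacts: the concrete input and output of this run.
        "prompt_text": prompt_text,
        "output_text": output_text,
        # Evaluation metrics for this prompt and model combination.
        "metrics": metrics,
    }
    # Append to a JSON Lines file; a metadata store or experiment tracker
    # could be used instead.
    with open("experiments.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```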
Model chaining and augmentation
Generative AI models, particularly large language models (LLMs), face inherent challenges in maintaining recency and avoiding hallucinations. Encoding new information into LLMs requires expensive and data-intensive pre-training before they can be deployed. Depending on the use case, using only one prompted model to perform a particular generation might not be sufficient. To solve this issue, you can connect several prompted models together, along with calls to external APIs and logic expressed as code. A sequence of prompted model components connected together in this way is commonly known as a chain.
The following diagram shows the components of a chain and the corresponding development process.
Mitigation for recency and hallucination
Two common chain-based patterns that can mitigate recency and hallucinations are retrieval-augmented generation (RAG) (PDF) and agents.
- RAG augments pre-trained models with knowledge retrieved from databases, which bypasses the need for pre-training. RAG enables grounding and reduces hallucinations by incorporating up-to-date factual information directly into the generation process.
- Agents, popularized by the ReAct prompting technique (PDF), use LLMs as mediators that interact with various tools, including RAG systems, internal or external APIs, custom extensions, or even other agents. Agents enable complex queries and real-time actions by dynamically selecting and using relevant information sources. The LLM, acting as an agent, interprets the user's query, decides which tool to use, and formulates the response based on the retrieved information.
You can use RAG and agents to create multi-agent systems that are connected to large information networks, enabling sophisticated query handling and real-time decision making.
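As an illustration, a minimal RAG-style chain can be expressed as a few composed steps: retrieve context, build a grounded prompt, and call the model. In the sketch below, `retrieve_documents` and `call_foundation_model` are hypothetical placeholders for a vector-database query and a model API call; they stand in for whatever clients your application uses.

```python
from typing import List

def retrieve_documents(query: str, top_k: int = 3) -> List[str]:
    """Placeholder retrieval step: a real chain would query a vector database."""
    return ["Example passage 1 about the topic.", "Example passage 2 about the topic."]

def call_foundation_model(prompt: str) -> str:
    """Placeholder model call: a real chain would call the model provider's API."""
    return "A grounded answer based on the retrieved context."

GROUNDED_TEMPLATE = """Answer the question using only the context below.
If the context is insufficient, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def rag_chain(question: str) -> str:
    """A two-step chain: retrieval followed by a prompted model component."""
    passages = retrieve_documents(question)
    prompt = GROUNDED_TEMPLATE.format(context="\n".join(passages), question=question)
    return call_foundation_model(prompt)

print(rag_chain("What data does the quarterly report cover?"))
```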
The orchestration of different models, logic, and APIs is not new to generative AI applications. For example, recommendation engines combine collaborative filtering models, content-based models, and business rules to generate personalized product recommendations for users. Similarly, in fraud detection, machine learning models are integrated with rule-based systems and external data sources to identify suspicious activities.
What makes these chains of generative AI components different is that you can't characterize the distribution of component inputs beforehand, which makes the individual components much harder to evaluate and maintain in isolation. Orchestration causes a paradigm shift in how you develop AI applications for generative AI.
In predictive AI, you can iterate on the separate models and components in isolation and then chain them in the AI application. In generative AI, you develop a chain during integration, perform experimentation on the chain end-to-end, and iterate chaining strategies, prompts, foundation models, and other APIs in a coordinated manner to achieve a specific goal. You often don't need feature engineering, data collection, or further model training cycles; just changes to the wording of the prompt template.
The shift towards MLOps for generative AI, in contrast to MLOps for predictive AI, results in the following differences:
- Evaluation: Because of the tight coupling of chains, chains require end-to-end evaluation, not just evaluation of each component, to gauge their overall performance and the quality of their output. In terms of evaluation techniques and metrics, evaluating chains is similar to evaluating prompted models.
- Versioning: You must manage a chain as a complete artifact in its entirety. You must track the chain configuration with its own revision history for analysis, for reproducibility, and to understand the effects of changes on output. Your logs must include the inputs, outputs, intermediate states of the chain, and any chain configurations that were used during each execution (see the sketch after this list).
- Continuous monitoring: To detect performance degradation, data drift, or unexpected behavior in the chain, you must configure proactive monitoring systems. Continuous monitoring helps to ensure early identification of potential issues to maintain the quality of the generated output.
- Introspection: You must inspect the internal data flows of a chain (that is, the inputs and outputs from each component) as well as the inputs and outputs of the entire chain. By providing visibility into the data that flows through the chain and the resulting content, developers can pinpoint the sources of errors, biases, or undesirable behavior.
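One simple way to capture that lineage is to wrap each chain step so that its inputs, outputs, and configuration are recorded per execution. The sketch below is a minimal illustration in plain Python; the class and field names are hypothetical and would typically be replaced by your logging or tracing framework.

```python
import json
import time
import uuid
from typing import Any, Callable, Dict, List

class ChainTracer:
    """Collects inputs, outputs, and intermediate states for one chain execution."""

    def __init__(self, chain_config: Dict[str, Any]):
        self.execution_id = str(uuid.uuid4())
        self.chain_config = chain_config  # for example, prompt template and model versions
        self.steps: List[Dict[str, Any]] = []

    def run_step(self, name: str, fn: Callable[[Any], Any], step_input: Any) -> Any:
        """Execute one component and record its input, output, and latency."""
        start = time.time()
        step_output = fn(step_input)
        self.steps.append({
            "component": name,
            "input": step_input,
            "output": step_output,
            "latency_s": round(time.time() - start, 3),
        })
        return step_output

    def export(self) -> str:
        """Serialize the full trace for storage and later introspection."""
        return json.dumps({
            "execution_id": self.execution_id,
            "chain_config": self.chain_config,
            "steps": self.steps,
        })
```

In practice, each retrieval call, prompt-construction step, and model call in the chain would run through `run_step`, and the exported trace would be stored alongside the experiment or monitoring records.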
The following diagram shows how chains, prompted model components, and model tuning work together in a generative AI application to reduce recency and hallucinations. Data is curated, models are tuned, and chains are added to further refine responses. After the results are evaluated, developers can log the experiment and continue to iterate.
Fine-tuning
When you are developing a generative AI use case that involves foundation models, it can be difficult, especially for complex tasks, to rely on only prompt engineering and chaining to solve the use case. To improve task performance, developers often need to fine-tune the model directly. Fine-tuning lets you change all the layers or a subset of layers (parameter-efficient fine-tuning) of the model to optimize its ability to perform a certain task. The most common ways of tuning a model are the following:
- Supervised fine-tuning: You train the model in a supervised manner, teaching it to predict the right output sequence for a given input. (A parameter-efficient tuning sketch follows this list.)
- Reinforcement learning from human feedback (RLHF): You train a reward model to predict what humans would prefer as a response. Then, you use this reward model to nudge the LLM in the right direction during the tuning process. This process is similar to having a panel of human judges guide the model's learning.
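As one possible illustration of parameter-efficient supervised fine-tuning, the sketch below uses the open-source Hugging Face `transformers`, `peft`, and `datasets` libraries to attach LoRA adapters to a causal language model and train them on input-output pairs. It assumes those libraries are installed; the base model name, target modules, and the one-example dataset are placeholders, and this is a sketch of one approach rather than a prescribed recipe.

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model_name = "your-base-model"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Attach small trainable LoRA adapters instead of updating all model weights.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],  # depends on the model architecture
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# A tiny illustrative dataset of input/output pairs; real tuning data would be much larger.
examples = {"text": ["Question: What is MLOps?\nAnswer: MLOps applies DevOps practices to ML systems."]}
dataset = Dataset.from_dict(examples).map(
    lambda row: tokenizer(row["text"], truncation=True),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-adapter", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-adapter")  # version this adapter artifact like any other asset
```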
The following diagram shows how tuning helps refine the model during the experimentation cycle.
In MLOps, fine-tuning shares the following capabilities with model training:
- The ability to track the artifacts that are part of the tuning job. For example, artifacts include the input data or the parameters being used to tune the model.
- The ability to measure the impact of the tuning. This capability lets you evaluate the tuned model for the specific tasks that it was trained on and to compare results with previously tuned models or frozen models for the same task.
Continuous training and tuning
In MLOps, continuous training is the practice of repeatedly retraining machine learning models in a production environment. Continuous training helps to ensure that the model remains up-to-date and performs well as real-world data patterns change over time. For generative AI models, continuous tuning of the models is often more practical than a retraining process because of the high data and computational costs involved.
The approach to continuous tuning depends on your specific use case and goals. For relatively static tasks like text summarization, the continuous tuning requirements might be lower. But for dynamic applications like chatbots that need constant human alignment, more frequent tuning using techniques like RLHF that are based on human feedback is necessary.
To determine the right continuous tuning strategy, you must evaluate the nature of your use case and how the input data evolves over time. Cost is also a major consideration, because compute infrastructure greatly affects the speed and expense of tuning. Graphics processing units (GPUs) and tensor processing units (TPUs) are the hardware that fine-tuning requires. GPUs, known for their parallel processing power, are highly effective at handling computationally intensive workloads and are often associated with training and running complex machine learning models. TPUs, on the other hand, are specifically designed by Google for accelerating machine learning tasks. TPUs excel at handling the large matrix operations that are common in deep learning neural networks.
Data practices
Previously, ML model behavior was dictated solely by its training data. While this still holds true for foundation models, the model behavior for generative AI applications that are built on top of foundation models is determined by how you adapt the model with different types of input data.
Foundation models are trained on data such as the following:
- Pretraining datasets (for example, C4, The Pile, or proprietary data)
- Instruction tuning datasets
- Safety tuning datasets
- Human preference data
Generative AI applications are adapted on data such as the following:
- Prompts
- Augmented or grounded data (for example, websites, documents, PDFs, databases, or APIs)
- Task-specific data for PEFT
- Task-specific evaluations
- Human preference data
The main difference for data practices between predictive ML and generative AI is at the beginning of the lifecycle process. In predictive ML, you spend a lot of time on data engineering, and if you don't have the right data, you cannot build an application. In generative AI, you start with a foundation model, some instructions, and maybe a few example inputs (such as in-context learning). You can prototype and launch an application with very little data.
The ease of prototyping, however, comes with the additional challenge of managing diverse data. Predictive AI relies on well-defined datasets. In generative AI, a single application can use various data types, from completely different data sources, all working together.
Consider the following data types:
- Conditioning prompts: Instructions given to the foundation model to guide its output and set boundaries of what it can generate.
- Few-shot examples: A way to show the model what you want to achieve through input-output pairs. These examples help the model understand the specific tasks, and in many cases, these examples can boost performance.
- Grounding or augmentation data: The data that permits the foundation model to produce answers for a specific context and keep responses current and relevant without retraining the entire foundation model. This data can come from external APIs (like Google Search) or internal APIs and data sources.
- Task-specific datasets: The datasets that help fine-tune an existing foundation model for a particular task, improving its performance in that specific area.
- Full pre-training datasets: The massive datasets that are used to initially train foundation models. Although application developers might not have access to them or the tokenizers, the information encoded in the model itself influences the application's output and performance.
This diverse range of data types adds a layer of complexity in terms of data organization, tracking, and lifecycle management. For example, a RAG-based application can rewrite user queries, dynamically gather relevant examples using a curated set of examples, query a vector database, and combine the information with a prompt template. A RAG-based application requires you to manage multiple data types, including user queries, vector databases with curated few-shot examples and company information, and prompt templates.
Each data type needs careful organization and maintenance. For example, a vector database requires processing data into embeddings, optimizing chunking strategies, and ensuring that only relevant information is available. A prompt template needs versioning and tracking, and user queries need rewriting. MLOps and DevOps best practices can help with these tasks. In predictive AI, you create data pipelines for extraction, transformation, and loading. In generative AI, you build pipelines to manage, evolve, adapt, and integrate different data types in a versionable, trackable, and reproducible way.
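For example, the ingestion side of a vector database can be a small, versionable pipeline that chunks documents and turns each chunk into an embedding. The sketch below is illustrative only: `embed_text` is a hypothetical placeholder for whichever embedding model or API you use, and the fixed-size chunking strategy is just one of many possible options.

```python
from typing import Dict, List

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split a document into overlapping character chunks for retrieval."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_text(chunk: str) -> List[float]:
    """Hypothetical embedding call; replace with your embedding model or API."""
    # Placeholder vector so the sketch runs end to end.
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 1000)]

def ingest_document(doc_id: str, text: str, pipeline_version: str) -> List[Dict]:
    """Produce versioned records that a vector database could index."""
    return [
        {
            "doc_id": doc_id,
            "chunk_index": i,
            "text": chunk,
            "embedding": embed_text(chunk),
            # Track which chunking and embedding logic produced this record.
            "pipeline_version": pipeline_version,
        }
        for i, chunk in enumerate(chunk_document(text))
    ]
```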
Fine-tuning foundation models can boost generative AI application performance, but the models need data. You can get this data by launching your application and gathering real-world data, generating synthetic data, or a mix of both. Using large models to generate synthetic data is becoming popular because this method speeds up the deployment process, but it's still important to have humans check the results for quality assurance. The following are examples of how you can use large models for data engineering purposes (a sketch of synthetic data generation follows the list):
- Synthetic data generation: This process involves creating artificial data that closely resembles real-world data in terms of its characteristics and statistical properties. Large and capable models often complete this task. Synthetic data serves as additional training data for generative AI, enabling it to learn patterns and relationships even when labeled real-world data is scarce.
- Synthetic data correction: This technique focuses on identifying and correcting errors and inconsistencies within existing labeled datasets. By using the power of larger models, generative AI can flag potential labeling mistakes and propose corrections to improve the quality and reliability of the training data.
- Synthetic data augmentation: This approach goes beyond generating new data. Synthetic data augmentation involves intelligently manipulating existing data to create diverse variations while preserving essential features and relationships. Generative AI can encounter a broader range of scenarios than predictive AI during training, which leads to improved generalization and the ability to generate nuanced and relevant outputs.
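A lightweight pattern for the first technique is to prompt a large model to produce labeled examples and then route them through a human review step. The sketch below is hypothetical: `call_large_model` stands in for whichever model API you use, and the prompt text and output parsing are deliberately simplified.

```python
import json
from typing import Dict, List

def call_large_model(prompt: str) -> str:
    """Hypothetical model call; replace with your model provider's SDK."""
    # Canned response so the sketch runs; a real model would generate this JSON.
    return json.dumps([{"question": "What is a prompt template?",
                        "answer": "A reusable set of instructions with placeholders."}])

SYNTHESIS_PROMPT = """Generate {n} question-answer pairs about {topic}.
Return them as a JSON list of objects with "question" and "answer" fields."""

def generate_synthetic_examples(topic: str, n: int = 5) -> List[Dict[str, str]]:
    """Use a large model to draft training examples for later human review."""
    raw = call_large_model(SYNTHESIS_PROMPT.format(n=n, topic=topic))
    examples = json.loads(raw)
    # Mark every generated example as unreviewed; humans approve each one before
    # it is added to tuning or evaluation datasets.
    return [{**example, "reviewed": False} for example in examples]

print(generate_synthetic_examples("prompt engineering"))
```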
Unlike predictive AI applications, generative AI applications are difficult to evaluate. For example, you might not know the training data distribution of the foundation models. You must build a custom evaluation dataset that reflects all your use cases, including the essential, average, and edge cases. Similar to fine-tuning data, you can use powerful LLMs to generate, curate, and augment data for building robust evaluation datasets.
Evaluation
The evaluation process is a core activity of the development of generative AI applications. Evaluation might have different degrees of automation: from entirely driven by humans to entirely automated by a process.
When you're prototyping a project, evaluation is often a manual process. Developers review the model's outputs, getting a qualitative sense of how it's performing. But as the project matures and the number of test cases increases, manual evaluation becomes a bottleneck.
Automating evaluation has two big benefits: it lets you move faster and makes evaluations more reliable. It also takes human subjectivity out of the equation, which helps ensure that the results are reproducible.
But automating evaluation for generative AI applications comes with its own set of challenges. For example, consider the following:
- Both the inputs (prompts) and outputs can be incredibly complex. A single prompt might include multiple instructions and constraints that the model must manage. The outputs themselves are often high-dimensional, such as a generated image or a block of text. Capturing the quality of these outputs in a simple metric is difficult. Some established metrics, like BLEU for translations and ROUGE for summaries, aren't always sufficient. Therefore, you can use custom evaluation methods or another foundation model to evaluate your system. For example, you could prompt a large language model (such as AutoSxS) to score the quality of generated texts across various dimensions.
- Many evaluation metrics for generative AI are subjective. What makes one output better than another can be a matter of opinion. You must make sure that your automated evaluation aligns with human judgment because you want your metrics to be a reliable proxy of what people would think. To ensure comparability between experiments, you must determine your evaluation approach and metrics early in the development process.
- Lack of ground truth data, especially in the early stages of a project. One workaround is to generate synthetic data to serve as a temporary ground truth that you can refine over time with human feedback.
- Comprehensive evaluation is essential for safeguarding generative AI applications against adversarial attacks. Malicious actors can craft prompts to try to extract sensitive information or manipulate the model's outputs. Evaluation sets need to specifically address these attack vectors, through techniques like prompt fuzzing (feeding the model random variations on prompts) and testing for information leakage.
To evaluate generative AI applications, implement the following:
- Automate the evaluation process to help ensure speed, scalability, and reproducibility. You can consider automation as a proxy for human judgment. (A minimal automated-evaluation sketch follows this list.)
- Customize the evaluation process as required for your use cases.
- To ensure comparability, stabilize the evaluation approach, metrics, and ground truth data as early as possible in the development phase.
- Generate synthetic ground truth data to compensate for the lack of real ground truth data.
- Include test cases of adversarial prompting as part of the evaluation set to test the reliability of the system itself against these attacks.
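To illustrate automated evaluation, the sketch below runs a set of evaluation cases through the application and asks a second model to score each response against a reference answer. Both `generate_response` and `call_judge_model` are hypothetical placeholders for your application chain and whichever judge model you use; the rubric and score parsing are deliberately simplified.

```python
from statistics import mean
from typing import Dict, List

def generate_response(prompt: str) -> str:
    """Hypothetical call into the application chain under evaluation."""
    return "Candidate response for: " + prompt

def call_judge_model(judge_prompt: str) -> str:
    """Hypothetical judge-model call; a real judge would return a reasoned score."""
    return "4"

JUDGE_TEMPLATE = """Rate the candidate answer from 1 (poor) to 5 (excellent)
for correctness and completeness against the reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Respond with only the number."""

def evaluate(eval_cases: List[Dict[str, str]]) -> float:
    """Score every evaluation case and return the mean judge score."""
    scores = []
    for case in eval_cases:
        candidate = generate_response(case["question"])
        judge_prompt = JUDGE_TEMPLATE.format(
            question=case["question"],
            reference=case["reference"],
            candidate=candidate,
        )
        scores.append(float(call_judge_model(judge_prompt)))
    return mean(scores)

# Example usage with a tiny, illustrative evaluation set.
print(evaluate([{"question": "What is RAG?",
                 "reference": "Retrieval-augmented generation grounds model outputs in retrieved data."}]))
```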
Deploy
Production-level generative AI applications are complex systems with many interacting components. To deploy a generative AI application to production, you must manage and coordinate these components with the previous stages of generative AI application development. For example, a single application might use several LLMs alongside a database, all fed by a dynamic data pipeline. Each of these components can require its own deployment process.
Deploying generative AI applications is similar to deploying other complex software systems because you must deploy system components such as databases and Python applications. We recommend that you use standard software engineering practices such as version control and CI/CD.
Version control
Generative AI experimentation is an iterative process that involves repeated cycles of development, evaluation, and modification. To ensure a structured and manageable approach, you must implement strict versioning for all modifiable components. These components include the following:
- Prompt templates: Unless you use specific prompt management solutions, use version control tools to track versions.
- Chain definitions: Use version control tools to track versions of the code that defines the chain (including API integrations, database calls, and functions).
- External datasets: In RAG systems, external datasets play an important role. Use existing data analytics solutions such as BigQuery, AlloyDB for PostgreSQL, and Vertex AI Feature Store to track changes and versions of these datasets.
- Adapter models: Techniques like LoRA tuning for adapter models are constantly evolving. Use established data storage solutions (for example, Cloud Storage) to manage and version these assets effectively.
Continuous integration
In a continuous integration framework, every code change goes through automatic testing before merging to catch issues early. Unit and integration testing are important for quality and reliability. Unit tests focus on individual code pieces, while integration testing verifies that different components work together.
Implementing a continuous integration system helps to do the following:
- Ensure reliable, high-quality outputs: Rigorous testing increases confidence in the system's performance and consistency.
- Catch bugs early: Identifying issues through testing prevents them from causing bigger problems downstream. Catching bugs early makes the system more robust and resilient to edge cases and unexpected inputs.
- Lower maintenance costs: Well-documented test cases simplify troubleshooting and enable smoother modifications in the future, reducing overall maintenance efforts.
These benefits are applicable to generative AI applications. Apply continuous integration to all elements of the system, including the prompt templates, chain, chaining logic, any embedded models, and retrieval systems.
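For example, prompt templates and prompt-construction logic can be covered by ordinary unit tests that run in CI. The sketch below uses pytest against the hypothetical `build_prompt` helper and `TRANSLATION_TEMPLATE` from the earlier sketch; the module path `app.prompts` is an assumption. The assertions check structural properties (placeholders present and filled, instructions preserved) rather than exact model output, because generated output is not deterministic.

```python
# test_prompts.py -- run with `pytest` as part of the CI pipeline.
# Assumes the application exposes the template and builder; `app.prompts` is hypothetical.
from app.prompts import TRANSLATION_TEMPLATE, build_prompt

def test_template_declares_expected_placeholder():
    # The template is treated as code: its structure is asserted in CI.
    assert "{user_sentence}" in TRANSLATION_TEMPLATE

def test_build_prompt_fills_user_input():
    prompt = build_prompt("Hello, world.")
    assert "Hello, world." in prompt
    # The fixed instructions must survive prompt construction.
    assert "Translate the following sentence" in prompt

def test_build_prompt_leaves_no_unfilled_placeholders():
    prompt = build_prompt("Hello, world.")
    assert "{user_sentence}" not in prompt
```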
However, applying continuous integration to generative AI comes with the following challenges:
- Difficulty generating comprehensive test cases: The complex and open-ended nature of generative AI outputs makes it hard to define and create an exhaustive set of test cases that cover all possibilities.
- Reproducibility issues: Achieving deterministic, reproducible results is tricky because generative models often have intrinsic randomness and variability in their outputs, even for identical inputs. This randomness makes it harder to consistently test for expected behaviors.
These challenges are closely related to the broader question of how to evaluate generative AI applications. You can apply many of the same evaluation techniques to the development of CI systems for generative AI.
Continuous delivery
After the code is merged, a continuous delivery process begins to move the built and tested code through environments that closely resemble production for further testing before the final deployment.
As described in Develop and experiment, chain elements become one of the main components to deploy because they fundamentally constitute the generative AI application. The delivery process for the generative AI application that contains the chain might vary depending on the latency requirements and whether the use case is batch or online.
Batch use cases require that you deploy a batch process that is executed on a schedule in production. The delivery process focuses on testing the entire pipeline in integration in an environment that is similar to production before deployment. As part of the testing process, developers can assert specific requirements around the throughput of the batch process itself and check that all components of the application are functioning correctly. (For example, developers can check permissions, infrastructure, and code dependencies.)
Online use cases require that you deploy an API, which is the application that contains the chain and is capable of responding to users at low latency. Your delivery process involves testing the API in integration in an environment that is similar to production. These tests verify that all components of the application are functioning correctly. You can verify non-functional requirements (for example, scalability, reliability, and performance) through a series of tests, including load tests.
Deployment checklist
The following list describes the steps to take when you deploy a generative AI application using a managed service such as Vertex AI (a deployment sketch follows the list):
- Configure version control: Implement version control practices for model deployments. Version control lets you roll back to previous versions if necessary and track changes made to the model or deployment configuration.
- Optimize the model: Perform model optimization tasks (distillation, quantization, and pruning) before packaging or deploying the model.
- Containerize the model: Package the trained model into a container.
- Define the target hardware requirements: Ensure that the target deployment environment meets the requirements for optimal performance of the model, such as GPUs, TPUs, and other specialized hardware accelerators.
- Define the model endpoint: Specify the model container, input format, output format, and any additional configuration parameters.
- Allocate resources: Allocate the appropriate compute resources for the endpoint based on the expected traffic and performance requirements.
- Configure access control: Set up access control mechanisms to restrict access to the endpoint based on authentication and authorization policies. Access control helps ensure that only authorized users or services can interact with the deployed model.
- Create the model endpoint: Create an endpoint to deploy the model as a REST API service. The endpoint lets clients send requests to the endpoint and receive responses from the model.
- Configure monitoring and logging: Set up monitoring and logging systems to track the endpoint's performance, resource utilization, and error logs.
- Deploy custom integrations: Integrate the model into custom applications or services by using the model's SDK or APIs.
- Deploy real-time applications: Create a streaming pipeline that processes data and generates responses in real time.
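As one hedged example of the endpoint-creation and resource-allocation steps, the sketch below uses the Vertex AI SDK for Python (`google-cloud-aiplatform`) to upload a containerized model and deploy it to an endpoint. The project, region, container image, machine type, and accelerator values are placeholders you would replace with your own, and exact arguments can vary by SDK version; treat this as a sketch rather than a definitive recipe.

```python
# Sketch using the Vertex AI SDK for Python (pip install google-cloud-aiplatform).
# All names and values below are placeholders for your own project and model.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Upload the containerized model (see "Containerize the model" above).
model = aiplatform.Model.upload(
    display_name="genai-app-model",
    serving_container_image_uri="us-docker.pkg.dev/your-project-id/your-repo/serving-image:v1",
    artifact_uri="gs://your-bucket/model-artifacts/",  # tuned adapter or model weights
)

# Deploy to an endpoint, allocating compute based on expected traffic.
endpoint = model.deploy(
    machine_type="g2-standard-8",   # placeholder machine type
    accelerator_type="NVIDIA_L4",   # placeholder accelerator, if the model needs one
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)

# Clients can now send requests to the endpoint through REST or the SDK.
response = endpoint.predict(instances=[{"prompt": "Summarize the quarterly report."}])
print(response.predictions)
```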
Log and monitor
Monitoring generative AI applications and their components requires techniques that you can add to the monitoring techniques that you use for conventional MLOps. You must log and monitor your application end-to-end, which includes logging and monitoring the overall input and output of your application and every component.
Inputs to the application trigger multiple components to produce the outputs. If the output for a given input is factually inaccurate, you must determine which of the components didn't perform well. You need lineage in your logging for all components that were executed. You must also map the inputs and components with any additional artifacts and parameters that they depend on so that you can analyze the inputs and outputs.
When applying monitoring, prioritize monitoring at the application level. If application-level monitoring shows that the application is performing well, it implies that all components are also performing well. Afterwards, apply monitoring to the prompted model components to get more granular results and a better understanding of your application.
As with conventional monitoring in MLOps, you must deploy an alerting process to notify application owners when drift, skew, or performance decay is detected. To set up alerts, you must integrate alerting and notification tools into your monitoring process.
The following sections describe monitoring for skew and drift and continuous evaluation tasks. In addition, monitoring in MLOps includes monitoring the metrics for overall system health, like resource utilization and latency. These efficiency metrics also apply to generative AI applications.
Skew detection
Skew detection in conventional ML systems refers to training-serving skew that occurs when the feature data distribution in production deviates from the feature data distribution that was observed during model training. For generative AI applications that use pretrained models in components that are chained together to produce the output, you must also measure skew. You can measure skew by comparing the distribution of the input data that you used to evaluate your application and the distribution of the inputs to your application in production. If the two distributions drift apart, you must investigate further. You can apply the same process to the output data as well.
Drift detection
Like skew detection, drift detection checks for statistical differences between two datasets. However, instead of comparing evaluation and serving inputs, drift detection looks for changes in input data over time. Drift lets you evaluate the inputs and therefore how the behavior of your users changes over time.
Given that the input to the application is typically text, you can use different methods to measure skew and drift. In general, these methods try to identify significant changes in production data, both textual (such as size of input) and conceptual (such as topics in input), when compared to the evaluation dataset. All of these methods look for changes that could indicate that the application might not be prepared to successfully handle the nature of the new data that is now coming in. Some common methods include the following (a drift-detection sketch follows the list):
- Calculating embeddings and distances
- Counting text length and number of tokens
- Tracking vocabulary changes, new concepts and intents, and prompts and topics in datasets
- Using statistical approaches such as least-squares density difference (PDF), maximum mean discrepancy (MMD), learned kernel MMD (PDF), or context-aware MMD
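As a small illustration of the first two methods, the sketch below compares the token-count distribution of production prompts against the evaluation set with a two-sample Kolmogorov-Smirnov test, and compares mean embeddings with cosine distance. It assumes `numpy` and `scipy` are available and that embeddings already exist for both datasets; the thresholds are illustrative, not recommendations.

```python
import numpy as np
from scipy.stats import ks_2samp

def token_count_drift(eval_token_counts, prod_token_counts, p_threshold=0.01):
    """Flag drift if the token-length distributions differ significantly."""
    statistic, p_value = ks_2samp(eval_token_counts, prod_token_counts)
    return {"ks_statistic": statistic, "p_value": p_value, "drift": p_value < p_threshold}

def embedding_drift(eval_embeddings, prod_embeddings, distance_threshold=0.2):
    """Flag drift if the mean prompt embeddings move far apart (cosine distance)."""
    eval_mean = np.mean(eval_embeddings, axis=0)
    prod_mean = np.mean(prod_embeddings, axis=0)
    cosine_sim = np.dot(eval_mean, prod_mean) / (
        np.linalg.norm(eval_mean) * np.linalg.norm(prod_mean))
    distance = 1.0 - cosine_sim
    return {"cosine_distance": float(distance), "drift": distance > distance_threshold}

# Example usage with synthetic numbers standing in for real logs and embeddings.
rng = np.random.default_rng(0)
print(token_count_drift(rng.poisson(40, 500), rng.poisson(55, 500)))
print(embedding_drift(rng.normal(0, 1, (500, 16)), rng.normal(0.3, 1, (500, 16))))
```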
Because generative AI use cases are so diverse, you might require additional custom metrics that better capture unexpected changes in your data.
Continuous evaluation
Continuous evaluation is another common approach to generative AI application monitoring. In a continuous evaluation system, you capture the model's production output and run an evaluation task using that output to keep track of the model's performance over time. You can collect direct user feedback, such as ratings, which provides immediate insight into the perceived quality of outputs. In parallel, comparing model-generated responses against established ground truth allows for deeper analysis of performance. You can collect ground truth through human assessment or as a result of an ensemble AI model approach to generate evaluation metrics. This process provides a view of how your evaluation metrics have changed from when you developed your model to what you have in production today.
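One way to wire this up is to periodically sample logged production requests, attach any user ratings, and score records that have ground truth collected later. The sketch below is plain Python and deliberately generic; the record fields and the `score_against_ground_truth` helper are hypothetical placeholders for your own logging schema and evaluation method.

```python
import random
from statistics import mean
from typing import Dict, List

def score_against_ground_truth(response: str, ground_truth: str) -> float:
    """Hypothetical scoring step; could be a metric, a judge model, or human review."""
    return 1.0 if ground_truth.lower() in response.lower() else 0.0

def continuous_evaluation(production_records: List[Dict], sample_size: int = 100) -> Dict:
    """Summarize user feedback and ground-truth scores for a sample of production traffic."""
    sample = random.sample(production_records, min(sample_size, len(production_records)))
    ratings = [r["user_rating"] for r in sample if r.get("user_rating") is not None]
    scored = [score_against_ground_truth(r["response"], r["ground_truth"])
              for r in sample if r.get("ground_truth")]
    return {
        "sampled_records": len(sample),
        "mean_user_rating": mean(ratings) if ratings else None,
        "ground_truth_accuracy": mean(scored) if scored else None,
    }

# Example usage with one illustrative production record.
print(continuous_evaluation([{"response": "RAG grounds outputs in retrieved data.",
                              "ground_truth": "retrieved data", "user_rating": 5}]))
```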
Govern
In the context of MLOps, governance encompasses all the practices and policies that establish control, accountability, and transparency over the development, deployment, and ongoing management of machine learning models, including all the activities related to the code, data, and model lifecycles.
In predictive AI applications, lineage focuses on tracking and understanding the complete journey of a machine learning model. In generative AI, lineage goes beyond the model artifact to extend to all the components in the chain. Tracking includes the data, models, model lineage, code, and the relative evaluation data and metrics. Lineage tracking can help you audit, debug, and improve your models.
Along with these new practices, you can govern the data lifecycle and the generative AI component lifecycles using standard MLOps and DevOps practices.
What's next
Deploy a generative AI application using Vertex AI
Authors: Anant Nawalgaria, Christos Aniftos, Elia Secchi, Gabriela Hernandez Larios, Mike Styer, and Onofrio Petragallo