Multi-agent AI system in Google Cloud

Last reviewed 2025-09-16 UTC

This document provides a reference architecture to help you design robust multi-agent AI systems in Google Cloud. A multi-agent AI system optimizes complex and dynamic processes by segmenting them into discrete tasks that multiple specialized AI agents collaboratively execute.

The intended audience for this document includes architects, developers, and administrators who build and manage AI infrastructure and applications in the cloud. This document assumes a foundational understanding of AI agents and models. The document doesn't provide specific guidance for designing and coding AI agents.

The Deployment section of this document lists code samples that you can use to learn how to build and deploy multi-agent AI systems.

Architecture

The following diagram shows an architecture for an example of a multi-agent AI system that's deployed in Google Cloud.

Architecture for a multi-agent AI system in Google Cloud.

Architecture components

The example architecture in the preceding section contains the following components:

Frontend

Users interact with the multi-agent system through a frontend, such as a chat interface, that runs as a serverless Cloud Run service.

Agents

A coordinator agent controls the agentic AI system in this example. The coordinator agent invokes an appropriate subagent to trigger the agentic flow. The agents can communicate with each other by using the Agent2Agent (A2A) protocol, which enables interoperability between agents regardless of their programming language and runtime. The example architecture shows agents in a sequential pattern and an iterative refinement pattern.

For more information about the subagents in this example, see the Agentic flow section.

Agents runtime

AI agents can be deployed as serverless Cloud Run services, as containerized apps on Google Kubernetes Engine (GKE), or on Vertex AI Agent Engine.

ADK

Agent Development Kit (ADK) provides tools and a framework to develop, test, and deploy agents. ADK abstracts the complexity of agent creation and lets AI developers focus on the agent's logic and capabilities.

AI model and model runtimes

For inference serving, the agents in this example architecture use an AI model on Vertex AI. The architecture shows Cloud Run and GKE as alternative runtimes for the AI model that you choose to use.

Model Armor

Model Armor enables inspection and sanitization of inputs and responses for models that are deployed in Vertex AI and GKE. For more information, see Model Armor integration with Google Cloud services.

MCP clients, servers, and tools

The Model Context Protocol (MCP) facilitates access to tools by standardizing the interaction between agents and tools. For each agent-tool pair, an MCP client sends requests to an MCP server through which the agent accesses a tool such as a database, a file system, or an API. A minimal client-side sketch follows these component descriptions.
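
To make the MCP interaction concrete, the following sketch uses the open-source Python MCP SDK to act as the agent-side client: it connects to an MCP server, discovers the tools that the server advertises, and calls one of them. The server command and the tool name are hypothetical placeholders for illustration; they aren't part of this reference architecture.

```python
# Minimal sketch: an agent-side MCP client that lists the tools that an MCP
# server exposes and calls one of them. The server command
# ("./inventory-mcp-server") and the tool name ("query_inventory") are
# hypothetical placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server_params = StdioServerParameters(
        command="./inventory-mcp-server",  # hypothetical MCP server binary
        args=[],
    )
    async with stdio_client(server_params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover the tools that the server advertises.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Invoke one tool with structured arguments.
            result = await session.call_tool(
                "query_inventory", arguments={"sku": "SKU-1234"}
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```

In practice, an agent framework such as ADK can manage the MCP client connection for you, so that the agent sees MCP-provided tools alongside its other tools.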

Agentic flow

The example multi-agent system in the preceding architecture has the following flow:

  1. A user enters a prompt through a frontend, such as a chat interface, which runs as a serverless Cloud Run service.
  2. The frontend forwards the prompt to a coordinator agent.
  3. The coordinator agent starts one of the following agentic flows based on the intent that's expressed in the prompt (a code sketch of both flows follows this list).

    • Sequential:
      1. The task-A subagent performs a task.
      2. The task-A subagent invokes the task-A.1 subagent.
    • Iterative refinement:

      1. The task-B subagent performs a task.
      2. The quality evaluator subagent reviews the output of the task-B subagent.
      3. If the output is unsatisfactory, the quality evaluator invokes the prompt enhancer subagent to refine the prompt.
      4. The task-B subagent performs its task again by using the enhanced prompt.

      This cycle continues until the output is satisfactory or the maximum number of iterations is reached.

    The example architecture includes a human-in-the-loop path to let human users intervene in the agentic flow when necessary.

  4. The task-A.1 subagent and the quality evaluator subagent independently invoke the response generator subagent.

  5. The response generator subagent generates a response, performs validation and grounding checks, and then sends the final response to the user through the coordinator agent.
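
The following sketch shows one way that the sequential and iterative refinement patterns could be expressed with ADK workflow agents. The agent names, instructions, and model ID are illustrative assumptions rather than part of this architecture, and a production coordinator would also define tools, A2A communication, and human-in-the-loop checkpoints.

```python
# Minimal sketch of the coordinator and the two agentic flows by using ADK
# workflow agents. Names, instructions, and the model ID are illustrative.
from google.adk.agents import LlmAgent, LoopAgent, SequentialAgent

MODEL = "gemini-2.0-flash"  # example model ID; choose a model for your needs

task_a = LlmAgent(name="task_a", model=MODEL, instruction="Perform task A.")
task_a_1 = LlmAgent(name="task_a_1", model=MODEL, instruction="Perform task A.1.")

# Sequential pattern: task A, then task A.1.
sequential_flow = SequentialAgent(
    name="sequential_flow", sub_agents=[task_a, task_a_1]
)

task_b = LlmAgent(
    name="task_b", model=MODEL, instruction="Perform task B.", output_key="draft"
)
quality_evaluator = LlmAgent(
    name="quality_evaluator",
    model=MODEL,
    instruction="Review {draft}. If it's unsatisfactory, explain how to refine the prompt.",
)

# Iterative refinement pattern: repeat task B and evaluation for up to three
# iterations. In practice, the evaluator would signal an early exit (for
# example, by escalating) when the output is satisfactory.
iterative_flow = LoopAgent(
    name="iterative_flow",
    sub_agents=[task_b, quality_evaluator],
    max_iterations=3,
)

# The coordinator routes each prompt to the flow that matches its intent.
coordinator = LlmAgent(
    name="coordinator",
    model=MODEL,
    instruction="Route the user's request to the flow that matches its intent.",
    sub_agents=[sequential_flow, iterative_flow],
)
```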

Products and tools used

This reference architecture uses the following Google Cloud and third-party products and tools:

  • Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
  • Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
  • Google Kubernetes Engine (GKE): A Kubernetes service that you can use to deploy and operate containerized applications at scale by using Google's infrastructure.
  • Model Armor: A service that provides protection for your generative and agentic AI resources against prompt injection, sensitive data leaks, and harmful content.
  • Agent Development Kit (ADK): A set of tools and libraries to develop, test, and deploy AI agents.
  • Agent2Agent (A2A) protocol: An open protocol that enables communication and interoperability between agents regardless of their programming language and runtime.
  • Model Context Protocol (MCP): An open-source standard for connecting AI applications to external systems.

Use cases

Multi-agent AI systems are suitable for complex use cases that require collaboration and coordination across multiple specialized skill sets to achieve a business goal. To identify use cases that multi-agent AI systems are suitable for, analyze your business processes and identify specific tasks that AI can augment. Focus on tangible business outcomes, like cost reduction and accelerated processing. This approach helps align your investments in AI with business value.

The following are examples of use cases for multi-agent AI systems.

Financial advisor

Provide personalized stock trading recommendations and execute trades. The following diagram shows an example of an agentic flow for this use case. This example uses a sequential pattern.

Financial advisor use case for a multi-agent system.

The diagram shows the following flow:

  1. A data retriever agent retrieves real-time and historical stock prices, company financial reports, and other relevant data from reliable sources.
  2. A financial analyzer agent applies appropriate analytics and charting techniques to the data, identifies price movement patterns, and makes predictions.
  3. A stock recommender agent uses the analysis and charts to generate personalized recommendations to buy and sell specific stocks based on the user's risk profile and investment goals.
  4. A trade executor agent buys and sells stocks on behalf of the user.

Research assistant

Create a research plan, gather information, evaluate and refine the research, and then compose a report. The following diagram shows an example of an agentic flow for this use case. The main flow in this example uses a sequential pattern. The example also includes an iterative refinement pattern.

Research assistant use case for a multi-agent system.

The diagram shows the following flow:

  1. A planner agent creates a detailed research plan.
  2. A researcher agent completes the following tasks:

    1. Uses the research plan to identify appropriate internal and external data sources.
    2. Gathers and analyzes the required data.
    3. Prepares a research summary and provides the summary to an evaluator agent.

    The researcher agent repeats these tasks until the evaluator agent approves the research.

  3. A report composer agent creates the final research report.

Supply chain optimizer

Optimize inventory, track shipments, and communicate with supply chain partners. The following diagram shows an example of an agentic flow for this use case. This example uses a sequential pattern.

Supply chain optimizer use case for a multi-agent system.

  1. A warehouse manager agent ensures optimal stock levels by creating restock orders based on inventory, demand forecasts, and supplier lead times.

    • The agent interacts with the shipment tracker agent to track deliveries.
    • The agent interacts with the supplier communicator agent to notify suppliers about changes in orders.
  2. A shipment tracker agent ensures timely and efficient fulfillment of orders by integrating with suppliers' logistics platforms and carrier systems.

  3. A supplier communicator agent communicates with external suppliers on behalf of the other agents in the system.

Design alternatives

Depending on your requirements for manageability, control, and flexibility, you can choose from a range of runtime options in Google Cloud for your AI agents and model. For more information, see Choose your agentic AI architecture components.

Design considerations

This section describes design factors, best practices, and recommendations to consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost, and performance.

The guidance in this section isn't exhaustive. Depending on your workload's requirements and the Google Cloud and third-party products and features that you use, there might be additional design factors and trade-offs that you should consider.

System design

This section provides guidance to help you choose Google Cloud regions for your deployment and to select appropriate Google Cloud products and tools.

Region selection

When you select Google Cloud regions for your AI applications, consider factors such as the availability of the models and services that you need, latency, cost, carbon footprint, and your data residency and compliance requirements.

To select appropriate Google Cloud locations for your applications, use the following tools:

  • Google Cloud Region Picker: An interactive web-based tool to select the optimal Google Cloud region for your applications and data based on factors like carbon footprint, cost, and latency.
  • Cloud Location Finder API: A public API that provides a programmatic way to find deployment locations in Google Cloud, Google Distributed Cloud, and other cloud providers.

Agent design

This section provides general recommendations for designing AI agents. Detailed guidance about writing agent code and logic is outside the scope of this document.

The following recommendations are grouped by design focus.

Agent definition and design
  • Clearly define the business goal of the agentic AI system and the task that each agent performs.
  • Choose an agent design pattern that best meets your requirements.
  • Use ADK to efficiently create, deploy, and manage your agentic architecture.

Agent interactions
  • Design the human-facing agents in the architecture to support natural language interactions.
  • Ensure that each agent clearly communicates its actions and status to its dependent clients.
  • Design the agents to detect and handle ambiguous queries and nuanced interactions.

Context, tools, and data
  • Ensure that the agents have sufficient context to track multi-turn interactions and session parameters.
  • Clearly describe the purpose, arguments, and usage of the tools that the agents can use (see the example tool definition after these recommendations).
  • Ensure that the agents' responses are grounded in reliable data sources to reduce hallucinations.
  • Implement logic to handle no-match situations, such as when a prompt is off-topic.
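
The following sketch illustrates the recommendation about describing tools clearly: it defines a Python function tool for an ADK agent with a docstring that states the tool's purpose, arguments, and return value. The function name, data source, and agent configuration are hypothetical.

```python
# Minimal sketch of a well-described function tool for an ADK agent.
# The function name, arguments, and data source are hypothetical.
from google.adk.agents import LlmAgent


def get_order_status(order_id: str) -> dict:
    """Returns the fulfillment status of a customer order.

    Use this tool when the user asks where an order is or when it will arrive.

    Args:
        order_id: The unique order identifier, for example "ORD-2048".

    Returns:
        A dict with the keys "status" and "estimated_delivery_date".
    """
    # In a real system, query an order-management API or database here.
    return {"status": "SHIPPED", "estimated_delivery_date": "2025-10-01"}


support_agent = LlmAgent(
    name="support_agent",
    model="gemini-2.0-flash",  # example model ID
    instruction="Answer order questions by using the get_order_status tool.",
    tools=[get_order_status],
)
```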

Security

This section describes design considerations and recommendations to design a topology in Google Cloud that meets your workload's security requirements.

The following design considerations and recommendations are organized by component.
Agents

AI agents introduce certain unique and critical security risks that conventional, deterministic security practices might not be able to mitigate adequately. Google recommends an approach that combines the strengths of deterministic security controls with dynamic, reasoning-based defenses. This approach is grounded in three core principles: human oversight, carefully defined agent autonomy, and observability. The following are specific recommendations that are aligned with these core principles.

Human oversight: An agentic AI system might sometimes fail or not perform as expected. For example, the model might generate inaccurate content or an agent might select inappropriate tools. In business-critical agentic AI systems, incorporate a human-in-the-loop flow to let human supervisors monitor, override, and pause agents. For example, human users can review the output of agents, approve or reject the outputs, and provide further guidance to correct errors or to make strategic decisions. This approach combines the efficiency of agentic AI systems with the critical thinking and domain expertise of human users.

Access control for agents: Configure agent permissions by using Identity and Access Management (IAM) controls. Grant each agent only the permissions that it needs to perform its tasks and to communicate with tools and with other agents. This approach helps to minimize the potential impact of a security breach, because a compromised agent would have limited access to other parts of the system. For more information, see Set up the identity and permissions for your agent and Managing access for deployed agents.

Monitoring: Monitor agent behavior by using comprehensive tracing capabilities that give you visibility into every action that an agent takes, including its reasoning process, tool selection, and execution paths. For more information, see Logging an agent in Vertex AI Agent Engine and Logging in the ADK.

For more information about securing AI agents, see Safety and Security for AI Agents.
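
One way to apply the human-oversight and limited-autonomy principles is to gate high-risk tool calls behind an approval check. The following sketch assumes ADK's before_tool_callback hook; the tool names, the state flag, and the agent configuration are hypothetical, and you should verify the callback signature against the ADK documentation for the version that you use.

```python
# Minimal sketch: a human-in-the-loop gate for high-risk tool calls, assuming
# ADK's before_tool_callback hook. Tool names, the state flag, and the agent
# configuration are hypothetical placeholders.
from typing import Any, Optional

from google.adk.agents import LlmAgent
from google.adk.tools.base_tool import BaseTool
from google.adk.tools.tool_context import ToolContext

HIGH_RISK_TOOLS = {"execute_trade", "delete_records"}  # hypothetical tool names


def require_human_approval(
    tool: BaseTool, args: dict[str, Any], tool_context: ToolContext
) -> Optional[dict]:
    """Blocks high-risk tool calls until a human supervisor has approved them."""
    if tool.name in HIGH_RISK_TOOLS and not tool_context.state.get("human_approved"):
        # Returning a dict skips the tool call; the dict is used as the result.
        return {"status": "blocked", "reason": "Awaiting human approval."}
    return None  # Allow the tool call to proceed.


trading_agent = LlmAgent(
    name="trading_agent",
    model="gemini-2.0-flash",  # example model ID
    instruction="Execute trades only after human approval.",
    tools=[],  # register the actual trading tools here
    before_tool_callback=require_human_approval,
)
```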

Vertex AI

Shared responsibility: Security is a shared responsibility. Vertex AI secures the underlying infrastructure and provides tools and security controls to help you protect your data, code, and models. You are responsible for properly configuring your services, managing access controls, and securing your applications. For more information, see Vertex AI shared responsibility.

Security controls: Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, customer-managed encryption keys (CMEK), network security using VPC Service Controls, and Access Transparency. For more information, see the following documentation:

Safety: AI models might produce harmful responses, sometimes in response to malicious prompts.

  • To enhance safety and mitigate potential misuse of the agentic AI system, you can configure content filters to act as barriers to harmful inputs and responses. For more information, see Safety and content filters. A sketch of configuring safety settings follows this component's recommendations.
  • To inspect and sanitize inference requests and responses for threats like prompt injection and harmful content, you can use Model Armor. Model Armor helps you prevent malicious input, verify content safety, protect sensitive data, maintain compliance, and enforce safety and security policies consistently.

Model access: You can set up organization policies to limit the type and versions of AI models that can be used in a Google Cloud project. For more information, see Control access to Model Garden models.

Data protection: To discover and de-identify sensitive data in the prompts and responses and in log data, use the Cloud Data Loss Prevention API. For more information, see this video: Protecting sensitive data in AI apps.
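
As a concrete example of configurable content filters, the following sketch uses the Google Gen AI SDK to call a Gemini model on Vertex AI with explicit safety settings. The project ID, region, prompt, and model ID are placeholders; the available harm categories and thresholds are described in the safety and content filter documentation.

```python
# Minimal sketch: calling a Gemini model on Vertex AI with explicit safety
# settings by using the Google Gen AI SDK. Project, region, and model ID are
# placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # example model ID
    contents="Summarize our refund policy for a customer.",
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
            ),
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
            ),
        ],
    ),
)
print(response.text)
```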

MCP

When you configure your agents to use MCP, ensure that access to external data and tools is authorized, implement privacy controls like encryption, apply filters to protect sensitive data, and monitor agent interactions. For more information, see MCP and Security.
A2A

Transport security: The A2A protocol mandates HTTPS for all A2A communication in production environments, and it recommends Transport Layer Security (TLS) 1.2 or later.

Authentication: The A2A protocol delegates authentication to standard web mechanisms like HTTP headers and to standards like OAuth2 and OpenID Connect. Each agent advertises the authentication requirements in its Agent Card. For more information, see A2A authentication.
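
To illustrate how a client agent can discover a remote agent's security requirements, the following sketch fetches the remote agent's Agent Card and inspects its declared authentication schemes before connecting. It assumes that the card is published at the conventional /.well-known/agent.json path; the remote URL is a placeholder, and the path and field names can vary by A2A specification version.

```python
# Minimal sketch: fetch a remote agent's A2A Agent Card over HTTPS and inspect
# its declared authentication requirements before connecting. The URL is a
# placeholder, and the card is assumed to be at the conventional
# /.well-known/agent.json path.
import requests

REMOTE_AGENT_URL = "https://agents.example.com/task-b"  # placeholder

if not REMOTE_AGENT_URL.startswith("https://"):
    raise ValueError("A2A communication in production must use HTTPS.")

card = requests.get(f"{REMOTE_AGENT_URL}/.well-known/agent.json", timeout=10).json()

print("Agent:", card.get("name"))
# The field name differs across A2A spec versions ("securitySchemes" or "authentication").
print("Declared auth:", card.get("securitySchemes") or card.get("authentication"))

# A real client would obtain credentials (for example, an OAuth 2.0 access
# token) that satisfy one of the declared schemes and send them in HTTP headers.
```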

Cloud Run

Ingress security (for the frontend service): To control access to the application, disable the default run.app URL of the frontend Cloud Run service and set up a regional external Application Load Balancer. In addition to load-balancing incoming traffic to the application, the load balancer handles SSL certificate management. For added protection, you can use Google Cloud Armor security policies to provide request filtering, DDoS protection, and rate limiting for the service.

User authentication:

  • Users inside your organization: To authenticate internal user access to the frontend Cloud Run service, use Identity-Aware Proxy (IAP). When a user tries to access an IAP-secured resource, IAP performs authentication and authorization checks.
  • Users outside your organization: To authenticate external user access to the frontend service, use Identity Platform or Firebase Authentication. To manage external user access, configure your application to handle a sign-in flow and to make authenticated API calls to the Cloud Run service.

For more information, see Authenticating users.
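
When IAP fronts the Cloud Run service, the application can defensively verify the signed header that IAP adds to each request. The following sketch is based on Google's documented approach for verifying IAP-signed JWTs and assumes a Flask-based frontend; the expected audience string is a placeholder that you construct for your backend.

```python
# Minimal sketch: verify the JWT that IAP adds to requests before serving them
# from the frontend. The audience value is a placeholder; see the IAP
# documentation for how to construct it for your backend.
from flask import Flask, abort, request
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

app = Flask(__name__)
EXPECTED_AUDIENCE = "/projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID"


@app.route("/")
def index():
    iap_jwt = request.headers.get("x-goog-iap-jwt-assertion")
    if not iap_jwt:
        abort(401)
    try:
        claims = id_token.verify_token(
            iap_jwt,
            google_requests.Request(),
            audience=EXPECTED_AUDIENCE,
            certs_url="https://www.gstatic.com/iap/verify/public_key",
        )
    except ValueError:
        abort(401)
    return f"Hello, {claims.get('email')}"
```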

Container image security: To ensure that only authorized container images are deployed to Cloud Run, you can use Binary Authorization. To identify and mitigate security risks in the container images, use Artifact Analysis to automatically run vulnerability scans. For more information, see Container scanning overview.

Data residency: Cloud Run helps you meet data residency requirements. Your Cloud Run services run within the selected region.

For more guidance about container security, see General Cloud Run development tips.

All of the products in the architecture

Data encryption: By default, Google Cloud encrypts data at rest by using Google-owned and Google-managed encryption keys. To protect your agents' data by using encryption keys that you control, you can use CMEKs that you create and manage in Cloud KMS. For information about Google Cloud services that are compatible with Cloud KMS, see Compatible services.

Mitigate data exfiltration risk: To reduce the risk of data exfiltration, create a VPC Service Controls perimeter around the infrastructure. VPC Service Controls supports all of the Google Cloud services that this reference architecture uses.

Access control: When you configure permissions for the resources in your topology, follow the principle of least privilege.

Cloud environment security: Use the tools in Security Command Center to detect vulnerabilities, identify and mitigate threats, define and deploy a security posture, and export data for further analysis.

Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize security by using Active Assist. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist.


Reliability

This section describes design considerations and recommendations to build and operate reliable infrastructure for your deployment in Google Cloud.

The following design considerations and recommendations are organized by component.
Agents

Fault tolerance: Design the agentic system to tolerate or handle agent-level failures. Where feasible, use a decentralized approach where agents can operate independently.

Simulate failures: Before deploying the agentic AI system to production, validate it by simulating a production environment. Identify and fix inter-agent coordination issues and unexpected behaviors.

Error handling: To enable diagnosis and troubleshooting of errors, implement logging, exception handling, and retry mechanisms.

Vertex AI

Quota management: Vertex AI supports dynamic shared quota (DSQ) for Gemini models. DSQ helps to flexibly manage pay-as-you-go requests, and it eliminates the need to manage quota manually or to request quota increases. DSQ dynamically allocates the available resources for a given model and region across active customers. With DSQ, there are no predefined quota limits on individual customers.

Capacity planning: If the number of requests to the model exceeds the allocated capacity, then error code 429 is returned. For workloads that are business critical and that require consistently high throughput, you can reserve throughput by using Provisioned Throughput.

Model endpoint availability: If data can be shared across multiple regions or countries, you can use a global endpoint for the model.
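
To handle the 429 responses described in the capacity-planning recommendation, clients can retry requests with exponential backoff. The following sketch wraps a Gen AI SDK call in a simple backoff loop; the project, region, and model ID are placeholders, and the exact exception type can vary by SDK version, so the sketch checks the error's status code.

```python
# Minimal sketch: retry a Vertex AI model call with exponential backoff when
# the service returns HTTP 429 (resource exhausted). Project settings and the
# model ID are placeholders; adjust the exception handling to the SDK version
# that you use.
import random
import time

from google import genai
from google.genai import errors

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")


def generate_with_backoff(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash",  # example model ID
                contents=prompt,
            )
            return response.text
        except errors.APIError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep((2**attempt) + random.random())
    raise RuntimeError("Retries exhausted")
```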

Cloud Run

Robustness to infrastructure outages: Cloud Run is a regional service. It stores data synchronously across multiple zones within a region and it automatically load-balances traffic across the zones. If a zone outage occurs, Cloud Run continues to run and data isn't lost. If a region outage occurs, the service stops running until Google resolves the outage.
All of the products in the architecture

Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize reliability by using Active Assist. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.

Operations

This section describes the factors to consider when you use this reference architecture to design a Google Cloud topology that you can operate efficiently.

The following design considerations and recommendations are organized by component.
Vertex AI

Monitoring using logs: By default, agent logs that are written to the stdout and stderr streams are routed to Cloud Logging. For advanced logging, you can integrate the Python logger with Cloud Logging. If you need full control over logging and structured logs, use the Cloud Logging client. For more information, see Logging an agent and Logging in ADK.

Continuous evaluation: Regularly perform a qualitative evaluation of the output of the agents and the trajectory or steps taken by the agents to produce the output. To implement agent evaluation, you can use the Gen AI evaluation service or the evaluation methods that ADK supports.
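
The logging recommendation above can be implemented by attaching the Cloud Logging handler to the standard Python logger, so that agent logs become structured entries in Cloud Logging. A minimal sketch, assuming the google-cloud-logging client library and credentials that can write logs:

```python
# Minimal sketch: route standard Python logging from an agent process to
# Cloud Logging. Requires the google-cloud-logging client library and
# credentials with permission to write logs.
import logging

import google.cloud.logging

# Attach a Cloud Logging handler to the root Python logger.
logging_client = google.cloud.logging.Client()
logging_client.setup_logging()

logger = logging.getLogger("coordinator-agent")
logger.info("Agentic flow started", extra={"json_fields": {"flow": "sequential"}})
```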

MCP

Database tools: To efficiently manage database tools for your AI agents and to ensure that the agents securely handle complexities like connection pooling and authentication, use the MCP Toolbox for Databases. It provides a centralized location to store and update database tools. You can share the tools across agents and update the tools without redeploying agents. The toolbox includes a wide range of tools for Google Cloud databases like AlloyDB for PostgreSQL and for third-party databases like MongoDB.

Generative AI models: To enable AI agents to use Google generative AI models like Imagen and Veo, you can use MCP Servers for Google Cloud generative media APIs.

Google security products and tools: To enable your AI agents to access Google security products and tools like Google Security Operations, Google Threat Intelligence, and Security Command Center, use MCP servers for Google security products.

All of the Google Cloud products in the architecture

Tracing: Continuously gather and analyze trace data by using Cloud Trace. Trace data lets you rapidly identify and diagnose errors within complex agent workflows. You can perform in-depth analysis through visualizations in the Trace Explorer tool. For more information, see Trace an agent.

For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Well-Architected Framework.

Cost optimization

This section provides guidance to optimize the cost of setting up and operating a Google Cloud topology that you build by using this reference architecture.

The following design considerations and recommendations are organized by component.
Vertex AI

Cost analysis and management: To analyze and manage Vertex AI costs, we recommend that you create baseline metrics for queries per second (QPS) and tokens per second (TPS). Then, monitor these metrics after deployment. The baseline also helps with capacity planning. For example, the baseline helps you determine when Provisioned Throughput might be necessary.

Model selection: The model that you select for your AI application directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

Cost-effective prompting: The length of your prompts (input) and the generated responses (output) directly affects performance and cost. Write prompts that are short and direct, and that provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.

Context caching: To reduce the cost of requests that contain repeated content with high input token counts, use context caching.

Batch requests: When relevant, consider batch prediction. Batched requests incur a lower cost than standard requests.
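
The context-caching recommendation can be sketched as follows with the Gen AI SDK: create a cache for the large, repeated context once, and then reference it from later requests. The project, model ID, TTL, and cached contents are placeholders; check the SDK documentation for minimum token requirements and supported models.

```python
# Minimal sketch: cache large, repeated context once and reference it from
# later requests to reduce input-token costs. Project settings, model ID, TTL,
# and contents are placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

LONG_REFERENCE_DOCUMENT = "..."  # placeholder for large, repeated context

# Create the cache once, for example at application startup.
cache = client.caches.create(
    model="gemini-2.0-flash",  # example model ID
    config=types.CreateCachedContentConfig(
        system_instruction="You are a financial research assistant.",
        contents=[LONG_REFERENCE_DOCUMENT],
        ttl="3600s",
    ),
)

# Reference the cache from each subsequent request.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize the liquidity risks in the reference document.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```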

Cloud Run

Resource allocation: When you create a Cloud Run service, you can specify the amount of memory and CPU to be allocated. Start with the default CPU and memory allocations. Observe the resource usage and cost over time, and adjust the allocation as necessary. For more information, see the following documentation:

Rate optimization: If you can predict the CPU and memory requirements, you can save money with committed use discounts (CUDs).

All of the products in the architecture

Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize cost by using Active Assist. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist.

To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.

Performance optimization

This section describes design considerations and recommendations to design a topology in Google Cloud that meets the performance requirements of your workloads.

The following design considerations and recommendations are organized by component.
Agents

Model selection: When you select models for your agentic AI system, consider the capabilities that are required for the tasks that the agents need to perform.

Prompt optimization: To rapidly improve and optimize prompt performance at scale and to eliminate the need for manual rewriting, use the Vertex AI prompt optimizer. The optimizer helps you efficiently adapt prompts across different models.

Vertex AI

Model selection: The model that you select for your AI application directly affects both costs and performance. To identify the model that provides an optimal balance between performance and cost for your specific use case, test models iteratively. We recommend that you start with the most cost-efficient model and progress gradually to more powerful options.

Prompt engineering: The length of your prompts (input) and the generated responses (output) directly affects performance and cost. Write prompts that are short and direct, and that provide sufficient context. Design your prompts to get concise responses from the model. For example, include phrases such as "summarize in 2 sentences" or "list 3 key points". For more information, see the best practices for prompt design.

Context caching: To reduce latency for requests that contain repeated content with high input token counts, use context caching.

Cloud Run

Resource allocation: Depending on your performance requirements, configure the memory and CPU to be allocated to the Cloud Run service. For more information, see the following documentation:

For more performance optimization guidance, see General Cloud Run development tips.

All of the products in the architecture

Post-deployment optimization: After you deploy your application in Google Cloud, get recommendations to further optimize performance by using Active Assist. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist.

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.

Deployment

To learn how to build and deploy multi-agent AI systems, use the following code samples. These code samples are fully functional starting points for learning and experimentation. For optimal operation in production environments, you must customize the code based on your specific business and technical requirements.

  • Financial advisor: Analyze stock market data, create trading strategies, define execution plans, and evaluate risks.
  • Research assistant: Plan and conduct research, evaluate the findings, and compose a research report.
  • Insurance agent: Create memberships, provide roadside assistance, and handle insurance claims.
  • Search optimizer: Find search keywords, analyze web pages, and provide suggestions to optimize search.
  • Data analyzer: Retrieve data, perform complex manipulations, generate visualizations, and run ML tasks.
  • Web-marketing agent: Choose a domain name, design a website, create campaigns, and produce content.
  • Airbnb planner (with A2A and MCP): For a given location and time, find Airbnb listings and get weather information.

For code samples to get started with using ADK together with MCP servers, see MCP Tools.

What's next

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors:

