AI Inference Solutions
Powering the most performant, efficient, and profitable AI factories.
AI inference—how we experience AI through chatbots, copilots, and creative tools—is scaling at a double exponential pace. User adoption is accelerating while the AI tokens generated per interaction, driven by agentic workflows, long-thinking reasoning, and mixture-of-experts (MoE) models, soar in parallel.
To enable inference at this massive scale, NVIDIA delivers data-center-scale architecture on an annual rhythm. Our extreme hardware and software codesign delivers order-of-magnitude leaps in performance, drives down the cost per token, and unlocks greater revenue and profit.
NVIDIA Blackwell NVL72 delivers more than 10x better inference performance compared to NVIDIA H200 across a broad range of MoE models, including Kimi K2 Thinking, DeepSeek-R1, and Mistral Large 3.
By processing ten times as many tokens using the same time and power, the cost per token drops dramatically, enabling MoEs to be deployed into everyday products. This is how frontier intelligence becomes mainstream.
NVIDIA Blackwell swept the new SemiAnalysis InferenceMAX™ v1 benchmarks, achieving the highest AI inference performance and best overall efficiency. NVIDIA Blackwell enables the highest AI factory revenue: A $5M investment in GB200 NVL72 generates $75 million in token revenue—a 15x return on investment.
DeepSeek-R1 8K/1K results show a 15x performance benefit and revenue opportunity for NVIDIA Blackwell GB200 NVL72 over Hopper H200.
The NVIDIA inference platform delivers a range of benefits captured in the Think SMART framework—spanning scale and efficiency, multidimensional performance, architecture and software codesign, ROI driven by performance, and an extensive technology ecosystem.
NVIDIA Blackwell delivers industry-leading performance across diverse use cases, effectively balancing multiple dimensions: throughput, latency, intelligence, cost, and energy efficiency. For intelligent mixture-of-experts models such as Kimi K2 Thinking, DeepSeek-R1, and Mistral Large 3, users can achieve up to 10x faster performance on NVIDIA Blackwell NVL72 compared with H200.
NVIDIA Blackwell NVL72 delivers 1/10th the cost per token for MoE models. Performance is the biggest lever to drive down cost per token and maximize AI revenue.
With full-stack innovation across compute, networking, and software, NVIDIA enables you to efficiently scale complex AI deployments.
NVIDIA provides a proven platform with an installed base of hundreds of millions of CUDA® GPUs, 7 million developers, contributions to 1,000+ open-source projects, and deep integrations with frameworks like PyTorch, JAX, SGLang, vLLM, and more.
Performance Drives Profitability
The faster your system can generate tokens while delivering a seamless user experience, the more revenue you can make from the same power and cost footprint. NVIDIA Blackwell delivers $75M in revenue for every $5M CAPEX spent, a 15x return on investment.
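As a rough illustration of how throughput translates into revenue, here is a minimal back-of-envelope sketch of the arithmetic in Python. The throughput, utilization, and price-per-token values are hypothetical placeholders chosen only so the result lands on the stated $5M-to-$75M (15x) ratio; they are not NVIDIA benchmark figures.

    # Illustrative back-of-envelope math for token revenue vs. CAPEX.
    # All inputs are hypothetical placeholders, not NVIDIA benchmark figures.

    SECONDS_PER_YEAR = 365 * 24 * 3600

    def annual_token_revenue(tokens_per_sec, price_per_million_tokens, utilization=0.7):
        """Revenue from serving tokens at a given sustained throughput."""
        tokens_per_year = tokens_per_sec * SECONDS_PER_YEAR * utilization
        return tokens_per_year / 1e6 * price_per_million_tokens

    capex = 5_000_000                     # $5M rack investment (from the claim above)
    revenue = annual_token_revenue(
        tokens_per_sec=1_000_000,         # assumed aggregate rack throughput
        price_per_million_tokens=3.40,    # assumed blended price per 1M tokens
    )
    print(f"revenue ~ ${revenue / 1e6:.0f}M, ROI ~ {revenue / capex:.1f}x")

The point of the sketch is the structure of the calculation: revenue scales linearly with sustained tokens per second, so any multiple in throughput at the same power and cost footprint flows directly into the return on the same CAPEX.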
Powerful hardware without smart orchestration wastes potential; great software without fast hardware means sluggish inference performance. NVIDIA’s full-stack innovation across compute, networking, and software enables the highest performance across diverse workloads. Explore some of the key NVIDIA hardware and software innovations.
Delivering 1.4 exaFLOPS in a single rack, the NVIDIA GB200 NVL72 unifies 72 NVIDIA Blackwell GPUs with NVIDIA NVLink™ and NVSwitch™ to deploy massive reasoning models at scale while reducing token costs to 1/10th.
The NVIDIA HGX™ B200, based on the NVIDIA Blackwell architecture, features 8 NVIDIA Blackwell GPUs connected by ultra-fast NVSwitch. It delivers high AI inference performance and energy efficiency for large-scale AI inference.
NVIDIA Dynamo is a distributed inference-serving framework for deploying models in multi-node environments at AI factory scale. It streamlines distributed serving by disaggregating inference, optimizing routing, and extending memory through data caching to cost-effective storage tiers.
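To make the disaggregation idea concrete, here is a highly simplified sketch of prefill/decode disaggregation in Python. It is an illustration of the concept only; the worker classes and KV-cache hand-off are invented for this example and do not reflect the NVIDIA Dynamo API.

    # Conceptual sketch of disaggregated serving: prefill and decode run on
    # separate worker pools and hand off the KV cache between them.
    # Illustration only; this is not the NVIDIA Dynamo API.
    from dataclasses import dataclass

    @dataclass
    class Request:
        prompt: str
        max_new_tokens: int

    class PrefillWorker:
        def run(self, req: Request):
            # Compute-bound phase: process the whole prompt once, producing
            # the KV cache and the first output token.
            kv_cache = {"prompt_len": len(req.prompt)}  # placeholder for real KV tensors
            return kv_cache, "<tok0>"

    class DecodeWorker:
        def run(self, kv_cache, first_token, max_new_tokens):
            # Memory-bound phase: reuse the transferred KV cache to generate
            # the remaining tokens one at a time.
            return [first_token] + [f"<tok{i}>" for i in range(1, max_new_tokens)]

    def serve(req: Request, prefill_pool, decode_pool):
        kv, tok0 = prefill_pool[0].run(req)   # a real router would pick workers by load
        return decode_pool[0].run(kv, tok0, req.max_new_tokens)

    print(serve(Request("Hello, world", 4), [PrefillWorker()], [DecodeWorker()]))

Separating the two phases lets each pool be sized and scheduled for its own bottleneck (compute for prefill, memory bandwidth for decode), which is the core idea behind disaggregated serving.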
TensorRT™-LLM is an open-source library for high-performance, real-time LLM inference on NVIDIA GPUs. With a modular Python runtime, PyTorch-native authoring, and a stable production API, it’s optimized to maximize throughput, minimize costs, and deliver fast user experiences.
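For a sense of what that Python runtime looks like in practice, below is a minimal sketch based on TensorRT-LLM's high-level LLM API; the model checkpoint and sampling settings are illustrative placeholders, and the exact API surface may differ by release.

    # Minimal sketch of generating text with TensorRT-LLM's high-level LLM API.
    # Requires an NVIDIA GPU and the tensorrt_llm package; the model name and
    # sampling settings below are illustrative placeholders.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # HF checkpoint, compiled into an optimized engine
    params = SamplingParams(max_tokens=64, temperature=0.7)

    outputs = llm.generate(
        ["Summarize why cost per token matters for AI factories."], params
    )
    for out in outputs:
        print(out.outputs[0].text)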
Get unmatched AI performance with NVIDIA AI inference software optimized for NVIDIA-accelerated infrastructure. The NVIDIA Blackwell Ultra, H200 GPU, NVIDIA RTX PRO™ 6000 Blackwell Server Edition, and NVIDIA RTX™ technologies deliver exceptional speed and efficiency for AI inference workloads across data centers, clouds, and workstations.
AI inference demand is surging—and NVIDIA Blackwell Ultra is built to meet that moment. Delivering 1.4 exaFLOPS in a single rack, the NVIDIA GB300 NVL72 unifies 72 NVIDIA Blackwell Ultra GPUs with NVIDIA NVLink™ and NVFP4 to power massive models with extreme efficiency, achieving 50x higher AI factory output while reducing token costs and accelerating real-time reasoning at scale.
The NVIDIA H200 GPU—part of the NVIDIA Hopper Platform— supercharges generative AI and high-performance computing (HPC) workloads with game-changing performance and memory capabilities. As the first GPU with HBM3e, the H200’s larger and faster memory fuels the acceleration of generative AI and large language models (LLMs) while advancing scientific computing for HPC workloads.
The RTX PRO 6000 Blackwell Server Edition GPU delivers supercharged inference performance across a broad range of AI models, achieving up to 5x higher performance for enterprise-scale agentic and generative AI applications compared to the previous-generation NVIDIA L40S. NVIDIA RTX PRO™ Servers, available from global system partners, bring the performance and efficiency of the Blackwell architecture to every enterprise data center.
The RTX PRO 6000 Blackwell Workstation Edition is the first desktop GPU to offer 96 GB of GPU memory. The power of the Blackwell GPU architecture, combined with large GPU memory and the NVIDIA AI software stack, enables RTX PRO-powered workstations to deliver incredible acceleration for generative AI and LLM inference directly on the desktop.
Ever wonder how complex AI trade-offs translate into real-world outcomes? Explore different points across the performance curves below to see firsthand how extreme hardware and software codesign make NVIDIA Blackwell Ultra the most performant, efficient, and profitable choice.

DeepSeek-R1, input sequence length (ISL) = 32K, output sequence length (OSL) = 8K. GB300 NVL72 with FP4 and Dynamo disaggregation; H100 with FP8 in-flight batching. Projected performance, subject to change.
Read how Amdocs built amAIz, a domain-specific generative AI platform for telcos, using NVIDIA DGX™ Cloud and NVIDIA NIM inference microservices to improve latency, boost accuracy, and reduce costs.
Learn how Snapchat enhanced the clothes shopping experience and emoji-aware optical character recognition using Triton Inference Server to scale, reduce costs, and accelerate time to production.
Discover how Amazon improved customer satisfaction by accelerating inference 5x with TensorRT.
Have an existing AI project? Apply to get hands-on experience testing and prototyping your AI solutions.
Elevate your technical skills in generative AI and large language models with our comprehensive learning paths.
Fast-track your generative AI journey with immediate, short-term access to NVIDIA NIM inference microservices and AI models—for free.
Unlock the potential of generative AI with NVIDIA NIM. This video dives into how NVIDIA NIM microservices can transform your AI deployment into a production-ready powerhouse.
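As a concrete starting point, the sketch below calls a NIM microservice through its OpenAI-compatible endpoint using the openai Python client. The endpoint URL and model ID are placeholders for a NIM you have deployed; adjust them to your environment.

    # Minimal sketch of calling a NIM microservice, which exposes
    # OpenAI-compatible endpoints. The URL and model ID below are placeholders
    # for a locally deployed NIM; adjust them to your setup.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # placeholder NIM endpoint
        api_key="not-needed-for-local-nim",
    )

    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",   # placeholder NIM model ID
        messages=[{"role": "user", "content": "What is AI inference?"}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)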
Triton Inference Server simplifies the deployment of AI models at scale in production. This open-source inference-serving software lets teams deploy trained AI models from any framework—from local storage or a cloud platform—on any GPU- or CPU-based infrastructure.
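For illustration, here is a minimal Triton client request using the tritonclient Python package. The model name and tensor names are placeholders that must match the deployed model's configuration.

    # Minimal sketch of a client request to a running Triton Inference Server.
    # "my_model", "INPUT0", and "OUTPUT0" are placeholders that must match the
    # model's config.pbtxt on the server.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    batch = np.random.rand(1, 16).astype(np.float32)   # example input tensor
    infer_input = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)

    result = client.infer(model_name="my_model", inputs=[infer_input])
    print(result.as_numpy("OUTPUT0"))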
UneeQ
Ever wondered what NVIDIA’s NIM technology is capable of? Delve into the world of mind-blowing digital humans and robots to see what NIM microservices make possible.
Explore everything you need to start developing your AI application, including the latest documentation, tutorials, technical blogs, and more.
NVIDIA data center solutions are available through select NVIDIA Partner Network (NPN) partners. Explore flexible and affordable options for accessing the latest NVIDIA data center technologies through our network of partners.
Sign up for the latest AI inference news, updates, and more from NVIDIA.