
Prompt
Hi, I'm going to provide you below a bunch of copy-pasted information from Nvidia about their Hopper, Blackwell, and Rubin data center GPU lines. I want you to make me a McKinsey-style slide with the specifications and important differences between these generations of Nvidia chips. I care most about compute, HBM, scale-up world size, interconnect, and memory bandwidth. The slide will need to be fairly information-dense! Use whatever web technologies you want for making this.

# Nvidia Data Center GPU Architecture Report: Hopper, Blackwell, and Rubin

## The evolution of AI computing infrastructure

Nvidia's data center GPU portfolio represents a rapid progression in AI computing, with each architecture generation delivering step-change performance improvements. The transition from Hopper (current) to Blackwell (2025) and Rubin (2026) reflects Nvidia's shift to an annual release cadence, fundamentally changing the pace of AI infrastructure development.

The most striking advancement is Blackwell's dual-die design achieving **208 billion transistors** - a 2.6x increase over Hopper's 80 billion - while delivering up to **25x better energy efficiency** for AI inference. Looking ahead, Rubin's integration of custom ARM CPUs and HBM4 memory promises another 3.3x performance leap, establishing a clear path toward exascale AI computing.

## 1. Hopper Architecture: Current generation powerhouse

### Complete Product Specifications

| **Model** | **H100 SXM5** | **H100 PCIe** | **H100 NVL** | **H200 SXM5** | **H200 NVL** |
|-----------|---------------|---------------|--------------|---------------|--------------|
| **CUDA Cores** | 16,896 | 14,592 | 14,592 | 16,896 | 14,592 |
| **Tensor Cores** | 528 (4th Gen) | 456 (4th Gen) | 456 (4th Gen) | 528 (4th Gen) | 456 (4th Gen) |
| **Memory** | 80GB HBM3 | 80GB HBM2e | 94GB HBM3 | 141GB HBM3e | 141GB HBM3e |
| **Memory Bandwidth** | 3.35 TB/s | 2.0 TB/s | 3.9 TB/s | 4.8 TB/s | 4.8 TB/s |
| **FP32 Performance** | 67 TFLOPS | 60 TFLOPS | 60 TFLOPS | 67 TFLOPS | 60 TFLOPS |
| **FP16/BF16 Tensor** | 1,979 TFLOPS | 1,671 TFLOPS | 1,671 TFLOPS | 1,979 TFLOPS | 1,671 TFLOPS |
| **FP8 Tensor** | 3,958 TFLOPS | 3,341 TFLOPS | 3,341 TFLOPS | 3,958 TFLOPS | 3,341 TFLOPS |
| **TDP** | 700W | 350W | 350-400W | 700W | 600W |
| **NVLink** | 900 GB/s | N/A | 600 GB/s | 900 GB/s | 900 GB/s |
| **Form Factor** | SXM5 | PCIe dual-slot | PCIe dual-slot | SXM5 | PCIe dual-slot |
| **MIG Support** | 7 instances | 7 instances | 7 instances | 7 instances | 7 instances |
| **Availability** | Oct 2022 | Oct 2022 | Q1 2023 | Q2 2024 | Q4 2024 |
| **Price (Est.)** | $25,000-30,000 | $22,000-25,000 | $30,970 | Premium over H100 | Premium over H100 |

*Tensor throughput figures include 2:4 structured sparsity.*

### Hopper Key Technologies

The Hopper architecture introduced several breakthrough technologies that redefined AI computing. The **Transformer Engine** with dynamic FP8/FP16 precision delivers up to 6x faster training for transformer models versus the prior Ampere generation. Fourth-generation Tensor Cores support structured sparsity for a further 2x throughput, while **Confidential Computing** provides hardware-based security for sensitive workloads. The architecture packs **80 billion transistors** into an 814mm² die on TSMC's custom 4N process.
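To make the Transformer Engine claim concrete, here is a minimal sketch of Hopper's FP8 path using Nvidia's `transformer_engine` PyTorch bindings. The layer shape, batch size, and scaling-recipe settings are illustrative assumptions, not tuned values, and the snippet assumes an H100/H200 with the package installed.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# DelayedScaling tracks a running amax history per tensor and derives the
# FP8 scale factors from it - the "dynamic precision" the report refers to.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

# A Transformer Engine Linear layer; dimensions are multiples of 16,
# which the FP8 GEMM path requires.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

# Inside this context the matmul runs in FP8 (E4M3/E5M2) with
# higher-precision accumulation; outside it, the same module runs in BF16.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```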
## 2. Blackwell Architecture: Revolutionary dual-die design

### Complete Product Specifications

| **Model** | **B100** | **B200** | **GB200 Superchip** | **GB200 NVL72** |
|-----------|----------|----------|---------------------|-----------------|
| **Architecture** | Dual GB100 dies | Dual GB100 dies | 2x B200 + Grace CPU | 72x B200 + 36x Grace |
| **Transistors** | 208 billion | 208 billion | 416B (GPU) + Grace | 14.9 trillion total |
| **Memory (GPU)** | 192GB HBM3e | 192GB HBM3e | 384GB HBM3e | 13.4TB HBM3e |
| **Memory Bandwidth** | 8 TB/s | 8 TB/s | 16 TB/s | 576 TB/s |
| **FP32 Performance** | 80 TFLOPS | 80 TFLOPS | 160 TFLOPS | 5,760 TFLOPS |
| **FP8 Tensor** | 10 PFLOPS | 10 PFLOPS | 20 PFLOPS | 720 PFLOPS |
| **FP4 Tensor** | 20 PFLOPS | 20 PFLOPS | 40 PFLOPS | 1,440 PFLOPS |
| **TDP** | 1000W | 1000W | ~1500W | 120kW rack |
| **NVLink** | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s | 130 TB/s total |
| **Inter-die Link** | 10 TB/s NV-HBI | 10 TB/s NV-HBI | 10 TB/s + 900GB/s C2C | Multiple domains |
| **CPU Specs** | N/A | N/A | 72 ARM cores, 480GB RAM | 2,592 ARM cores |
| **Form Factor** | SXM6 | SXM6 | Superchip module | Full rack (3,000 lbs) |
| **Availability** | Q1 2025 | Q1 2025 | Q1-Q2 2025 | Q2 2025 |
| **Price (Est.)** | $30,000-35,000 | $45,000-50,000 | $60,000-70,000 | ~$3 million |

### Blackwell's architectural breakthrough

Blackwell is the first major GPU to work around reticle limitations, pairing two full-reticle dies over a **10 TB/s NV-HBI interconnect**. The second-generation Transformer Engine introduces FP4 precision, doubling AI throughput while maintaining accuracy. With **208 billion transistors** on TSMC's enhanced 4NP process, Blackwell achieves unprecedented compute density. The GB200 NVL72 rack-scale system acts as a single **1.4 exaflops** (FP4) inference platform, enabling real-time processing of trillion-parameter models.

## 3. Rubin Architecture: The future of accelerated computing

### Announced Specifications and Roadmap

| **Model** | **Rubin (2026)** | **Rubin Ultra (2027)** | **Vera CPU** |
|-----------|------------------|------------------------|--------------|
| **Process Node** | TSMC 3nm | TSMC 3nm enhanced | TSMC 3nm |
| **Configuration** | 2x reticle-sized GPU dies | 4x GPU chiplets | Standalone/integrated |
| **Memory** | 288GB HBM4 | 1TB HBM4e | 480GB+ LPDDR5X |
| **Memory Bandwidth** | 13 TB/s | 52 TB/s (total) | 512 GB/s+ |
| **FP4 Performance** | 50 PFLOPS | 100 PFLOPS/package | N/A |
| **FP8 Performance** | 25 PFLOPS | 50 PFLOPS/package | N/A |
| **System Scale** | NVL144 (144 GPU dies) | NVL576 (576 dies) | 36 CPUs in NVL |
| **NVLink** | 6th gen, 3.6 TB/s | 1.5 PB/s system | 1.8 TB/s to GPU |
| **CPU Cores** | N/A | N/A | 88 ARM cores |
| **Power** | TBD | 600kW+ per rack | TBD |
| **Performance vs Blackwell** | 3.3x improvement | 6x+ improvement | N/A |
| **Availability** | H2 2026 | H2 2027 | H2 2026 |

### Rubin's transformative vision

Rubin marks Nvidia's transition to **3nm manufacturing** and introduces HBM4 memory with **13 TB/s of bandwidth** per GPU. Tight integration with the custom Vera ARM CPU creates a unified computing platform optimized for AI reasoning and agentic systems. Rubin Ultra's chiplet design, with four reticle-sized dies per package, pushes further still, delivering **100 petaflops of FP4 performance** per package and scaling to 576 GPU dies in a single rack.
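The rack-scale figures quoted above are largely per-GPU specifications multiplied by the scale-up domain size, which is worth verifying when condensing these tables onto a slide. A quick sanity check in Python, using only numbers from the Blackwell table (the small gap on HBM capacity reflects Nvidia quoting 13.4TB usable rather than the raw 72 x 192GB product):

```python
# Per-B200 figures taken from the Blackwell table above.
b200 = {"hbm_tb": 0.192, "mem_bw_tbs": 8, "fp8_pflops": 10, "fp4_pflops": 20}

def scale_up(per_gpu: dict, n_gpus: int) -> dict:
    """Aggregate per-GPU specs across an NVLink scale-up domain."""
    return {k: round(v * n_gpus, 1) for k, v in per_gpu.items()}

print(scale_up(b200, 72))
# {'hbm_tb': 13.8, 'mem_bw_tbs': 576, 'fp8_pflops': 720, 'fp4_pflops': 1440}
# vs. the GB200 NVL72 column: 13.4TB, 576 TB/s, 720 PFLOPS, 1,440 PFLOPS
```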
## 4. Cross-generation performance comparison

### Computational Performance Evolution

| **Metric** | **H100** | **H200** | **B200** | **B200 vs H100** | **Rubin (2026)** | **Rubin vs H100** |
|------------|----------|----------|----------|------------------|------------------|-------------------|
| **FP64 TFLOPS** | 34 | 34 | 40 | 1.2x | TBD | TBD |
| **FP32 TFLOPS** | 67 | 67 | 80 | 1.2x | ~160 | 2.4x |
| **TF32 Tensor** | 989 TFLOPS | 989 TFLOPS | 2.5 PFLOPS | 2.5x | ~7.5 PFLOPS | 7.6x |
| **FP16/BF16 Tensor** | 1.98 PFLOPS | 1.98 PFLOPS | 5 PFLOPS | 2.5x | ~15 PFLOPS | 7.6x |
| **FP8 Tensor** | 3.96 PFLOPS | 3.96 PFLOPS | 10 PFLOPS | 2.5x | 25 PFLOPS | 6.3x |
| **FP4 Tensor** | N/A | N/A | 20 PFLOPS | New | 50 PFLOPS | N/A |
| **Memory Capacity** | 80GB | 141GB | 192GB | 2.4x | 288GB | 3.6x |
| **Memory Bandwidth** | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 2.4x | 13 TB/s | 3.9x |
| **NVLink Speed** | 900 GB/s | 900 GB/s | 1.8 TB/s | 2x | 3.6 TB/s | 4x |
| **Power Efficiency** | Baseline | 1.0x | 2.5x | 2.5x | ~8x | 8x |

## 5. Architecture improvements between generations

### Hopper to Blackwell transformation

The transition from Hopper to Blackwell is a fundamental architectural shift rather than an incremental improvement. Blackwell's **dual-die design** overcomes reticle manufacturing constraints while delivering 2.6x more transistors. The second-generation Transformer Engine with **FP4 precision support** doubles AI throughput without accuracy loss. Memory bandwidth increases 2.4x to **8 TB/s**, while NVLink 5.0 doubles interconnect speed to **1.8 TB/s**, enabling unprecedented multi-GPU scaling.

### Blackwell to Rubin evolution

Rubin's move to **3nm process technology** enables another generational leap in compute density and efficiency. The integration of custom **88-core Vera ARM CPUs** creates a unified heterogeneous computing platform. HBM4 memory technology delivers **13 TB/s bandwidth** - a 63% increase over Blackwell. Performance improvements of **3.3x over Blackwell** position Rubin for next-generation AI workloads, including reasoning systems and autonomous agents.
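The improvement multipliers in the comparison table are simple ratios against H100, so they can be recomputed directly from the spec tables. A short illustrative sketch, with values copied from the tables above:

```python
# (H100, B200, Rubin) values from the cross-generation table.
specs = {
    "FP8 Tensor (PFLOPS)":     (3.96, 10.0, 25.0),
    "Memory capacity (GB)":    (80.0, 192.0, 288.0),
    "Memory bandwidth (TB/s)": (3.35, 8.0, 13.0),
    "NVLink (TB/s)":           (0.9, 1.8, 3.6),
}

for metric, (h100, b200, rubin) in specs.items():
    print(f"{metric}: B200 {b200 / h100:.1f}x, Rubin {rubin / h100:.1f}x vs H100")
# FP8 prints 2.5x and 6.3x, matching the table's ratio columns.
```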
## 6. Target workloads and specialized capabilities

### Workload Optimization by Architecture

| **Workload Type** | **Hopper Strengths** | **Blackwell Advantages** | **Rubin Focus** |
|-------------------|---------------------|-------------------------|-----------------|
| **LLM Training** | Models up to 175B params | Trillion-parameter models | Multi-modal training |
| **LLM Inference** | Production deployment | 30x faster than H100 | Massive context windows |
| **Scientific Computing** | Strong FP64 performance | Enhanced double precision | Quantum simulation |
| **Database/Analytics** | Standard acceleration | 6x with decompression engine | Real-time analytics |
| **Computer Vision** | Excellent CNN performance | Advanced video processing | Multi-modal AI |
| **Recommendation Systems** | MIG multi-tenancy | Massive embedding tables | Personalized AI agents |
| **Graph Neural Networks** | Good baseline | 2.25x improvement | Complex graph reasoning |

### Special features comparison

Each architecture introduces unique capabilities addressing evolving AI requirements. Hopper's **Confidential Computing** and **Multi-Instance GPU** support enable secure multi-tenant deployments. Blackwell's **RAS Engine** provides AI-powered predictive maintenance, while its **Decompression Engine** accelerates database workloads by 6x. Rubin's planned **disaggregated architecture** will split prefill and decode processing for inference workloads, with the specialized **Rubin CPX** variant targeting massive-context applications.

## Conclusion

Nvidia's data center GPU evolution from Hopper through Blackwell to Rubin demonstrates unusual innovation velocity in response to explosive AI demand. Nvidia's own keynote figures claim roughly **900x rack-scale performance gains** from Hopper to Rubin, with cost per operation falling by 99.97%. Blackwell's imminent arrival with its dual-die architecture and FP4 support will enable trillion-parameter model deployments, while Rubin's 2026 introduction of HBM4 and 3nm technology lays the foundation for autonomous AI systems. The shift to an **annual release cadence** ensures continuous technological advancement, with each generation delivering transformative rather than incremental capabilities. For organizations planning AI infrastructure investments, this roadmap provides clear visibility into performance trajectories, enabling strategic decisions aligned with the exponentially growing computational requirements of next-generation AI applications.
A system prompt was added to support web rendering