Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels

Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.

April 5, 2026 · 10 min

Squeeze every FLOP: Profiling AI models

This guide covers profiling at the code, framework, and hardware levels using PyTorch and Nsight Systems.

March 29, 2026 · 4 min