CUDA | Vectors & Verbs

Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels

Profiling Gemma 27B on dual RTX 5090 (Blackwell) GPUs, tracing a bottleneck to a legacy CUDA code path, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.

Squeeze every FLOP: Profiling AI models

This guide covers profiling at the code, framework, and hardware levels using PyTorch and Nsight Systems.