Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels

Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a bottleneck in a legacy CUDA path, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.