The Architecture of Precision: Variations in Model Quantization

A summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.

April 6, 2026 · 10 min

Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels

Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.

April 5, 2026 · 10 min

Squeeze Every FLOP: Profiling AI Models

A guide to profiling at the code, framework, and hardware levels with PyTorch and Nsight Systems.

March 29, 2026 · 4 min

Mission Impossible: Fitting Trillion-Parameter Giants into 80GB GPUs

An introduction to optimizations for Large Language Models, covering GPU utilization, precision control, and memory management.

January 11, 2026 · 8 min