The Architecture of Precision: Variations in Model Quantization
A summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.
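The core idea behind that size/speed trade-off can be shown in a few lines. Below is a minimal sketch of symmetric int8 quantization using NumPy; the weight matrix and scale computation are illustrative assumptions, not any particular library's implementation:

```python
import numpy as np

# Illustrative float32 "weights" standing in for a model tensor.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to recover an approximation of the original values.
deq = q.astype(np.float32) * scale

print("max abs error:", float(np.abs(weights - deq).max()))
print("memory ratio:", q.nbytes / weights.nbytes)  # int8 is 1/4 of float32
```

Each value now occupies one byte instead of four, which is where the ‘smaller is faster’ effect comes from: less memory traffic per weight, at the cost of a bounded rounding error (at most half the scale per element).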
Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.
An introduction to optimizations for Large Language Models, covering GPU utilization, precision control, and memory management.