The Architecture of Precision: Variations in model quantizations
Summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.
Summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.
Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.
This guide covers profiling at the code, framework, and hardware levels using PyTorch and Nsight Systems.
An introduction to optimizations for Large Language Models, covering GPU utilization, precision control, and memory management.