The Architecture of Precision: Variations in Model Quantization
A summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.
Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels
Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.
Squeeze Every FLOP: Profiling AI Models
This guide covers profiling at the code, framework, and hardware levels using PyTorch and Nsight Systems.
Probabilistic Report Cards: LLM Evaluation Metrics
From N-Grams to LLM-as-a-Judge: A deep dive into the evolution of evaluation metrics.
The USB-C of AI: The Model Context Protocol
MCP is the open standard for connecting AI models to data and tools. Discover how Anthropic’s new protocol solves the $N \times M$ integration problem, creating a plug-and-play ecosystem for AI agents.
Electronic Executives: RAG, ReAct, and MCP
A deep dive into the cognitive architectures of modern AI agents, exploring Retrieval-Augmented Generation (RAG), the ReAct reasoning pattern, and the Model Context Protocol (MCP).
The Need for Speed: KV Caching and Memory Optimization at Inference
An introduction to KV Caching and its role in optimizing Transformer inference.
Crafty Patchwork: Parameter-Efficient Fine-Tuning
An introduction to Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA, QLoRA, and more.
Mission Impossible: Fitting Trillion-Parameter Giants into 80GB GPUs
An introduction to optimizations for Large Language Models, covering GPU utilization, precision control, and memory management.
Anatomy of Trillion-Parameter Switchboards: Understanding Feedforward Blocks
Exploring the hidden layers of trillion-parameter switchboards: Feedforward Neural Networks and Activation Functions.