The Architecture of Precision: Variations in Model Quantization
A summary of how quantization bridges the gap between trillion-parameter models and the hardware they run on, and why ‘smaller’ is almost always ‘faster’.
Speed of Thought: Optimizing Gemma 27B with Custom Triton Kernels
Profiling Gemma 27B on dual RTX 5090 Blackwell GPUs, discovering a legacy CUDA path bottleneck, and building a custom Triton kernel that shaves 1.1 seconds off every inference batch.
Squeeze Every FLOP: Profiling AI Models
This guide covers profiling at the code, framework, and hardware levels using PyTorch and Nsight Systems.
Probabilistic Report Cards: LLM Evaluation Metrics
From N-Grams to LLM-as-a-Judge: A deep dive into the evolution of evaluation metrics.
The USB-C of AI: The Model Context Protocol
MCP is the open standard for connecting AI models to data and tools. Discover how Anthropic’s new protocol solves the $N \times M$ integration problem, creating a plug-and-play ecosystem for AI agents.
Electronic Executives: RAG, ReAct, and MCP
A deep dive into the cognitive architectures of modern AI agents, exploring Retrieval-Augmented Generation (RAG), the ReAct reasoning pattern, and the Model Context Protocol (MCP).
The Need for Speed: KV Caching and Memory Optimization at Inference
An introduction to KV Caching and its role in optimizing Transformer inference.
Crafty Patchwork: Parameter-Efficient Fine-Tuning
An introduction to Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA, QLoRA, and more.
Mission Impossible: Fitting Trillion-Parameter Giants into 80GB GPUs
An introduction to optimizations for Large Language Models, covering GPU utilization, precision control, and memory management.
Anatomy of Trillion-Parameter Switchboards: Understanding Feedforward Blocks
Exploring the hidden layers of trillion-parameter switchboards: Feedforward Neural Networks and Activation Functions.