Projects

Product work, experiments, and research with focused notes and visuals.

The Hidden Bottleneck in MLA Serving: Reconstruction GEMMs and the L2 Cache Barrier

Profiling MLA attention on an H100 reveals that reconstruction GEMMs consume 61% of attention-layer time. INT4 quantization should help but doesn't, because the weights already fit in L2 cache.
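A back-of-envelope version of the cache argument, using assumed DeepSeek-style dimensions (kv_lora_rank 512, 128 heads of dim 128; these numbers are illustrative, not taken from the post):

```python
# Back-of-envelope: do MLA reconstruction weights fit in L2?
# Dimensions below are assumed DeepSeek-style values, not the post's.
KV_LORA_RANK = 512          # latent (compressed KV) dimension
N_HEADS = 128
HEAD_DIM = 128
L2_BYTES = 50 * 2**20       # H100 L2 cache: 50 MB

def reconstruction_weight_bytes(bits: int) -> int:
    # W_UK and W_UV each map the latent up to n_heads * head_dim,
    # so one layer carries two rank x (heads * dim) matrices.
    params = 2 * KV_LORA_RANK * N_HEADS * HEAD_DIM
    return params * bits // 8

for bits in (16, 4):
    b = reconstruction_weight_bytes(bits)
    verdict = "fits" if b <= L2_BYTES else "spills"
    print(f"{bits:>2}-bit weights/layer: {b / 2**20:5.1f} MiB -> {verdict} in L2")
```

Under these assumptions both precisions are L2-resident (32 MiB and 8 MiB against 50 MB), so quantization cannot buy back DRAM bandwidth that was never being spent.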

Tiny-GEMM: Packed INT4 Triton GEMM for Decode-Heavy LLM Inference

Packed INT4 GEMM kernel in Triton for decode-heavy LLM inference, with hardware-counter attribution and a regime model for when quantization helps. Up to 3.7× speedup on A10G.
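A minimal NumPy sketch of one common packing scheme, two signed nibbles per byte, to show what "packed INT4" means; this is not necessarily Tiny-GEMM's exact layout, and the real kernel fuses unpack and dequant into the Triton GEMM:

```python
import numpy as np

def pack_int4(w_q: np.ndarray) -> np.ndarray:
    """Pack int4 values in [-8, 7] along axis 0, two per uint8."""
    assert w_q.shape[0] % 2 == 0
    lo = (w_q[0::2] & 0xF).astype(np.uint8)
    hi = (w_q[1::2] & 0xF).astype(np.uint8)
    return lo | (hi << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4; sign-extend each nibble."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty((packed.shape[0] * 2,) + packed.shape[1:], np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

# Quantize a weight matrix, then run a dequantize-then-GEMM reference.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
scale = np.abs(w).max(axis=0) / 7.0            # per-column symmetric scale
w_q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
packed = pack_int4(w_q)                        # half the bytes of int8

x = rng.standard_normal((4, 64)).astype(np.float32)
y = x @ (unpack_int4(packed).astype(np.float32) * scale)
print(np.abs(y - x @ w).max())                 # small quantization error
```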

FlashAttention & Kernel Development on Atalla Ax01

Kernel development and HW/SW co-design for Atalla, a student-built weight-stationary systolic array AI accelerator. FlashAttention mapping, im2col convolution, tiled GEMM, and PyTorch backend integration.
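The FlashAttention mapping hinges on the online-softmax recurrence, which lets attention stream over K/V tiles without materializing the full score matrix. A NumPy sketch of that recurrence (block size and shapes are illustrative; Atalla's tile sizes and PE dataflow are not shown):

```python
import numpy as np

def flash_attention(q, k, v, block=16):
    """Tiled attention: stream K/V blocks, keep a running max and sum."""
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)              # running row max of logits
    l = np.zeros(n)                      # running softmax denominator
    for j in range(0, n, block):
        s = q @ k[j:j+block].T / np.sqrt(d)       # logits for this tile
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        alpha = np.exp(m - m_new)                 # rescale old partials
        l = alpha * l + p.sum(axis=1)
        out = alpha[:, None] * out + p @ v[j:j+block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
ref = np.exp(q @ k.T / np.sqrt(32))
ref = ref / ref.sum(axis=1, keepdims=True) @ v
print(np.abs(flash_attention(q, k, v) - ref).max())   # matches dense attention
```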

5-Stage Pipelined RISC-V CPU

Fully pipelined 5-stage RISC-V processor in SystemVerilog with hazard detection, data forwarding, and a 2-bit branch predictor. Synthesized in Cadence targeting 70 MHz.
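A behavioral model of a 2-bit saturating-counter (bimodal) predictor, the standard scheme the blurb refers to; the table size and addresses here are illustrative, not the CPU's:

```python
STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

class BimodalPredictor:
    def __init__(self, entries=256):
        self.table = [WEAK_NT] * entries       # one 2-bit counter per entry

    def predict(self, pc: int) -> bool:
        return self.table[(pc >> 2) % len(self.table)] >= WEAK_T

    def update(self, pc: int, taken: bool):
        i = (pc >> 2) % len(self.table)
        c = self.table[i]
        # Saturating inc/dec: it takes two mispredictions in a row
        # to flip a strong prediction, which keeps loop exits cheap.
        self.table[i] = min(c + 1, STRONG_T) if taken else max(c - 1, STRONG_NT)

# A loop branch: taken 9 times per trip, then falls through once.
bp, hits = BimodalPredictor(), 0
for trip in range(3):
    for i in range(10):
        taken = i < 9
        hits += bp.predict(0x80000040) == taken
        bp.update(0x80000040, taken)
print(f"{hits}/30 correct")   # only one miss per loop exit after warm-up
```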

Memory-Constrained Graph Scheduling

A scheduler for ML accelerator computation DAGs that minimizes latency while respecting tight on-chip SRAM capacity constraints. Google MLSys 2026, Track A.
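A toy sketch of the problem setting, not the paper's algorithm: choose a topological order of a compute DAG so that live tensor bytes stay under an SRAM budget. The greedy rule and the graph below are made up for illustration:

```python
GRAPH = {  # node -> (output_bytes, list of input nodes)
    "a": (4, []), "b": (8, []),
    "c": (2, ["a", "b"]), "d": (6, ["a"]),
    "e": (2, ["c", "d"]),
}

def schedule(graph, sram):
    consumers = {n: {m for m, (_, d) in graph.items() if n in d} for n in graph}
    done, order, live = set(), [], {}      # live: node -> output bytes

    def run(n, commit):
        size, deps = graph[n]
        peak = sum(live.values()) + size   # inputs + new output coexist
        after = dict(live, **{n: size})
        for d in deps:                     # free fully consumed inputs
            if consumers[d] <= done | {n}:
                after.pop(d, None)
        if commit:
            live.clear(); live.update(after)
        return peak

    while len(order) < len(graph):
        ready = [n for n in graph if n not in done
                 and all(d in done for d in graph[n][1])]
        n = min(ready, key=lambda m: run(m, commit=False))  # lowest peak first
        peak = run(n, commit=True)
        if peak > sram:
            raise RuntimeError(f"{n}: peak {peak} exceeds SRAM {sram}")
        done.add(n); order.append(n)
    return order

print(schedule(GRAPH, sram=20))   # -> ['a', 'd', 'b', 'c', 'e']
```

Greedy ordering like this can fail where a better order exists, which is exactly why the scheduling problem is nontrivial.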

Sparse Attention Kernel for DeepSeek V3.2 on B200

A fused CUDA kernel implementing DeepSeek Sparse Attention on NVIDIA B200 (Blackwell). FlashInfer AI Kernel Generation Contest, MLSys 2026.
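DeepSeek Sparse Attention selects, per query, a small top-k subset of KV tokens scored by a lightweight indexer, and attends densely over that subset. A NumPy sketch of the selection-then-attend pattern (the fused kernel does this in one pass; the indexer here is a stand-in dot product, and causal masking is omitted):

```python
import numpy as np

def sparse_attention(q, k, v, qi, ki, top_k=8):
    n, d = q.shape
    idx_scores = qi @ ki.T                            # cheap indexer logits
    keep = np.argsort(-idx_scores, axis=1)[:, :top_k]  # top-k tokens per query
    out = np.empty_like(q)
    for i in range(n):                                # gather + dense attention
        s = q[i] @ k[keep[i]].T / np.sqrt(d)
        p = np.exp(s - s.max()); p /= p.sum()
        out[i] = p @ v[keep[i]]
    return out

rng = np.random.default_rng(0)
n, d, di = 32, 16, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
qi, ki = (rng.standard_normal((n, di)) for _ in range(2))  # indexer projections
print(sparse_attention(q, k, v, qi, ki).shape)             # (32, 16)
```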

VeriGen: Agents for Accelerated Chip Design

Integrated RTL design-verification tool that pairs testbench generation with script and trace analysis, using multi-agent collaboration to accelerate RTL development.