The Hidden Bottleneck in MLA Serving: Reconstruction GEMMs and the L2 Cache Barrier
Profiling MLA attention on an H100 shows that reconstruction GEMMs consume 61% of attention-layer time. INT4 quantization should help but doesn't: the weights already fit in L2 cache, so shrinking them saves no DRAM bandwidth.
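The claim that the weights fit in L2 can be checked with quick arithmetic. The sketch below assumes DeepSeek-V2-style MLA dimensions (kv_lora_rank 512, 128 heads of dim 128), which are an assumption rather than something stated here, together with the H100's 50 MB L2.

```python
# Back-of-the-envelope check of the L2-cache claim. The MLA dimensions
# below are ASSUMED (DeepSeek-V2-style), not measured from the profiled model.
KV_LORA_RANK = 512
NUM_HEADS = 128
HEAD_DIM = 128
L2_BYTES = 50 * 1024 * 1024  # H100 L2 cache: 50 MB

def weight_bytes(bits_per_param: int) -> int:
    """Size of the two reconstruction (up-projection) weight matrices,
    W_UK and W_UV, each of shape (kv_lora_rank, num_heads * head_dim)."""
    params = 2 * KV_LORA_RANK * NUM_HEADS * HEAD_DIM
    return params * bits_per_param // 8

fp16 = weight_bytes(16)  # 33,554,432 bytes = 32 MiB
int4 = weight_bytes(4)   #  8,388,608 bytes =  8 MiB

# Both precisions fit in L2, so the GEMMs stream weights from L2 either
# way -- INT4 shrinks traffic that was never the bottleneck.
print(fp16 < L2_BYTES, int4 < L2_BYTES)  # True True
```

Under these assumed dimensions, both the FP16 and INT4 reconstruction weights sit comfortably under the 50 MB L2, which is why cutting weight size alone does not speed up the GEMMs.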