Memory-Constrained Graph Scheduling
Given a computation DAG of MatMul and pointwise ops, produce an execution schedule that minimizes total latency while never exceeding fast-memory (SRAM) capacity. The solver is written in C++ and implements fusion with cycle detection, granularity search, zig-zag tile ordering, retention decisions, and recomputation.
The Problem
Modern AI accelerators live with a hard constraint: fast on-chip SRAM is tiny, neural network tensors are not. Compute can only happen on data currently in fast memory — everything else sits in slow HBM, bandwidth-limited and expensive to move. A scheduler's job is to decide how data flows between those levels to minimize total execution latency.
The competition hands you a computation DAG — MatMul and pointwise ops (ReLU, addition, etc.) — and asks: produce an execution schedule that minimizes total latency without ever exceeding fast-memory capacity. You control how ops are grouped, how tiles are sized, which direction tiles are traversed, and which tensors stay resident in fast memory.
Benchmarks: 5 graphs (5–103 ops each, numbered 1, 5, 9, 13, 17). Scored as geometric mean speedup over a naive unfused baseline. Deadline: April 2026.
The Latency Model
Each tile's execution time is determined by whichever of compute or memory is the bottleneck — the classic roofline: tile_latency = max(FLOPs / peak_compute, bytes_moved / HBM_bandwidth).
Total schedule latency is the sum over all tiles across all subgraphs. The optimizer's goal is to push each tile toward the compute-bound side of the roofline — large enough to amortize memory transfer, small enough to fit in SRAM.
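A minimal sketch of this roofline model. The hardware constants here (100 TFLOP/s compute, 1 TB/s HBM bandwidth) are illustrative assumptions, not the competition's actual numbers:

```cpp
#include <algorithm>
#include <cassert>

// Assumed hardware constants -- illustrative only.
constexpr double kPeakFlops = 100e12;   // 100 TFLOP/s
constexpr double kHbmBandwidth = 1e12;  // 1 TB/s

// Roofline latency of one tile: the slower of compute and HBM transfer wins.
double tileLatency(double flops, double bytesMoved) {
    double computeTime = flops / kPeakFlops;
    double memoryTime  = bytesMoved / kHbmBandwidth;
    return std::max(computeTime, memoryTime);
}
```

A tile is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the machine's ratio of peak compute to bandwidth; the optimizer tries to keep tiles on that side of the knee.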
The Four Optimization Levers
1. Fusion
When consecutive ops are fused into one subgraph, their shared intermediate tensor becomes ephemeral — computed and consumed entirely in fast memory, never touching HBM. This eliminates both the capacity cost of holding the tensor and the bandwidth cost of a round-trip through HBM. The constraint: all ops in a fused subgraph share a single tile granularity, so aggressive fusion can force a suboptimal granularity compromise.
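The bandwidth side of that saving is easy to quantify: an unfused schedule pays one HBM store (producer) plus one HBM load (consumer) for the intermediate. A sketch, using an assumed 1 TB/s bandwidth:

```cpp
#include <cassert>

constexpr double kHbmBandwidth = 1e12;  // assumed 1 TB/s, illustrative

// Seconds of HBM traffic eliminated by fusing producer and consumer:
// the intermediate tensor is neither written out nor read back.
double fusionSavingSeconds(double intermediateBytes) {
    return 2.0 * intermediateBytes / kHbmBandwidth;  // one store + one load
}
```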
2. Granularity [w, h, k]
Controls spatial tiling of the output tensor and the k-dimension partitioning of MatMuls. Larger tiles have better arithmetic intensity (fewer tiles, less overhead) but need more SRAM. Split-K — small k — is particularly useful for chained MatMuls: streaming thin slices of B and C keeps the accumulator resident in fast memory, enabling fusion of chains that would OOM at larger k.
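For a single MatMul, the SRAM cost of a granularity choice can be sketched as the footprint of one tile's operands plus its accumulator. The formula and the fp32 element size are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>

// SRAM footprint of one MatMul tile at granularity [w, h, k]:
// an h-by-k slice of A, a k-by-w slice of B, and the h-by-w accumulator.
// 4-byte (fp32) elements assumed.
constexpr std::int64_t kBytesPerElem = 4;

std::int64_t matmulTileBytes(std::int64_t w, std::int64_t h, std::int64_t k) {
    return (h * k + k * w + h * w) * kBytesPerElem;
}

bool fitsInSram(std::int64_t w, std::int64_t h, std::int64_t k,
                std::int64_t sramBytes) {
    return matmulTileBytes(w, h, k) <= sramBytes;
}
```

Note how shrinking k shrinks the A and B slices but leaves the h-by-w accumulator untouched — which is exactly why split-K lets the accumulator stay resident while operands stream through.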
3. Traversal Order
For MatMul C = A × B tiled into a grid, traversal order determines which tiles share SRAM-resident data. Raster order (left-to-right, top-to-bottom) reuses A strips but reloads B columns on each row. Zig-zag alternates direction on each row, reusing B strips across adjacent rows — cutting unique HBM loads by up to 50% for typical shapes.
4. Recomputation
In diamond/skip-connection patterns (ubiquitous in transformers), a tensor produced by one op is consumed by two downstream ops in different subgraphs. The options: spill it to HBM (pay bandwidth to store + reload), retain it in SRAM (pay capacity to hold it), or recompute it in each consuming subgraph (pay compute to avoid the HBM trip). For cheap pointwise producers, recomputation is often the right call — same idea as FlashAttention's recomputation of attention weights in the backward pass.
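The three options reduce to a cost comparison. A simplified sketch, under the assumption that retaining costs no extra latency but is only feasible when SRAM has room (all names and the cost model are illustrative, not the solver's exact heuristic):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Pick the cheapest way to deliver a tensor to its second consumer.
// Costs are in seconds; capacity pressure makes "retain" infeasible
// (infinite cost) when SRAM is tight.
std::string chooseStrategy(double spillCost,      // store + reload via HBM
                           double recomputeCost,  // re-run the producer
                           bool sramHasRoom) {
    double retainCost = sramHasRoom ? 0.0 : 1e30;  // holding adds no latency
    double best = std::min({spillCost, recomputeCost, retainCost});
    if (best == retainCost && sramHasRoom) return "retain";
    return recomputeCost <= spillCost ? "recompute" : "spill";
}
```

For a cheap pointwise producer, recomputeCost is tiny relative to a round-trip through HBM, which is why recomputation wins so often in practice.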
The Solver
The solver is implemented in C++. Fusion is a greedy hill-climber: evaluate all adjacent subgraph pairs, score each candidate merge by estimated latency reduction, commit the highest-scoring merge that passes the capacity check, and repeat until no beneficial merge remains. Before committing any merge, the solver runs a BFS cycle check — merging two subgraphs can create a cycle in the subgraph DAG, which is a correctness violation.
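The cycle check itself is compact: merging subgraph u into its direct successor v is unsafe exactly when some indirect path u → … → v exists, because after the merge that path would re-enter the merged node. A BFS sketch over an assumed adjacency-list representation of the subgraph DAG:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Would merging subgraph u into its successor v create a cycle? Yes iff a
// path u -> ... -> v exists other than the direct edge. BFS from u's other
// successors, looking for v. `adj` is the subgraph DAG's adjacency list.
bool mergeCreatesCycle(const std::vector<std::vector<int>>& adj, int u, int v) {
    std::vector<char> seen(adj.size(), 0);
    std::queue<int> q;
    for (int s : adj[u])
        if (s != v && !seen[s]) { seen[s] = 1; q.push(s); }
    while (!q.empty()) {
        int n = q.front(); q.pop();
        if (n == v) return true;  // indirect path u -> ... -> v found
        for (int s : adj[n])
            if (!seen[s]) { seen[s] = 1; q.push(s); }
    }
    return false;
}
```

The classic failing case is a triangle 0 → 1 → 2 with a shortcut 0 → 2: merging 0 and 2 would leave 1 both downstream and upstream of the merged node.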
After fusion, each subgraph gets an independent granularity search: iterate tile sizes from large to small and pick the largest [w, h, k] that fits within the SRAM budget, using the roofline model to bias toward compute-bound configurations. Tile order defaults to zig-zag for all MatMul subgraphs; retention decisions are made greedily based on next-use distance and current capacity pressure.
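A sketch of that large-to-small search. The candidate ladder and the fp32 footprint formula are illustrative assumptions, not the solver's actual parameters:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Greedy granularity search: walk candidate [w, h, k] tiles from large to
// small and return the first whose footprint fits the SRAM budget.
struct Gran { std::int64_t w, h, k; };

std::int64_t footprintBytes(const Gran& g) {
    return (g.h * g.k + g.k * g.w + g.h * g.w) * 4;  // fp32 A, B, C tiles
}

Gran searchGranularity(std::int64_t sramBytes) {
    const std::array<Gran, 4> ladder = {{
        {256, 256, 256}, {128, 128, 64}, {64, 64, 32}, {32, 32, 16}}};
    for (const Gran& g : ladder)
        if (footprintBytes(g) <= sramBytes) return g;
    return ladder.back();  // fall back to the smallest candidate
}
```

Because larger tiles strictly dominate on arithmetic intensity in this model, first-fit from the top of the ladder is equivalent to best-fit here.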