How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
going from naive to 94% of cuBLAS, one kernel at a time
Hi! I'm Robert, an EE student at Purdue University passionate about optimized compute systems for AI. I'm interested in the intersection of compilers, ML systems, and hardware that turns models into fast, usable systems at scale. Currently exploring accelerator operator libraries, compiler-hardware co-design, and agent-based systems. Outside of that, I build side projects for fun and love all things sports, nature, and jazz.
Packed INT4 GEMM kernel in Triton for decode-heavy LLM inference with hardware counter attribution and a regime model for when quantization helps. Up to 3.7× speedup on A10G.
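The packing scheme such a kernel relies on can be sketched in plain NumPy. This is an illustrative sketch only, not the Triton kernel's actual layout: the nibble order and helper names here are assumptions.

```python
import numpy as np

def pack_int4(x):
    """Pack an even-length array of signed 4-bit values (-8..7) into bytes,
    two values per byte: low nibble first, high nibble second (assumed order)."""
    x = np.asarray(x, dtype=np.int8)
    lo = x[0::2].astype(np.uint8) & 0x0F         # low nibble
    hi = (x[1::2].astype(np.uint8) & 0x0F) << 4  # high nibble
    return lo | hi

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes, sign-extending each nibble."""
    lo = (packed & 0x0F).astype(np.int16)
    hi = (packed >> 4).astype(np.int16)
    lo = np.where(lo >= 8, lo - 16, lo)  # nibbles >= 8 encode negatives
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = lo
    out[1::2] = hi
    return out
```

Halving the bytes per weight is what pays off in decode-heavy inference, where GEMMs are memory-bound on weight traffic.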
Fully pipelined 5-stage RISC-V processor in SystemVerilog with hazard detection, data forwarding, and a 2-bit branch predictor. Synthesized in Cadence targeting 70 MHz.
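The 2-bit branch predictor follows the standard saturating-counter scheme; a behavioral sketch in Python (class name and initial state are illustrative, and a real predictor keeps one such counter per branch-history-table entry):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
    taken. Each observed outcome moves the counter one step toward that
    direction, so a single mispredicted iteration cannot flip the prediction."""

    def __init__(self, init=1):
        self.counter = init  # start weakly not-taken (assumed reset state)

    def predict(self):
        return self.counter >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, 3)
        else:
            self.counter = max(self.counter - 1, 0)
```

The hysteresis is the point: a loop-closing branch that is taken many times and falls through once stays predicted taken, costing one misprediction per loop exit instead of two.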
A scheduler for ML accelerator computation DAGs that minimizes latency while respecting tight on-chip SRAM capacity constraints. Google MLSys 2026, Track A.
Research and industry roles spanning hardware, systems, and applied AI.
loop reordering, tiling, and multithreading
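The tiling idea can be illustrated with a cache-blocked matmul; the tile size and loop order below are one reasonable choice, not a tuned configuration:

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Cache-blocked matmul: iterate over (tile x tile) blocks so each block
    of A and B is reused while it is still hot in cache."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for k0 in range(0, K, tile):      # k outside j: the A block is reused
            for j0 in range(0, N, tile):  # across the whole row of C blocks
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C
```

Multithreading then falls out naturally: the i0 (or i0, j0) block iterations are independent and can be handed to a thread pool.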
distributed training via model partitioning
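Model partitioning in its simplest form assigns contiguous layer ranges to devices; a toy sketch under that assumption (helper names are hypothetical, and a real implementation would overlap communication with compute):

```python
def partition_layers(layers, n_parts):
    """Split a list of layers into n contiguous stages, one per device,
    balancing the number of layers per stage."""
    k, r = divmod(len(layers), n_parts)
    stages, start = [], 0
    for i in range(n_parts):
        size = k + (1 if i < r else 0)  # first r stages take one extra layer
        stages.append(layers[start:start + size])
        start += size
    return stages

def forward(stages, x):
    """Run activations through each stage in turn, as if each stage lived on
    a different device and activations were shipped between them."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x
```

In practice stage boundaries are chosen to balance compute time rather than layer count, and microbatches are pipelined through the stages to keep every device busy.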
Reach me at robertzhang930@gmail.com for business inquiries or just to say hi. Connect with me on the platforms below.