How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
going from naive to 94% of cuBLAS, one kernel at a time
Hi! I'm Robert, an EE student at Purdue University passionate about optimized compute systems for AI. I'm interested in the intersection of compilers, ML systems, and hardware that turns models into fast, usable systems at scale. Currently exploring accelerator operator libraries, compiler-hardware co-design, and agent-based systems. Outside of that, I build side projects for fun and love all things sports, nature, and jazz.
Packed INT4 GEMM kernel in Triton for decode-heavy LLM inference with hardware counter attribution and a regime model for when quantization helps. Up to 3.7× speedup on A10G.
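The packing scheme such a kernel relies on can be sketched in plain NumPy. This is an illustrative sketch only, not the Triton kernel's actual layout: the nibble order and helper names here are assumptions.

```python
import numpy as np

def pack_int4(x):
    """Pack an even-length array of signed 4-bit values (-8..7) into bytes,
    two values per byte: low nibble first, high nibble second (assumed order)."""
    x = np.asarray(x, dtype=np.int8)
    lo = x[0::2].astype(np.uint8) & 0x0F         # low nibble
    hi = (x[1::2].astype(np.uint8) & 0x0F) << 4  # high nibble
    return lo | hi

def unpack_int4(packed):
    """Recover signed 4-bit values from packed bytes, sign-extending each nibble."""
    lo = (packed & 0x0F).astype(np.int16)
    hi = (packed >> 4).astype(np.int16)
    lo = np.where(lo >= 8, lo - 16, lo)  # nibbles >= 8 encode negatives
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = lo
    out[1::2] = hi
    return out
```

Halving the bytes per weight is what pays off in decode-heavy inference, where GEMMs are memory-bound on weight traffic.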
Fully pipelined 5-stage RISC-V processor in SystemVerilog with hazard detection, data forwarding, and a 2-bit branch predictor. Synthesized in Cadence targeting 70 MHz.
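The 2-bit branch predictor follows the standard saturating-counter scheme; a behavioral sketch in Python (class name and initial state are illustrative, and a real predictor keeps one such counter per branch-history-table entry):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
    taken. Each observed outcome moves the counter one step toward that
    direction, so a single mispredicted iteration cannot flip the prediction."""

    def __init__(self, init=1):
        self.counter = init  # start weakly not-taken (assumed reset state)

    def predict(self):
        return self.counter >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, 3)
        else:
            self.counter = max(self.counter - 1, 0)
```

The hysteresis is the point: a loop-closing branch that is taken many times and falls through once stays predicted taken, costing one misprediction per loop exit instead of two.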
A scheduler for ML accelerator computation DAGs that minimizes latency while respecting tight on-chip SRAM capacity constraints. Google MLSys 2026, Track A.
Research and industry roles spanning hardware, systems, and applied AI.
loop reordering, tiling, and multithreading
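The tiling idea can be illustrated with a cache-blocked matmul; the tile size and loop order below are one reasonable choice, not a tuned configuration:

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Cache-blocked matmul: iterate over (tile x tile) blocks so each block
    of A and B is reused while it is still hot in cache."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for k0 in range(0, K, tile):      # k outside j: the A block is reused
            for j0 in range(0, N, tile):  # across the whole row of C blocks
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C
```

Multithreading then falls out naturally: the i0 (or i0, j0) block iterations are independent and can be handed to a thread pool.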
distributed training via model partitioning
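Model partitioning in its simplest form assigns contiguous layer ranges to devices; a toy sketch under that assumption (helper names are hypothetical, and a real implementation would overlap communication with compute):

```python
def partition_layers(layers, n_parts):
    """Split a list of layers into n contiguous stages, one per device,
    balancing the number of layers per stage."""
    k, r = divmod(len(layers), n_parts)
    stages, start = [], 0
    for i in range(n_parts):
        size = k + (1 if i < r else 0)  # first r stages take one extra layer
        stages.append(layers[start:start + size])
        start += size
    return stages

def forward(stages, x):
    """Run activations through each stage in turn, as if each stage lived on
    a different device and activations were shipped between them."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x
```

In practice stage boundaries are chosen to balance compute time rather than layer count, and microbatches are pipelined through the stages to keep every device busy.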
Reach me at robertzhang930@gmail.com for business inquiries or just to say hi. Connect with me on the platforms below.