AI’s Future Is HPC’s Present (With Better Marketing)
Disclaimer: These are loosely connected thoughts that have been floating around in my head for a while. They're not fully formed or rigorously argued, but they represent the start of a perspective I want to develop further. I hope this post can serve as a springboard for discussion and future writing.
Artificial Intelligence is often framed as a revolution: a radical departure from traditional computing paradigms. But in many ways, it's more of a reincarnation: a re-manifestation of the decades-long discipline of High-Performance Computing (HPC). As AI workloads, especially large-scale training and inference for foundation models, stretch hardware and software infrastructure to their limits, they increasingly resemble classic HPC workloads. The twist? They arrive at old problems from new directions, often without the rich latency-tolerant ecosystem HPC has painstakingly built.
Training: AI Embraces Bulk-Synchronous HPC
AI training, from the beginning, has leaned heavily on bulk-synchronous paradigms:
Forward pass
Backward pass
All-Reduce
Optimizer step
Each of these stages runs across dozens to thousands of accelerators in lockstep. This is essentially MPI-style SPMD (Single Program, Multiple Data) with collective synchronization. The model works well for training, where compute dominates communication and latency can be amortized across large batches and long iterations, as the sketch below illustrates.
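As a rough illustration, the canonical data-parallel training loop maps almost one-to-one onto an MPI-style collective program. Here is a minimal sketch using PyTorch's DistributedDataParallel, which issues the gradient all-reduce automatically during the backward pass; the model, batch, and loss are placeholders:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                    # SPMD: every rank runs this same script
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = DDP(torch.nn.Linear(4096, 4096).cuda())    # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 4096, device="cuda")       # stand-in for a real batch
    y = model(x)                                   # forward pass
    loss = y.pow(2).mean()                         # stand-in loss
    loss.backward()                                # backward pass; DDP all-reduces grads here
    opt.step()                                     # optimizer step, identical on every rank
    opt.zero_grad()
```

Every rank executes the same stages in lockstep, and the only cross-rank interaction is the gradient all-reduce, whose latency hides comfortably behind a long backward pass.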
But this strategy starts to falter when applied to inference workloads.
Inference: Latency Sensitivity Changes Everything
Inference flips the script. Instead of amortizing latency across large batches, systems now must respond in real time to individual user queries. That means end-to-end latency isn't just a metric; it's the metric. The user is waiting on the other end.
Now, throughput (how many tokens or queries per second the system can handle) and latency (how fast an individual request completes) are in direct tension. Jensen Huang's often-discussed Pareto curves capture this tradeoff:
Large batches = high aggregate throughput per GPU (total tokens per second served)
Small batches (or batch size 1) = high throughput per user (tokens per second experienced by each request)
This tension defines the core optimization challenge in modern inference infrastructure.
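A back-of-the-envelope model makes the tension concrete. Every constant below is an illustrative assumption rather than a measurement, but the shape of the result is the point: as the batch grows, aggregate tokens per second climbs while each individual user's token rate falls.

```python
# Toy model of batched autoregressive decoding on a single GPU.
# All constants are illustrative assumptions, not measured values.
WEIGHT_BYTES = 140e9           # e.g. ~70B parameters at FP16
MEM_BW = 3.3e12                # assumed HBM bandwidth, bytes/s
PER_SEQ_COMPUTE_S = 2e-4       # assumed extra compute per sequence per decode step

def decode_step_latency(batch_size):
    # Each step streams the weights once (shared by the whole batch),
    # plus a per-sequence compute cost that grows with batch size.
    return WEIGHT_BYTES / MEM_BW + batch_size * PER_SEQ_COMPUTE_S

for b in (1, 8, 64, 256):
    step = decode_step_latency(b)
    print(f"batch={b:4d}  GPU throughput={b / step:8.1f} tok/s  "
          f"per-user rate={1 / step:6.1f} tok/s")
```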
HPC Has Been Here Before
Unlike AI, HPC has been dealing with long latencies for decades:
Memory latencies that stretch to hundreds of cycles
Network latencies measured in tens of microseconds
In response, HPC developed a wide range of strategies for tolerating or hiding latency:
SHMEM-style one-sided communications
Fork-join structured concurrency
Asynchronous execution models
Device-side signal/wait and polling (e.g. Lamport clocks, active messaging)
These tools allowed forward progress even when data dependencies spanned long-distance memory or network hops.
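A simplified taste of that style, expressed in today's AI stack: torch.distributed can launch a collective with async_op=True and let independent computation proceed while the reduction is in flight, blocking only when the result is actually consumed. Shapes and workloads here are arbitrary placeholders.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grads = torch.randn(8192, 8192, device="cuda")
activations = torch.randn(8192, 8192, device="cuda")

# Kick off the collective without blocking the caller.
work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

# Do useful, independent work while the reduction is in flight.
local = activations @ activations.T

# Only block when the reduced result is actually needed.
work.wait()
update = grads * 0.01 + local.mean()
```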
HPC systems of the past also operated under harsher hardware constraints. The compute-to-memory frequency ratio was often far more imbalanced than it is today. CPUs frequently ran at many times the frequency of attached DRAM, resulting in deep stalls on cache misses. Today, memory technologies like DDR5 and HBM have narrowed this gap significantly, sometimes approaching CPU clock rates and providing much higher bandwidth.
Moreover, early HPC lacked the rich ecosystem of libraries and frameworks that today's AI benefits from. There was no cuBLAS, NCCL, or PyTorch; developers had to write bespoke kernels and communication patterns from scratch. Latency tolerance wasn't just a matter of performance optimization—it was survival. That constraint led to the development of techniques that still hold valuable lessons for AI infrastructure today. They emerged not from theory, but from necessity.
Human perception introduces an additional wrinkle. As noted by usability expert Jakob Nielsen, users perceive interactions as "instant" if they occur within 100 milliseconds. If a response takes longer than 1 second, the user notices the delay. Beyond about 10 seconds, attention begins to drift. And responses in the awkward middle range (long enough to feel slow but not long enough to justify leaving and coming back) create a sense of friction and user frustration. For autoregressive AI systems, which may need to stream generated tokens in real time, staying under that perceptual threshold can be the difference between delight and disuse.
Tensor Parallelism: Local HPC
Interestingly, modern AI training techniques like tensor parallelism already recapitulate some of these issues, but in a limited scope. Communication between GPUs over NVLink or PCIe introduces just enough latency (single-digit microseconds) to force consideration of overlapping data movement with computation. Yet the domain is small, often a single node or chassis, at least for now.
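To make the "local HPC" analogy concrete, here is a minimal sketch of a row-parallel linear layer, the basic building block of tensor parallelism: each rank multiplies its shard of the input by its shard of the weight, and a single all-reduce over the small NVLink-local group combines the partial results. Group setup and shapes are assumptions for illustration.

```python
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard, w_shard, group=None):
    """Each rank computes a partial product from its weight shard; the partial
    outputs are summed across the (small, NVLink-local) tensor-parallel group.
    That all-reduce is the single-digit-microsecond hop discussed above."""
    partial = x_shard @ w_shard
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=group)
    return partial

# Illustrative usage (assumes init_process_group and device setup already done):
# tp_group = dist.new_group(ranks=[0, 1, 2, 3])      # e.g. four GPUs in one node
# x_shard = torch.randn(16, 2048, device="cuda")     # this rank's slice of the input
# w_shard = torch.randn(2048, 8192, device="cuda")   # the matching rows of the weight
# y = row_parallel_linear(x_shard, w_shard, group=tp_group)
```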
Inference, on the other hand, must eventually scale beyond these local domains.
Emerging Signs: Latency-Tolerant AI Inference
Large monolithic models, such as those used for autoregressive generation, present particular challenges for latency-sensitive inference. Despite the emergence of sparse and modular architectures like Mixture-of-Experts (MoE), model scaling is far from over. Most leading model developers are actively building or experimenting with trillion-parameter models, and even sparsely activated models still require enormous memory footprints for storing expert weights and routing logic. These growing demands continue to stress communication bandwidth, memory hierarchies, and scheduling systems alike.
These models tend to require coordinated access to many parameters across large memory spaces and multiple devices, making it difficult to return outputs quickly when operating in a distributed environment. Their size and sequential nature often make it hard to break up work in a way that allows for fine-grained scheduling or overlapping of communication and computation. As a result, achieving low end-to-end latency while maintaining high throughput is a significant engineering challenge, especially when serving models with tens or hundreds of billions of parameters.
Recent research and open-source systems are starting to move AI inference closer to HPC-style latency tolerance:
Expert parallelism (best exemplified by the ongoing DeepSeek frenzy) relies on point-to-point communication rather than collective all-reduce for model partitioning, enabling different experts to make progress independently.
A2A (All-to-All) patterns in MoE inference are now implemented with non-blocking communications to allow partial forward progress (see the sketch after this list).
Fused kernels are emerging that perform both communication and computation, initiated and coordinated directly by the GPU, reducing host round-trips and improving data-flow overlap.
These are first steps toward more dynamic, latency-tolerant execution environments.
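For a flavor of what that non-blocking A2A dispatch can look like, here is a sketch built on torch.distributed.all_to_all_single with async_op=True; the token counts, expert weights, and overlap pattern are placeholders chosen for illustration.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
world = dist.get_world_size()
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Tokens this rank routes to experts on every rank (placeholder data, even splits).
tokens_out = torch.randn(world * 16, 1024, device="cuda")
tokens_in = torch.empty_like(tokens_out)

# Non-blocking all-to-all: start dispatching tokens without stalling the rank.
work = dist.all_to_all_single(tokens_in, tokens_out, async_op=True)

# Overlap: run the local expert on the tokens that never leave this rank.
w_local = torch.randn(1024, 1024, device="cuda")
local_out = torch.relu(tokens_out[:16] @ w_local)

work.wait()                                   # block only when remote tokens are needed
remote_out = torch.relu(tokens_in @ w_local)
```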
The Next Step: Asynchrony and Graph-Based Execution
Inference frameworks like vLLM and upcoming PyTorch extensions will need to evolve:
Move away from bulk-synchronous all-reduce patterns
Embrace asynchronous, graph-driven execution
Allow partial execution and early consumption of results
Let device kernels initiate communication directly
In short, they must start to resemble mini operating systems for large, latency-sensitive, multi-GPU workloads.
A key part of this evolution involves fine-grained, data-driven communication and scheduling. Accelerators and CPUs need the ability to consume and act on data as soon as it's available, without waiting for global synchronization points. This can be enabled through asynchronous data-flow graphs that dynamically schedule computation based on the readiness of inputs. Rather than rigid, bulk-synchronous kernels, execution becomes a fluid, event-driven system where partial results can propagate forward and initiate dependent computations early. This kind of approach not only improves latency but also leads to better utilization and resilience under uneven or sparse workloads.
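One way to picture that kind of execution is a toy data-flow scheduler in which each node fires as soon as its inputs are ready, with no global barrier. The sketch below is purely conceptual (plain asyncio, with sleeps standing in for kernels and transfers); the node names are invented for illustration.

```python
import asyncio
import random

async def node(name, fn, *inputs):
    """Run as soon as all upstream results are ready; no global barrier."""
    args = await asyncio.gather(*inputs)              # data-driven readiness
    await asyncio.sleep(random.uniform(0.01, 0.05))   # stand-in for a kernel or transfer
    result = fn(*args)
    print(f"{name:10s} -> {result}")
    return result

async def main():
    # A tiny graph: two independent producers, one combiner, and a consumer
    # that acts on a partial result without waiting for the rest of the graph.
    attn = asyncio.create_task(node("attention", lambda: 3))
    moe = asyncio.create_task(node("moe_expert", lambda: 4))
    early = asyncio.create_task(node("early_use", lambda a: a * 10, attn))
    combine = asyncio.create_task(node("combine", lambda a, b: a + b, attn, moe))
    await asyncio.gather(early, combine)

asyncio.run(main())
```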
Hardware and Software Opportunity
There is massive opportunity in this space:
New hardware designs that support device-initiated communication and fine-grained scheduling
Runtime systems that expose structured concurrency and non-blocking progress
Programming models that allow expressing latency-tolerant logic naturally
Just as SHMEM, UPC, and GASNet evolved to address latency in scientific computing, inference-serving platforms will need new tools and abstractions.
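For flavor, the one-sided idiom those libraries pioneered looks roughly like this when expressed with mpi4py's RMA windows. This is a sketch of the idiom under assumed ranks and buffer sizes, not production code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a buffer that remote ranks may write into directly.
local = np.zeros(8, dtype="d")
win = MPI.Win.Create(local, comm=comm)

win.Fence()                            # open an access epoch
if rank == 0:
    payload = np.arange(8, dtype="d")
    win.Put(payload, 1)                # one-sided: rank 1 never posts a receive
win.Fence()                            # close the epoch; data is now visible on rank 1

if rank == 1:
    print("rank 1 received:", local)

win.Free()
```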
We're already seeing the future hinted at in new systems architectures. Platforms like NVIDIA's Kyber rack aim to create very large, tightly interconnected NVLink domains, moving beyond the single-node constraint that has traditionally bounded high-bandwidth communication. These architectures promise more seamless data movement and coordination across dozens of accelerators.
Simultaneously, there's a trend toward tighter CPU-accelerator integration. Examples include NVIDIA Grace, AmpereOne, Tenstorrent's hybrid designs, and other custom SoCs that integrate general-purpose CPUs with high-performance accelerators on the same memory fabric. These tightly coupled architectures reduce communication overhead and make it easier to implement fine-grained, latency-sensitive scheduling. If the CPU and GPU can both operate over shared memory and trigger work on each other with low overhead, a new level of responsiveness becomes possible.
A Contrast: Google-Scale Compute
Interestingly, not all compute-intensive workloads embrace HPC-style latency tolerance. Consider large-scale web services like Google Search:
These rely on massive scale-out compute
But concurrency is often at the request level, not intra-request
Synchronization can be hidden behind service boundaries and queues
This model works because the tasks are more decomposable. AI inference, especially autoregressive generation, is fundamentally sequential.
Conclusion
AI is becoming a re-manifestation of HPC out of necessity. Inference is where the demands truly diverge from the bulk-synchronous training recipe, pushing AI infrastructure toward a richer set of techniques long practiced in HPC. Those who understand both domains will be uniquely positioned to shape the next generation of AI platforms, where latency isn't just tolerated; it's embraced.
Looking forward, we can draw inspiration from historical HPC systems like IBM's BlueGene, which used custom interconnects and hardware-aware software models to push the limits of scalability and efficiency. AI inference may soon benefit from similarly bespoke networking layers, optimized for low-latency point-to-point communication rather than bulk collective operations. Likewise, there's a growing case for integrating fixed-function hardware elements (token schedulers, routing engines, or even hardware-accelerated attention mechanisms) to better match the execution model of large-scale inference.
In a world where computation and communication are increasingly entangled, the boundaries between compiler, runtime, and hardware will blur. The most successful AI systems will be co-designed from the ground up, embracing asynchronous, latency-tolerant paradigms. This is not a departure from HPC. It's the continuation of its most fundamental principles, updated for the neural age.