Trends shaping the future of AI infrastructure


The PARK Stack Is Becoming the Standard for Production AI

In a previous article, I argued that the open-source project Ray has become the compute substrate many modern AI platforms are standardizing on — bridging model development, data pipelines, training, and serving without locking into a single vendor. Ray Summit is my favorite venue for pressure-testing that thesis because it’s where infrastructure and platform teams show real systems, real constraints, and the trade-offs they’re making: how they’re scheduling scarce GPUs, wiring multimodal data flows, hardening reliability on flaky hardware, and speeding the post-training loop that now drives most gains. This year’s event was no exception, providing a clear signal of the key patterns shaping the next generation of AI systems. What follows is a synthesis of those observations, covering critical shifts in how teams are handling models, data, and workloads; managing scarce resources like GPUs; and building reliable, production-grade operations on a unified compute fabric.


Models, Data & Workloads

Distributed inference replaces “one-GPU serving”. Serving large and mixture-of-experts models is now a distributed systems problem. This new standard of “distributed inference” involves intricate orchestration: splitting computation between prompt processing (prefill) and token generation (decode), routing tokens to the right experts spread across different GPUs, and managing the transfer of key-value caches between nodes. This complexity is now the baseline for deploying frontier models in production.

  • Ray tie-in. Ray’s core actor model gives precise control over where each part of a model runs and how those parts communicate across separate hardware. Joint work with the vLLM community enables advanced routing and parallelism.
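
To make the pattern concrete, here is a minimal Ray sketch of the prefill/decode split: two GPU actors, one handling prompt processing and one handling token generation, with a placeholder KV cache handed from one to the other. The class and method names are illustrative, not the vLLM integration itself, and you would drop the `num_gpus` requests to try it on a CPU-only machine.

```python
import ray

ray.init()

@ray.remote(num_gpus=1)
class PrefillWorker:
    def prefill(self, prompt: str) -> dict:
        # Stand-in for running the prompt through the model; a real system
        # would produce GPU-resident KV tensors rather than a dict.
        return {"prompt": prompt, "kv_cache": f"kv({len(prompt.split())} tokens)"}

@ray.remote(num_gpus=1)
class DecodeWorker:
    def decode(self, kv_state: dict, max_tokens: int = 16) -> str:
        # Stand-in for autoregressive generation from the transferred KV cache.
        return f"<{max_tokens} tokens generated for: {kv_state['prompt']}>"

prefill = PrefillWorker.remote()
decode = DecodeWorker.remote()

# The ObjectRef returned by prefill is passed straight to decode; Ray moves
# the data between the two actors (and their GPUs/nodes) for us.
kv_ref = prefill.prefill.remote("Explain disaggregated prefill and decode.")
print(ray.get(decode.decode.remote(kv_ref)))
```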

Post-training and reinforcement learning take center stage. The biggest improvements now come after pre-training: alignment, fine-tuning, and reinforcement learning that turns evaluation signals into model updates. For instance, the agentic coding platform Cursor uses reinforcement learning as a core part of its stack to refine its models, while Physical Intelligence applies RL to train generalist policies for robotics. For AI teams, the work is reward modeling, data curation from live traffic, and iterating many small variants quickly — not just more pre-training compute.

  • Ray tie-in. Ray was originally developed to manage the complex and dynamic compute patterns inherent to reinforcement learning. As RL becomes a central component in the post-training pipeline for foundation models, Ray is uniquely suited to manage these workloads. Its actor model effectively coordinates the distinct processes of data generation, reward modeling, and model updates. Consequently, nearly every major open-source post-training framework is built on Ray.
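
As a rough illustration of why the actor model fits, the sketch below wires rollout generation, reward scoring, and policy updates together as Ray actors. The classes and the loop are hypothetical stand-ins, not the API of any particular post-training framework.

```python
import random
import ray

ray.init()

@ray.remote
class RolloutWorker:
    def generate(self, policy_version: int) -> list:
        # Stand-in for sampling prompts and running the current policy.
        return [{"prompt": f"p{i}", "response": f"r{i}", "policy": policy_version}
                for i in range(4)]

@ray.remote
class RewardModel:
    def score(self, trajectories: list) -> list:
        # Stand-in for a learned reward model or programmatic evaluator.
        return [dict(t, reward=random.random()) for t in trajectories]

@ray.remote
class Learner:
    def __init__(self):
        self.version = 0

    def update(self, scored_batches: list) -> int:
        # Stand-in for a PPO/GRPO-style gradient step on scored trajectories.
        self.version += 1
        return self.version

workers = [RolloutWorker.remote() for _ in range(2)]
reward_model = RewardModel.remote()
learner = Learner.remote()

version = 0
for _ in range(3):  # a few post-training iterations
    rollouts = [w.generate.remote(version) for w in workers]
    scored = ray.get([reward_model.score.remote(r) for r in rollouts])
    version = ray.get(learner.update.remote(scored))
print(f"finished at policy version {version}")
```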

Serving frontier models is now a distributed systems problem.

Multimodal data engineering becomes first-class. AI data pipelines are rapidly evolving beyond text-only workloads to process a diverse and massive mix of data types, including images, video, audio, and sensor data. This transition makes the initial data processing stage significantly more complex, as it often requires a combination of CPUs for general transformations and GPUs for specialized tasks like generating embeddings. This means data processing is no longer a simple, CPU-based ETL task but a sophisticated, heterogeneous distributed computing problem in its own right.

  • Ray tie-in. Ray is positioned as the compute engine for these demanding multimodal workloads. Its ability to dynamically orchestrate tasks across a heterogeneous cluster of CPUs and GPUs is essential for building efficient data pipelines. The Ray Data library has been specifically enhanced to handle large tensors and diverse data formats.
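
A hedged sketch of what such a pipeline can look like with Ray Data, assuming a recent Ray version: CPU tasks handle lightweight preprocessing while an actor pool holds an embedding model on GPUs. The bucket paths, the crop, and the embedding itself are placeholders.

```python
import numpy as np
import ray

# CPU stage: lightweight per-batch preprocessing (placeholder center crop).
def crop(batch: dict) -> dict:
    batch["image"] = [img[:224, :224] for img in batch["image"]]
    return batch

# GPU stage: an actor pool keeps the embedding model in GPU memory.
class Embedder:
    def __init__(self):
        # In practice: load a vision encoder onto the GPU here.
        pass

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = [np.random.rand(512) for _ in batch["image"]]  # placeholder
        return batch

ds = ray.data.read_images("s3://my-bucket/images/")        # hypothetical bucket

embeddings = (
    ds.map_batches(crop, batch_size=64)                    # scheduled on CPUs
      .map_batches(Embedder, concurrency=4, num_gpus=1,    # scheduled on GPU actors
                   batch_size=64)
)
embeddings.write_parquet("s3://my-bucket/embeddings/")     # hypothetical bucket
```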

Agentic workflows and continuous loops. Applications are shifting from single calls to systems that plan, invoke tools/models, check results, and learn from feedback — continuously. These loops span data collection, post-training, deployment, and evaluation. For enterprises, building agentic applications means infrastructure must support coordinating long-running workflows across these stages rather than just running isolated training jobs or inference endpoints. The benefit is faster product learning cycles, not a single “perfect” model.

  • Ray tie-in. Ray’s actor model supports long-lived agents; tasks and queues coordinate tool use and evals; and the same cluster runs data prep, training, and serving so teams don’t glue together multiple platforms.
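
Here is a minimal sketch of that pattern with Ray actors: a long-lived agent that keeps state across steps and records a transcript for later evaluation. The tool and model calls are stubbed out, and the class is hypothetical.

```python
import ray

ray.init()

@ray.remote
class Agent:
    """A long-lived agent actor; its state persists across calls."""

    def __init__(self):
        self.history = []

    def step(self, task: str) -> str:
        plan = f"plan for: {task}"            # stand-in for a model call
        result = f"tool output for: {plan}"   # stand-in for a tool invocation
        self.history.append({"task": task, "result": result})
        return result

    def transcript(self) -> list:
        # The recorded trajectory can feed evaluation and post-training jobs
        # running on the same cluster.
        return self.history

agent = Agent.remote()
for task in ["summarize yesterday's error logs", "draft a status update"]:
    print(ray.get(agent.step.remote(task)))
print(ray.get(agent.transcript.remote()))
```
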
Resource Management & Cloud Strategy

Global GPU scheduling and cost control. GPU capacity is too valuable to sit idle. Statically partitioning a fixed pool of GPUs among competing teams and workloads — such as production inference, research training, and batch processing — is highly inefficient. AI teams report materially higher utilization, lower costs, and faster developer startup times by using a policy-driven scheduler that can preempt low-priority jobs during traffic spikes and resume them later. The business outcome is straightforward: more capacity pointed at the most valuable work, less waste, and fewer blocked projects.

  • Ray tie-in. Anyscale’s platform addresses this with a global resource scheduler built on Ray. This scheduler provides a centralized, workload-aware system for managing constrained resources across an entire organization. It operates across all Ray clusters in an organization, understanding workloads, reservations, and priorities to make allocation decisions. 
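
The global scheduler itself is an Anyscale platform feature, but the primitives it builds on are visible in open-source Ray: workloads declare what they need, and fractional GPU requests let lower-priority work pack onto shared devices. A hedged sketch:

```python
import ray

ray.init()

@ray.remote(num_gpus=1)
def training_step(shard: int) -> str:
    # High-priority work gets whole devices.
    return f"trained shard {shard} on a dedicated GPU"

@ray.remote(num_gpus=0.25)
def batch_inference(doc_id: int) -> str:
    # Lower-priority work can pack four tasks onto a single GPU.
    return f"scored doc {doc_id} on a quarter of a GPU"

print(ray.get([training_step.remote(0)] +
              [batch_inference.remote(i) for i in range(4)]))
```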

Cloud-native and multi-cloud, without lock-in. GPU scarcity is driving enterprises to multi-cloud and multi-provider strategies. Rather than relying on a single cloud provider’s GPU availability, companies are distributing workloads across AWS, Google Cloud, Azure, and specialized GPU clouds like CoreWeave and Lambda Labs. This approach addresses both availability (accessing capacity wherever it exists) and negotiating leverage (avoiding single-vendor lock-in for expensive resources). However, multi-cloud introduces complexity: different APIs, networking configurations, and operational tooling across providers. 

  • Ray tie-in. Ray/Anyscale provides a common runtime across AWS, GCP, Azure, and specialty GPU clouds. The same Ray code runs everywhere; the platform layer handles identities, networking, storage, and scheduling so teams can chase capacity without rebuilding systems.
Source: ClickPy; Ray is in the top 1% of all projects based on PyPI downloads.

Operations & Reliability

Evaluation-driven operations for non-deterministic systems. Developing AI products is fundamentally different from traditional software engineering. Unlike deterministic code, AI models are non-deterministic systems whose behavior can drift in production. This reality invalidates the traditional “perfect and ship” development model. The teams that win run continuous evaluations tied to product metrics and feed results into post-training. Iteration speed — collect, retrain, redeploy, re-measure — is a moat.

  • Ray tie-in. Ray hosts the full loop on one substrate: data collection, eval jobs, training runs, and rollouts reuse the same primitives. The platform can run long-lived evaluation workloads alongside training and serving, with shared access to models and data. Ray actors maintain state across evaluation runs, enabling sophisticated monitoring patterns.
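
One way to express that loop on Ray, as a hedged sketch: a long-lived evaluation actor scores each new model version against a fixed eval set and keeps per-version results in memory, so regressions are visible run over run. The evaluator, the eval set, and the scoring are placeholders.

```python
import random
import ray

ray.init()

@ray.remote
class Evaluator:
    def __init__(self, eval_set: list):
        self.eval_set = eval_set
        self.results = {}                      # state survives across eval runs

    def evaluate(self, model_version: str) -> float:
        # Stand-in for querying the deployed model and scoring its answers
        # against the eval set (exact match, rubric, LLM-as-judge, etc.).
        score = random.random()
        self.results[model_version] = score
        return score

    def history(self) -> dict:
        # Feed this into dashboards, alerts, or the next post-training run.
        return self.results

evaluator = Evaluator.remote(["q1", "q2", "q3"])
for version in ["v1", "v2"]:
    print(version, ray.get(evaluator.evaluate.remote(version)))
print(ray.get(evaluator.history.remote()))
```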

Reliability at scale on unreliable hardware. Operating AI infrastructure at scale means designing for failure. Long-running training jobs, which can last for weeks, must be resilient to hardware faults to avoid losing progress. This reality requires that production systems incorporate robust fault tolerance, including automatic retries, job checkpointing, and graceful handling of worker failures, to ensure that long jobs and always-on services can continue uninterrupted.

  • Ray tie-in. Ray has made significant investments in reliability and fault tolerance. Its internal state management system has been re-architected for high availability, and system processes are now better isolated from application resource pressure to prevent instability. Ray’s support for checkpointing is critical for long-running training jobs, enabling them to be paused and resumed seamlessly, which is essential when using preemptible spot instances.
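
A minimal sketch of the pattern, assuming a hypothetical checkpoint location: Ray restarts the actor on failure (`max_restarts`), retries its in-flight calls (`max_task_retries`), and the actor resumes from its last checkpoint instead of starting over. In production the checkpoint would go to shared or object storage so a restart on a different node can still resume.

```python
import os
import pickle
import ray

ray.init()

CKPT = "/tmp/trainer_ckpt.pkl"   # hypothetical; use shared/object storage in production

@ray.remote(max_restarts=3, max_task_retries=2)   # restart the actor, retry its calls
class Trainer:
    def __init__(self):
        self.step = 0
        if os.path.exists(CKPT):                   # resume from the last checkpoint
            with open(CKPT, "rb") as f:
                self.step = pickle.load(f)

    def train_steps(self, n: int) -> int:
        for _ in range(n):
            self.step += 1                         # stand-in for a real training step
            if self.step % 10 == 0:                # checkpoint periodically
                with open(CKPT, "wb") as f:
                    pickle.dump(self.step, f)
        return self.step

trainer = Trainer.remote()
print(ray.get(trainer.train_steps.remote(25)))
```
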
The PARK Stack: The LAMP Stack for the AI Era

Infrastructure & Compute Fabric

Heterogeneous clusters are the baseline. CPU-only data prep and single-GPU serving are obsolete. Pipelines blend CPUs (parsing, aggregation) with GPUs (embeddings, vision/audio transforms) across many nodes. 

  • Ray tie-in. Ray was designed to handle dynamic orchestration across heterogeneous hardware. Its core architecture allows developers to declaratively specify the resource requirements for each task, and the runtime handles the complex scheduling and placement across the available CPUs and GPUs. This native support for heterogeneous clusters is a primary reason for its growing adoption.
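
A minimal sketch of that declarative style in Ray core: each task states its CPU, GPU, or custom-resource needs, and the scheduler places it on a matching node. The `video_decoder` resource label and the S3 path are hypothetical.

```python
import ray

ray.init()

@ray.remote(num_cpus=2)
def parse(path: str) -> dict:
    return {"path": path, "rows": 1000}            # stand-in for CPU-bound parsing

@ray.remote(num_gpus=1)
def embed(record: dict) -> dict:
    record["embedding_dim"] = 512                  # stand-in for a GPU embedding pass
    return record

@ray.remote(resources={"video_decoder": 1})        # hypothetical custom resource label
def decode_video(path: str) -> str:
    return f"decoded {path}"

# Chaining tasks: Ray resolves the intermediate result and places each stage
# on a node that satisfies its declared resources.
print(ray.get(embed.remote(parse.remote("s3://bucket/part-0001.parquet"))))
```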

Accelerators plus fast interconnects determine throughput. Purpose-built AI data centers with specialized accelerators connected via high-speed networking technologies are becoming standard infrastructure, fundamentally changing how compute resources must be managed. This represents a shift from general-purpose cloud computing to specialized infrastructure where the interconnect between accelerators is as critical as the accelerators themselves. 

  • Ray tie-in. Ray Direct Transport enables direct GPU-to-GPU transfers between actors with a minimal code change, improving utilization for RL, distributed inference, and multimodal training without rewriting applications. By providing native support for RDMA and high-speed interconnects, Ray allows applications to fully utilize the bandwidth available in modern AI data centers. 

The PARK Stack. A standard stack is coalescing into clear, co-evolving layers with active collaboration at the seams: a container orchestrator like Kubernetes for provisioning resources; a distributed compute engine like Ray for scaling applications and handling systems challenges like fault tolerance; AI, the foundation models themselves; and a high-level framework like PyTorch for model development and refinement. Read together, the layers spell PARK: PyTorch, AI, Ray, Kubernetes.

  • Ray tie-in. Ray is positioned as the compute engine in this platform: it unifies data processing, training and post-training, and distributed inference into one operational substrate and plugs into model stacks and Kubernetes. The move to join the PyTorch Foundation signals tighter, community-led integration with the training/serving ecosystem. Ray’s maintainers co-develop features with adjacent projects (e.g., vLLM for serving, Kubernetes for autoscaling/isolation, PyTorch for training).
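
To show the PyTorch/Ray seam concretely, here is a toy Ray Train sketch: `TorchTrainer` launches the same PyTorch training loop across multiple workers, while Kubernetes (via KubeRay) would provision the cluster underneath. The model and data are placeholders; set `use_gpu=True` on GPU nodes.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config: dict):
    # Each Ray Train worker runs this loop; prepare_model wraps the toy model
    # for data-parallel training on whatever device the worker owns.
    model = ray.train.torch.prepare_model(nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(config["epochs"]):
        x, y = torch.randn(32, 8), torch.randn(32, 1)   # placeholder data
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ray.train.report({"loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),  # True on GPU nodes
)
trainer.fit()
```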



