Cisco Research
A scalable, fault-tolerant extension of collective communication libraries (CCLs) — designed for the next generation of ML inference.
MultiWorld is a research framework developed by Cisco Research that extends traditional collective communication libraries (CCLs) like NCCL to support elasticity and fault tolerance — two essential properties for scalable model serving. While not a model-serving framework itself, MultiWorld enables model-serving systems to dynamically scale and remain resilient in the face of failures.
Modern machine learning workloads, especially inference for large-scale models, have unique demands that traditional CCLs were not designed to meet: above all, elasticity (scaling workers in and out on demand) and fault tolerance (surviving individual worker failures).

Figure: (a) model partitioning, (b) normal state, (c) worker failure, (d) failure recovery.
The figure above gives a conceptual overview of how MultiWorld enables elastic and resilient model serving through flexible deployment and dynamic communication. In (a), a machine learning model is decomposed into multiple partitions, each representing a stage in the inference pipeline (MultiWorld assumes a model partitioner already exists; partitioning itself is out of scope). In (b), these partitions are deployed across multiple worker processes, each handling a different part of the computation. For example, the middle partition is deployed redundantly across two workers to balance load and increase throughput.
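As a concrete illustration, the deployment in (b) can be written down as a plain stage-to-worker mapping plus the edges that connect stages. The structure below is a hypothetical sketch for illustration only; it is not MultiWorld's actual configuration format:

```python
# Hypothetical deployment plan for a three-stage inference pipeline.
# The names and schema are illustrative, not MultiWorld's real
# configuration format. Stage 2 is replicated across two workers to
# relieve a throughput bottleneck, as in panel (b) of the figure.
deployment = {
    "stage1": ["worker0"],             # first model partition
    "stage2": ["worker1", "worker2"],  # replicated middle partition
    "stage3": ["worker3"],             # final model partition
}

# Each directed edge between connected workers becomes its own "world"
# (logical process group), giving data two independent paths through
# the replicated middle stage.
edges = [
    ("worker0", "worker1"),
    ("worker0", "worker2"),
    ("worker1", "worker3"),
    ("worker2", "worker3"),
]
```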
To facilitate communication among these distributed workers, MultiWorld forms logical process groups, known as "worlds," between pairs of connected workers. Each world operates independently, allowing the system to route data through multiple execution paths and better tolerate imbalances or failures (as shown in (c)). This fine-grained mapping of model stages to workers, and workers to process groups, forms the backbone of MultiWorld’s support for horizontal scaling and fault isolation.
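For readers who know PyTorch, a "world" is loosely analogous to a process group created with `torch.distributed.new_group`. The sketch below uses only stock PyTorch calls to illustrate the pairwise-group idea; note that it does not reproduce MultiWorld's fault isolation, because vanilla PyTorch carves all subgroups out of a single global world whose membership is fixed at startup:

```python
# Pairwise subgroups in vanilla PyTorch, mirroring the pipeline edges
# above. Launch with: torchrun --nproc_per_node=4 pairwise_groups.py
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # "nccl" for GPU-to-GPU

# Every rank must create every group, in the same order.
g01 = dist.new_group(ranks=[0, 1])  # stage1 -> stage2 replica A
g02 = dist.new_group(ranks=[0, 2])  # stage1 -> stage2 replica B
g13 = dist.new_group(ranks=[1, 3])  # replica A -> stage3
g23 = dist.new_group(ranks=[2, 3])  # replica B -> stage3

# The limitation MultiWorld targets: these subgroups all live inside one
# global world, so a single failed rank poisons every group. MultiWorld
# instead manages each world independently, so worlds can be created,
# torn down, and recovered at runtime.
```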
By decoupling model partitioning from physical deployment and introducing multiple communication groups, MultiWorld enables flexible, robust serving pipelines that can scale out bottlenecked stages and recover from failures without interrupting the broader service (as shown in (d)).
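The recovery behavior in (d) amounts to a send-with-fallback pattern: when the world leading to one stage-2 replica breaks, the sender reroutes the activation through the world leading to the other replica. The helper below is a minimal sketch of that idea; the world handles and their `send` method are hypothetical names, not MultiWorld's API:

```python
import torch

def send_with_failover(tensor: torch.Tensor, worlds: list) -> None:
    """Try each world (execution path) in turn until one send succeeds.

    `worlds` is an ordered list of hypothetical world handles, each
    exposing a blocking send(tensor) that raises RuntimeError on a
    communication failure. This mirrors panel (d): traffic to a failed
    stage-2 replica is rerouted through the surviving replica.
    """
    last_error = None
    for world in worlds:
        try:
            world.send(tensor)  # illustrative method, not MultiWorld API
            return
        except RuntimeError as err:  # broken communicator / dead peer
            last_error = err  # remember the failure, try the next path
    raise RuntimeError("all execution paths failed") from last_error
```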
MultiWorld is implemented as an extension to PyTorch's distributed infrastructure. This architecture enables robust communication patterns while preserving high GPU utilization and performance.
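To make the fault-isolation idea concrete, the toy stub below models worlds as objects with independent lifecycles, so one world can fail and be rebuilt without disturbing the others. It is a self-contained illustration under that assumption, not MultiWorld code:

```python
import torch

class ToyWorld:
    """Toy stand-in for an independently managed logical process group.

    Not MultiWorld code: it only demonstrates the shape of a design in
    which each world owns its own lifecycle, so a failure in one world
    cannot tear down the others.
    """

    def __init__(self, name: str, ranks: list[int]):
        self.name = name
        self.ranks = ranks
        self.alive = True

    def send(self, tensor: torch.Tensor, dst: int) -> None:
        if not self.alive:
            raise RuntimeError(f"world {self.name} is down")
        # A real implementation would call a point-to-point primitive
        # (e.g. torch.distributed.send) on this world's own communicator.
        print(f"[{self.name}] sent {tuple(tensor.shape)} to rank {dst}")

# Two independent worlds; each could use a different backend or peer set.
world_a = ToyWorld("world_a", ranks=[0, 1])
world_b = ToyWorld("world_b", ranks=[0, 2])

world_a.alive = False               # worker failure, as in panel (c)
world_b.send(torch.ones(4), dst=2)  # world_b keeps serving, unaffected

# Recovery: rebuild only the failed world, as in panel (d); world_b
# never stops. A single global world could not be repaired this way.
world_a = ToyWorld("world_a", ranks=[0, 3])
world_a.send(torch.ones(4), dst=3)
```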

Figure: MultiWorld architecture.
MultiWorld has been evaluated across GPU-to-GPU and host-to-host scenarios using NCCL and PyTorch. Key findings: