No Worker Left (Too Far) Behind: Dynamic Hybrid Synchronization for In‐Network ML Aggregation

Diego Cardoso Nunes, Bruno Loureiro Coelho, Ricardo Parizotto and Alberto Egon Schaeffer‐Filho

International Journal of Network Management, 2025, vol. 35, issue 1

Abstract: Achieving high‐performance aggregation is essential to scaling data‐parallel distributed machine learning (ML) training. Recent research in in‐network computing has shown that offloading the aggregation to the network data plane can accelerate the aggregation process compared to traditional server‐only approaches, reducing the propagation delay and consequently speeding up distributed training. However, the existing literature on in‐network aggregation offers no mechanism for handling slower workers (called stragglers). Stragglers can slow distributed training, increasing the time it takes to complete. In this paper, we present Serene, an in‐network aggregation system capable of circumventing the effects of stragglers. Serene coordinates the ML workers to cooperate with a programmable switch using a hybrid synchronization approach in which the synchronization strategy can be changed dynamically through a control plane API that translates high‐level code into switch rules. The Serene switch employs an efficient data structure for managing synchronization and a hot‐swapping mechanism to change consistently from one synchronization strategy to another. We implemented and evaluated a prototype using BMv2 and a Proof‐of‐Concept on a Tofino ASIC. We ran experiments with realistic ML workloads, including a neural network trained for image classification. Our results show that Serene can speed up training by up to 40% in emulation scenarios by drastically reducing the cumulative waiting time compared to a synchronous baseline.
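
To make the idea of hybrid synchronization with a consistent strategy change more concrete, here is a minimal Python sketch. It is a software toy under stated assumptions, not Serene's implementation: the class `HybridAggregator`, its `quorum` parameter, and the rule that a policy change takes effect only at a round boundary are all hypothetical names and choices made for illustration. The abstract does not specify Serene's data structure, its control plane API, or its swap protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class HybridAggregator:
    """Toy model of an aggregation point with a swappable sync policy.

    quorum == num_workers models fully synchronous aggregation;
    quorum <  num_workers lets a round close without the slowest workers.
    (Hypothetical design, not taken from the paper.)
    """
    num_workers: int
    quorum: int
    round_id: int = 0
    acc: List[float] = field(default_factory=list, init=False)
    seen: Set[int] = field(default_factory=set, init=False)
    pending_quorum: Optional[int] = field(default=None, init=False)

    def set_policy(self, quorum: int) -> None:
        """Request a new quorum; defer it if a round is already in progress."""
        if not 1 <= quorum <= self.num_workers:
            raise ValueError("quorum must be between 1 and num_workers")
        if self.seen:
            self.pending_quorum = quorum   # applied when the current round closes
        else:
            self.quorum = quorum

    def submit(self, worker: int, round_id: int, grad: List[float]) -> Optional[List[float]]:
        """Add one contribution; return the summed vector when the round closes."""
        if round_id != self.round_id or worker in self.seen:
            return None                    # late (straggler) or duplicate: dropped
        if not self.acc:
            self.acc = [0.0] * len(grad)
        self.acc = [a + g for a, g in zip(self.acc, grad)]
        self.seen.add(worker)
        if len(self.seen) < self.quorum:
            return None                    # keep waiting for more workers
        result = self.acc
        self.acc, self.seen = [], set()
        self.round_id += 1
        if self.pending_quorum is not None:
            self.quorum, self.pending_quorum = self.pending_quorum, None
        return result
```

In this sketch, an aggregator created with `quorum=3` for three workers behaves synchronously. Calling `set_policy(2)` in the middle of a round leaves that round untouched, and from the next round on a result is returned as soon as two workers have contributed, with the third worker's late contribution ignored. Deferring the change to the round boundary is one simple way to keep all workers in a round under the same rule.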

Date: 2025

DOI: https://doi.org/10.1002/nem.2290



Persistent link: https://EconPapers.repec.org/RePEc:wly:intnem:v:35:y:2025:i:1:n:e2290


More articles in International Journal of Network Management from John Wiley & Sons

 
Page updated 2025-03-20
Handle: RePEc:wly:intnem:v:35:y:2025:i:1:n:e2290