Study On Optical Communications, Vol. 50, Issue 5, 24004901 (2024)
Application of Reconfigurable OCS Technology for Pre-training Large Language Models
Compared with Electronic Packet Switching (EPS), Optical Circuit Switching (OCS) offers advantages in latency, power consumption, cost, and stability. To take full advantage of these benefits, this study explores how OCS can be applied to the networking of training tasks by analyzing the parallel partitioning strategies, collective communication requirements, traffic patterns, and current network architectures of large-model pre-training.
We propose a redundancy-protection mechanism for network devices that uses multiple small-port OCS devices and allows rapid switchover, without interrupting the training task, when a Top-of-Rack (ToR) switch fails. We also advocate dedicating OCS exclusively to data-parallel traffic, so that the optical circuits need to be configured only once, at the start of the task.
We present several feasible hybrid opto-electronic network architectures, with specific configurations for different AllReduce algorithms, and jointly optimize the collective communication algorithm and the architecture design to achieve optimal bandwidth.
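As a rough illustration of why a data-parallel AllReduce can run over circuits that are set once per task, the sketch below takes the well-known ring AllReduce as an example. It is not the configuration proposed in the paper: the function names are hypothetical, and it assumes one data-parallel rank per OCS port. In a ring, each rank only ever sends to its successor, so the traffic is a fixed permutation that a static set of circuits can carry, and each rank sends 2(N - 1)/N times the message size in total.

```python
from typing import List, Tuple


def ring_circuits(num_ranks: int) -> List[Tuple[int, int]]:
    """Directed (source, destination) circuits for a ring over num_ranks ports.

    Hypothetical helper: assumes rank r is attached to OCS port r.
    """
    if num_ranks < 2:
        raise ValueError("a ring needs at least two ranks")
    # Rank r only sends to rank (r + 1) mod N, so the pattern never changes
    # during the AllReduce and can be installed once.
    return [(r, (r + 1) % num_ranks) for r in range(num_ranks)]


def ring_allreduce_bytes_sent(num_ranks: int, message_bytes: int) -> float:
    """Bytes each rank sends in a ring AllReduce.

    (N - 1) reduce-scatter steps plus (N - 1) all-gather steps,
    each moving a chunk of message_bytes / N.
    """
    return 2 * (num_ranks - 1) * message_bytes / num_ranks


if __name__ == "__main__":
    print(ring_circuits(4))
    print(ring_allreduce_bytes_sent(4, 1_000_000))
```

Other AllReduce algorithms (for example tree-based or halving-doubling variants) change peers between steps, which is one reason the choice of algorithm and the circuit layout have to be considered together.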
When the traffic model of the training task is properly taken into account, OCS can be integrated into existing EPS network architectures and improve large-model pre-training in several respects: lower cost, lower power consumption, reduced latency, and enhanced stability.
Chen ZHU, Xu ZHOU, Peilong WANG. Application of Reconfigurable OCS Technology for Pre-training Large Language Models[J]. Study On Optical Communications, 2024, 50(5): 24004901
Received: Mar. 1, 2024
Published Online: Oct. 15, 2024
The Author Email: ZHU Chen (zhuchen06@baidu.com)