TY - JOUR
T1 - Online scheduling of coflows by attention-empowered scalable deep reinforcement learning
AU - Wang, Xin
AU - Shen, Hong
N1 - Publisher Copyright:
© 2023
PY - 2023/9
Y1 - 2023/9
N2 - By abstracting a group of parallel data transmission flows as a coflow, data transmissions in large-scale computing jobs can be modeled by a coflow directed acyclic graph (coflow DAG), in which nodes are coflows and edges represent dependencies between coflows. Efficient scheduling of coflows on network links is crucial for reducing the overall communication and job completion time. The best known coflow scheduling method deploying deep reinforcement learning (DRL), DeepWeave (Sun et al., 2020), suffers from poor scalability because it requires an O(dn)-size policy network to process n coflows of d feature dimensions, which is difficult to train. This paper extends the directed acyclic graph neural network (DAGNN) to a Pipelined-DAGNN that embeds the features of different stages of input coflow DAGs in a pipeline, effectively speeding up the feature extraction process. To process the feature vectors of coflow DAGs of arbitrary size and shape without compromising scheduling accuracy (quality), we propose a novel self-attention-empowered DRL coflow scheduling model to generate coflow scheduling policies; it makes the scale of the policy network depend only on the number of features (dimensions) rather than the number of coflows, without packing all individual embedding vectors from the Pipelined-DAGNN into a long flat vector. Our model reduces the size of the policy network in DRL from the previous O(dn) to O(d), achieving high scalability independent of the number of coflows. Simulation results on the Facebook trace show that our model reduces the average weighted job completion time by up to 33.88%, and is more scalable and robust, compared with state-of-the-art methods.
AB - By abstracting a group of parallel data transmission flows as a coflow, data transmissions in large-scale computing jobs can be modeled by a coflow directed acyclic graph (coflow DAG), in which nodes are coflows and edges represent dependencies between coflows. Efficient scheduling of coflows on network links is crucial for reducing the overall communication and job completion time. The best known coflow scheduling method deploying deep reinforcement learning (DRL), DeepWeave (Sun et al., 2020), suffers from poor scalability because it requires an O(dn)-size policy network to process n coflows of d feature dimensions, which is difficult to train. This paper extends the directed acyclic graph neural network (DAGNN) to a Pipelined-DAGNN that embeds the features of different stages of input coflow DAGs in a pipeline, effectively speeding up the feature extraction process. To process the feature vectors of coflow DAGs of arbitrary size and shape without compromising scheduling accuracy (quality), we propose a novel self-attention-empowered DRL coflow scheduling model to generate coflow scheduling policies; it makes the scale of the policy network depend only on the number of features (dimensions) rather than the number of coflows, without packing all individual embedding vectors from the Pipelined-DAGNN into a long flat vector. Our model reduces the size of the policy network in DRL from the previous O(dn) to O(d), achieving high scalability independent of the number of coflows. Simulation results on the Facebook trace show that our model reduces the average weighted job completion time by up to 33.88%, and is more scalable and robust, compared with state-of-the-art methods.
KW - Attention mechanism
KW - Coflow scheduling
KW - Graph neural network
KW - Parallel computing
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85158862956&partnerID=8YFLogxK
U2 - 10.1016/j.future.2023.04.020
DO - 10.1016/j.future.2023.04.020
M3 - Article
AN - SCOPUS:85158862956
SN - 0167-739X
VL - 146
SP - 195
EP - 206
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -