Online scheduling of coflows by attention-empowered scalable deep reinforcement learning

Xin Wang, Hong Shen

Research output: Contribution to journal › Article › peer-review

Abstract

By abstracting a group of parallel data transmission flows as a coflow, the data transmissions of a large-scale computing job can be modeled as a coflow directed acyclic graph (coflow DAG), in which nodes are coflows and edges represent dependencies between coflows. Efficient scheduling of coflows on network links is crucial for reducing overall communication time and job completion time. The best known coflow scheduling method based on deep reinforcement learning (DRL), DeepWeave (Sun et al., 2020), scales poorly because it requires a policy network of size O(dn) to process n coflows of d dimensions, which is difficult to train. This paper extends the directed acyclic graph neural network (DAGNN) to Pipelined-DAGNN, which embeds the features of different stages of the input coflow DAGs in a pipeline to speed up feature extraction. To process the feature vectors of coflow DAGs of arbitrary size and shape without compromising scheduling quality, we propose a novel self-attention-empowered DRL coflow scheduling model to generate coflow scheduling policies. The model makes the size of the policy network depend only on the number of features (dimensions) rather than the number of coflows, and it does not need to pack all individual embedding vectors from Pipelined-DAGNN into one long flat vector. Our model reduces the size of the DRL policy network from O(dn) to O(d), achieving scalability that is independent of the number of coflows. Simulation results on a Facebook trace show that, compared with state-of-the-art methods, our model reduces the average weighted job completion time by up to 33.88% while also being more scalable and robust.
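
To illustrate why an attention-style policy head can have a parameter count that does not grow with the number of coflows, here is a minimal sketch. It is not the authors' model: the single shared query vector, the function names `attention_policy` and `flat_policy_param_count`, and the hidden width of 16 are all assumptions made for illustration. The sketch scores each of the n coflow embeddings against one learned d-dimensional query and normalizes the scores with a softmax, so the head holds d parameters regardless of n, whereas the first dense layer of a policy fed a flattened n·d vector grows linearly with n·d.

```python
import math
import random


def attention_policy(embeddings, query):
    """Return a probability distribution over n coflows.

    embeddings: list of n per-coflow vectors, each of length d
                (stand-ins for graph-network embeddings; hypothetical)
    query:      shared vector of length d, the only parameters of this head
    """
    d = len(query)
    # Scaled dot-product score of each coflow embedding against the query.
    scores = [
        sum(e_i * q_i for e_i, q_i in zip(e, query)) / math.sqrt(d)
        for e in embeddings
    ]
    # Numerically stable softmax over the n coflows.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def flat_policy_param_count(n, d, hidden=16):
    """Weights plus biases of the first dense layer on a flattened n*d input.

    The hidden width is an assumed constant, chosen only for illustration.
    """
    return n * d * hidden + hidden


if __name__ == "__main__":
    random.seed(0)
    d = 8
    query = [random.gauss(0, 1) for _ in range(d)]
    for n in (5, 50):
        embeddings = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        probs = attention_policy(embeddings, query)
        # The attention head keeps len(query) == d parameters for every n.
        print(n, len(query), flat_policy_param_count(n, d), len(probs))
```

The same `query` is reused unchanged for 5 and for 50 coflows, which is the property the abstract describes as scalability independent of the number of coflows. A full self-attention layer would add projection matrices whose size also depends only on d.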

Original language: English
Pages (from-to): 195-206
Number of pages: 12
Journal: Future Generation Computer Systems
Volume: 146
Publication status: Published - Sept 2023

Keywords

  • Attention mechanism
  • Coflow scheduling
  • Graph neural network
  • Parallel computing
  • Reinforcement learning
