TY - GEN
T1 - Distributed Hierarchical Sentence Embeddings for Unsupervised Extractive Text Summarization
AU - Huang, Guanjie
AU - Shen, Hong
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/5/22
Y1 - 2021/5/22
N2 - Unsupervised text summarization is a promising approach that avoids the human effort of producing reference summaries, which is particularly important for large-scale datasets. To improve its performance, we propose a hierarchical BERT [1] model that combines word-level and sentence-level training processes to obtain semantically rich sentence embeddings. We use vanilla BERT for word-level training and redesign it for sentence-level training with two new training tasks, "Sentence Token Prediction" and "Local Shuffle Recovery", and a suitable input format. We first train the word-level model to obtain preliminary sentence embeddings, then feed them into the sentence-level model to extract higher-level, inter-sentence semantic information. The resulting context-sensitive sentence embeddings are then passed to the KMeans clustering algorithm, which generates summaries by extracting sentences from the document. To accelerate training of the BERT model, we adopt PipeDream [2] model parallelism, which distributes the model layers among multiple machines so that training proceeds in parallel. Finally, experimental results show that our proposed model outperforms most popular models and achieves a 2.7x speedup in training time on 4 machines.
KW - model parallelism
KW - sentence embeddings
KW - unsupervised extractive text summarization
UR - http://www.scopus.com/inward/record.url?scp=85117749390&partnerID=8YFLogxK
U2 - 10.1145/3469968.3469987
DO - 10.1145/3469968.3469987
M3 - Conference contribution
AN - SCOPUS:85117749390
T3 - ACM International Conference Proceeding Series
SP - 86
EP - 92
BT - ICBDC 2021 - 2021 6th International Conference on Big Data and Computing
PB - Association for Computing Machinery
T2 - 6th International Conference on Big Data and Computing, ICBDC 2021
Y2 - 22 May 2021 through 24 May 2021
ER -