
Corpora for document-level neural machine translation

  • Siyou Liu
  • Xiaojun Zhang
  • Xi'an Jiaotong-Liverpool University

Research output: Conference contribution › Peer-reviewed

10 Citations (Scopus)

Abstract

Instead of translating sentences in isolation, document-level machine translation aims to capture discourse dependencies across sentences by considering a document as a whole. In recent years, there has been growing interest in modelling larger context for state-of-the-art neural machine translation (NMT). Although various document-level NMT models have shown significant improvements, three main problems nonetheless remain: 1) compared with sentence-level translation tasks, the data for training robust document-level models are relatively low-resourced; 2) experiments in previous work are conducted on their own datasets, which vary in size, domain and language; 3) proposed approaches are implemented on distinct NMT architectures such as recurrent neural networks (RNNs) and self-attention networks (SANs). In this paper, we aim to alleviate the low-resource and under-universality problems for document-level NMT. First, we collect a large number of existing document-level corpora, covering 7 language pairs and 6 domains. To address resource sparsity, we construct a novel document-level parallel corpus for Chinese-Portuguese, a non-English-centred and low-resourced language pair. In addition, we implement and evaluate the commonly cited document-level method on top of the advanced Transformer model with universal settings. Finally, we not only demonstrate the effectiveness and universality of document-level NMT, but also release the preprocessed data, source code and trained models for comparison and reproducibility.
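One commonly cited way to expose a sentence-level NMT model to document context is to prepend the preceding source sentences to each input, joined by a special break token, before feeding the result to a standard Transformer. The sketch below illustrates that preprocessing step only; the function name, `window` parameter and `<BRK>` token are illustrative assumptions, not details taken from the paper.

```python
def add_context(doc_sentences, window=2, brk="<BRK>"):
    """Augment each source sentence with up to `window` preceding
    sentences from the same document, separated by a break token.

    doc_sentences: list of source sentences in document order.
    Returns a list of context-augmented source strings.
    """
    augmented = []
    for i, sent in enumerate(doc_sentences):
        # Take at most `window` sentences immediately before position i.
        context = doc_sentences[max(0, i - window):i]
        augmented.append(f" {brk} ".join(context + [sent]))
    return augmented

doc = ["He went to the bank.", "It was closed.", "So he left."]
for line in add_context(doc, window=1):
    print(line)
```

Each augmented line can then be tokenised and translated as usual, with the model free to attend across the break token to resolve cross-sentence ambiguities such as pronoun reference.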

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 3775-3781
Number of pages: 7
ISBN (electronic): 9791095546344
Publication status: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 11 May 2020 - 16 May 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 11/05/20 - 16/05/20

