TY - JOUR
T1 - Crawling Parallel Data for Bilingual Corpus Using Hybrid Crawling Architecture
AU - Cheok, Sai Man
AU - Hoi, Lap Man
AU - Tang, Su Kit
AU - Tse, Rita
N1 - Publisher Copyright:
© 2021 Elsevier B.V. All rights reserved.
PY - 2021
Y1 - 2021
N2 - The quality of translation work depends mainly on understanding words in their domain. If machine translation can accurately translate the words of a domain across different languages, it can even avoid human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as it is always the basis of a state-of-the-art machine translation system. However, constructing a corpus with a large amount of parallel data is complicated. In this paper, a new crawling architecture, called Hybrid Crawling Architecture (HCA), is proposed, which efficiently and effectively collects parallel data from the Web for the bilingual corpus. HCA targets websites that contain articles in at least two different languages. As it is a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA has advantages over both architectures. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200% respectively, in terms of quantity.
AB - The quality of translation work depends mainly on understanding words in their domain. If machine translation can accurately translate the words of a domain across different languages, it can even avoid human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as it is always the basis of a state-of-the-art machine translation system. However, constructing a corpus with a large amount of parallel data is complicated. In this paper, a new crawling architecture, called Hybrid Crawling Architecture (HCA), is proposed, which efficiently and effectively collects parallel data from the Web for the bilingual corpus. HCA targets websites that contain articles in at least two different languages. As it is a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA has advantages over both architectures. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200% respectively, in terms of quantity.
KW - Bilingual Corpus
KW - Focused Crawler
KW - Parallel Crawler
UR - http://www.scopus.com/inward/record.url?scp=85124611548&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2021.12.218
DO - 10.1016/j.procs.2021.12.218
M3 - Conference article
AN - SCOPUS:85124611548
SN - 1877-0509
VL - 198
SP - 122
EP - 127
JO - Procedia Computer Science
JF - Procedia Computer Science
T2 - 12th International Conference on Emerging Ubiquitous Systems and Pervasive Networks, EUSPN 2021 / 11th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare, ICTH 2021
Y2 - 1 November 2021 through 4 November 2021
ER -