Abstract
The quality of translation work depends mainly on how well words are understood within their domain. If machine translation can accurately translate domain-specific words across languages, it can even avoid human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as such corpora are the basis of state-of-the-art machine translation systems. However, constructing a corpus with a large amount of parallel data is complicated. In this paper, a new crawling architecture, called Hybrid Crawling Architecture (HCA), is proposed, which efficiently and effectively collects parallel data from the Web for a bilingual corpus. HCA targets websites that contain articles in at least two different languages. As it is a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA combines the advantages of both. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200%, respectively, in terms of the quantity of data collected.
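The abstract does not give HCA's internals, so the following is only a minimal sketch of the general idea of mixing a focused crawler (a relevance-ordered frontier that does not expand off-topic pages) with a parallel crawler (several workers fetching at once). Everything in it is an assumption made for illustration: the in-memory `TOY_WEB`, the `/en/` and `/zh/` URL convention, the keyword-overlap `relevance` score, and the function names `hybrid_crawl` and `pair_by_path`. It should not be read as the authors' implementation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# No network access is performed; the data is invented for illustration.
TOY_WEB = {
    "site/en/a": ("machine translation corpus", ["site/zh/a", "site/en/b", "site/en/ads"]),
    "site/zh/a": ("machine translation corpus zh", ["site/en/a"]),
    "site/en/b": ("bilingual corpus crawling", ["site/zh/b"]),
    "site/zh/b": ("bilingual corpus crawling zh", []),
    "site/en/ads": ("cheap flights and hotels", ["site/en/more-ads"]),
    "site/en/more-ads": ("discount coupons", []),
}

# Assumed topic vocabulary for the focused part of the crawl.
TOPIC_TERMS = {"translation", "corpus", "bilingual"}


def relevance(text):
    """Count how many topic terms occur in the page text (a toy score)."""
    return len(set(text.split()) & TOPIC_TERMS)


def fetch(url):
    """Stand-in for an HTTP fetch: look the page up in the toy web."""
    return url, TOY_WEB.get(url, ("", []))


def hybrid_crawl(seeds, workers=2, min_relevance=1):
    """Crawl from the seeds, fetching up to `workers` pages per round and
    expanding only pages whose relevance reaches `min_relevance`."""
    # Min-heap of (negated parent relevance, url): links found on more
    # relevant pages are fetched first.
    frontier = [(0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Parallel part: take a batch of URLs and fetch them concurrently.
            batch_size = min(workers, len(frontier))
            batch = [heapq.heappop(frontier)[1] for _ in range(batch_size)]
            for url, (text, links) in pool.map(fetch, batch):
                score = relevance(text)
                if score < min_relevance:
                    continue  # Focused part: drop off-topic pages and their links.
                collected[url] = text
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        heapq.heappush(frontier, (-score, link))
    return collected


def pair_by_path(collected, src="/en/", tgt="/zh/"):
    """Pair pages whose URLs differ only in the assumed language segment."""
    pairs = []
    for url in sorted(collected):
        if src in url:
            other = url.replace(src, tgt)
            if other in collected:
                pairs.append((url, other))
    return pairs


if __name__ == "__main__":
    pages = hybrid_crawl(["site/en/a"])
    for en_url, zh_url in pair_by_path(pages):
        print(en_url, "<->", zh_url)
```

In this toy setup the off-topic page is fetched once, scored as irrelevant, and never expanded, so the pages it links to are not visited, while the on-topic pages are fetched in batches by the worker pool. A real crawler would replace `fetch` with an HTTP client and `relevance` with a proper topic classifier.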
Original language | English |
---|---|
Pages (from-to) | 122-127 |
Number of pages | 6 |
Journal | Procedia Computer Science |
Volume | 198 |
Publication status | Published - 2021 |
Event | 12th International Conference on Emerging Ubiquitous Systems and Pervasive Networks, EUSPN 2021 / 11th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare, ICTH 2021 - Leuven, Belgium; Duration: 1 Nov 2021 → 4 Nov 2021 |
Keywords
- Bilingual Corpus
- Focused Crawler
- Parallel Crawler