Crawling Parallel Data for Bilingual Corpus Using Hybrid Crawling Architecture

Research output: Conference article, peer-reviewed

13 citations (Scopus)

Abstract

The quality of a translation depends mainly on how well the words are understood within their domain. If machine translation can accurately translate domain-specific words across languages, it can even avoid human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as such corpora are the basis of state-of-the-art machine translation systems. However, constructing a corpus with a large amount of parallel data is complicated. This paper proposes a new crawling architecture, the Hybrid Crawling Architecture (HCA), which efficiently and effectively collects parallel data from the Web for a bilingual corpus. HCA targets websites that contain articles in at least two different languages. As a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA combines the advantages of both. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200%, respectively, in terms of quantity.
