Crawling Parallel Data for Bilingual Corpus Using Hybrid Crawling Architecture

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)

Abstract

The quality of translation work depends mainly on understanding words within their domain. If machine translation can accurately translate the words of a domain across different languages, it can even prevent human communication errors. To achieve this, a high-quality bilingual corpus is crucial, as it is the basis of any state-of-the-art machine translation system. However, constructing a corpus with a large amount of parallel data is complicated. In this paper, a new crawling architecture, called Hybrid Crawling Architecture (HCA), is proposed, which efficiently and effectively collects parallel data from the Web for a bilingual corpus. HCA targets websites that contain articles in at least two different languages. As a mixture of the Focused crawling architecture and the Parallel crawling architecture, HCA combines the advantages of both. In intensive experiments on crawling parallel data on relevant topics, HCA significantly outperforms the Focused crawling architecture and the Parallel crawling architecture by 30% and 200%, respectively, in terms of the quantity of data collected.

Keywords

  • Bilingual Corpus
  • Focused Crawler
  • Parallel Crawler
