Constructing High Quality Bilingual Corpus using Parallel Data from the Web

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

A natural language machine translation system requires a high-quality bilingual corpus to translate efficiently and accurately. In this paper, we propose a method for constructing a bilingual corpus from parallel data on the Web, which significantly speeds up corpus construction. The proposed method has four phases. Parallel data is first pre-processed and refined into three sets of data for training a CNN model. Using the trained model, future parallel data can then be selected, classified and added to the bilingual corpus. The training results show a test accuracy of 98.46%. Furthermore, precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
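As a rough illustration of two steps named in the abstract, the sketch below splits a list of sentence pairs into three subsets and computes precision, recall and F1-score for a binary classifier. It is not the authors' implementation: the function names, the 80/10/10 ratios, the reading of the "three sets" as train/validation/test, and the 1 = "keep pair" label convention are all assumptions made for this example.

```python
import random


def split_three_ways(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentence pairs and split them into three subsets.

    The ratios are an assumption for illustration; the abstract only
    says the data is refined into three sets.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_first = int(len(shuffled) * ratios[0])
    n_second = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_first],
        shuffled[n_first:n_first + n_second],
        shuffled[n_first + n_second:],
    )


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical placeholder pairs, not data from the paper.
    data = [(f"source {i}", f"target {i}") for i in range(10)]
    train, val, test = split_three_ways(data)
    print(len(train), len(val), len(test))
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In practice, a library routine such as scikit-learn's metrics module would normally replace the hand-written metric function; the plain-Python version is used here only to keep the example self-contained.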

Original language: English
Title of host publication: IoTBDS 2022 - Proceedings of the 7th International Conference on Internet of Things, Big Data and Security
Editors: Denis Bastieri, Gary Wills, Peter Kacsuk, Victor Chang
Publisher: Science and Technology Publications, Lda
Pages: 127-132
Number of pages: 6
ISBN (Electronic): 9789897585647
Publication status: Published - 2022
Event: 7th International Conference on Internet of Things, Big Data and Security, IoTBDS 2022 - Virtual, Online
Duration: 22 Apr 2022 to 24 Apr 2022

Publication series

Name: International Conference on Internet of Things, Big Data and Security, IoTBDS - Proceedings
Volume: 2022-April
ISSN (Electronic): 2184-4976

Conference

Conference: 7th International Conference on Internet of Things, Big Data and Security, IoTBDS 2022
City: Virtual, Online
Period: 22/04/22 to 24/04/22

Keywords

  • Bilingual Corpus
  • CNN Modelling
  • Machine Translation
  • Parallel Data
