Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus

Ahmed, Mohamed Attia; AlGhamdi, Fahad; Hawwari, Abdelati

doi:10.21608/ejle.2024.308019.1070

Document Type : Original Article

Authors

¹ RDI; www.rdi-eg.ai

² Al-Baha University, Al-Baha, Saudi Arabia, fghamdi@bu.edu.sa

³ Datalex4ai, Santa Clara, California, USA

Abstract

Paraphrasing is one of the major yet most challenging tasks in the deep semantic analysis of natural languages. In this paper, we present a novel algorithm that operates on a large parallel text corpus and automatically generates paraphrases in both languages of the corpus. Like several earlier algorithms for this task, ours exploits the bidirectional translation provided by large parallel text corpora to infer pairs of synonymous phrases; however, it is simpler and more efficient. Moreover, it is the only one that constructs the whole paraphrase during its run, with no need for further post-processing. We implemented and ran our algorithm on the English-Arabic text corpus from the 2018 version of the OpenSubtitles (OPUS) parallel text corpora, and statistical evaluation of random samples showed the semantic quality of the phrases within the automatically generated paraphrases to be remarkably high.
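
The general idea of inferring paraphrases from bidirectional translation can be illustrated with a minimal sketch. It is not the algorithm presented in the paper, whose details the abstract does not give. It assumes a simplified setting in which two sentences count as paraphrases when they share an identical translation on the other side of the corpus. The function name `pivot_paraphrases`, the whitespace and case normalization, and the sample pairs are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical toy sample of aligned (English, Arabic) subtitle lines.
SAMPLE_PAIRS = [
    ("Thank you.", "شكرا لك."),
    ("Thanks.", "شكرا لك."),
    ("Thank you.", "شكرا."),
    ("Hello.", "مرحبا."),
]


def pivot_paraphrases(pairs):
    """Group sentences that share an identical translation.

    Assumption: exact-match pivoting on whole sentences, which is far
    cruder than a real phrase-level method.
    Returns (english_groups, arabic_groups), each a list of sorted lists.
    """
    en_by_ar = defaultdict(set)  # Arabic pivot -> English candidates
    ar_by_en = defaultdict(set)  # English pivot -> Arabic candidates
    for en, ar in pairs:
        # Light normalization: collapse whitespace, lowercase English only.
        en_norm = " ".join(en.lower().split())
        ar_norm = " ".join(ar.split())
        en_by_ar[ar_norm].add(en_norm)
        ar_by_en[en_norm].add(ar_norm)
    # A pivot yields a paraphrase group only if it has 2+ distinct partners.
    en_groups = [sorted(g) for g in en_by_ar.values() if len(g) > 1]
    ar_groups = [sorted(g) for g in ar_by_en.values() if len(g) > 1]
    return en_groups, ar_groups
```

Calling `pivot_paraphrases(SAMPLE_PAIRS)` groups the two English renderings of the same Arabic line, and the two Arabic renderings of the same English line, which is why the resulting dataset is bidirectional. A realistic system would also need to handle noisy alignments and ambiguous pivots, which this sketch ignores.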

Keywords