Ahmed, M., AlGhamdi, F., Hawwari, A. (2024). Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus. The Egyptian Journal of Language Engineering, 11(2), 1-12. doi: 10.21608/ejle.2024.308019.1070
Mohamed Attia Ahmed; Fahad AlGhamdi; Abdelati Hawwari. "Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus". The Egyptian Journal of Language Engineering, 11, 2, 2024, 1-12. doi: 10.21608/ejle.2024.308019.1070
Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus
2 Al-Baha University, Al-Baha, Saudi Arabia, fghamdi@bu.edu.sa
3 Datalex4ai, Santa Clara, California, USA
Abstract
Paraphrasing is one of the major, yet most challenging, tasks in the deep semantic analysis of natural languages. In this paper we present a novel algorithm that operates on a large parallel text corpus and automatically generates paraphrases in both languages of the corpus. Like several earlier algorithms for this task, ours exploits the bidirectional translation provided by large parallel text corpora to infer pairs of synonymous phrases; however, our algorithm is simpler and more efficient. Moreover, it is the only one that constructs whole paraphrases during its run, without any need for further post-processing. We implemented and ran our algorithm on the English-Arabic portion of the 2018 version of the OpenSubtitles (OPUS) parallel text corpora. Statistical evaluation of random samples showed the semantic quality of the phrases in the automatically generated paraphrases to be remarkably high.
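The general idea of inferring paraphrases from bidirectional translation can be illustrated with a minimal sketch. This is not the algorithm presented in the paper, whose details are not given in the abstract; it is a generic pivoting baseline under simplifying assumptions: sentences on one side that share an identical translation on the other side are treated as paraphrase candidates. The function name `pivot_paraphrases`, the whitespace and lowercase normalization, and the toy sentence pairs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pivot_paraphrases(pairs):
    """Group sentences that share an identical translation on the other side.

    `pairs` is an iterable of (english, arabic) aligned sentences.
    Returns (english_groups, arabic_groups), each a list of sorted lists.
    Illustrative baseline only, not the method described in the paper.
    """
    en_by_ar = defaultdict(set)  # Arabic sentence -> English sentences aligned to it
    ar_by_en = defaultdict(set)  # English sentence -> Arabic sentences aligned to it
    for en, ar in pairs:
        # Assumed normalization: collapse whitespace, lowercase the English side.
        en_key = " ".join(en.lower().split())
        ar_key = " ".join(ar.split())
        en_by_ar[ar_key].add(en_key)
        ar_by_en[en_key].add(ar_key)
    # A group is a paraphrase candidate only if it holds at least two variants.
    en_groups = [sorted(g) for g in en_by_ar.values() if len(g) > 1]
    ar_groups = [sorted(g) for g in ar_by_en.values() if len(g) > 1]
    return en_groups, ar_groups


# Toy aligned pairs (invented for this example, not taken from OpenSubtitles).
toy_pairs = [
    ("Thank you very much", "شكرا جزيلا"),
    ("Thanks a lot", "شكرا جزيلا"),
    ("Thanks a lot", "شكرا لك"),
    ("Let's go", "هيا بنا"),
]
en_groups, ar_groups = pivot_paraphrases(toy_pairs)
print(en_groups)
print(ar_groups)
```

On the toy data, the two English sentences aligned to the same Arabic line form one English group, and the two Arabic lines aligned to "Thanks a lot" form one Arabic group, so a single pass yields candidates in both directions. A real subtitle corpus would need stronger filtering than exact matching, since identical translations can also arise from loose or context-dependent subtitling.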