Toward Building a Comprehensive Phrase-based English-Arabic Statistical Machine Translation System

Abstract: This paper explores a phrase-based statistical machine translation (PBSMT) pipeline for the English-Arabic (En-Ar) language pair. The work surveys the most recent experiments conducted to enhance Arabic machine translation in the En-Ar direction. It also focuses on free datasets and linguistically motivated ideas that enhance phrase-based En-Ar statistical machine translation (SMT), as it aims to use only these in order to build a large-scale En-Ar SMT system. In addition, the paper highlights Arabic linguistic challenges in Machine Translation (MT) in general. This paper can be considered a guide for building an En-Ar PBSMT system. Furthermore, the presented pipeline can be generalized to any language pair.


INTRODUCTION
Developing an automatic Machine Translation (MT) system has posed many challenges to researchers throughout history. It has been worked on since the Second World War, when there was a shortage of human translators and instant translation was highly needed. Machine Translation has been tackled with various techniques, such as:
• Direct approach: This was the first type of MT to appear. It is called word-for-word translation. It is essentially a bilingual dictionary; each word in the source language is looked up in the dictionary to find the corresponding word in the target language. The process is divided into three steps: 1) Pre-processing the source text: analyze the source text morphologically and extract the lemma forms.
2) Dictionary look-up: find the translation of a single source word in a target-language dictionary. 3) Final output: generate the whole sentence after looking up each word separately. An obvious drawback of this approach is that it neglects sentence-level connections in the translation process. This suggests the need for more interaction between the three aforementioned steps.
• Transfer approach: This aims at transferring the source text into the target text through a middleware: the syntactic analysis of both languages. The steps can be formulated as follows: 1) Analyze: analyze the source text and parse it in order to obtain its parse tree.
2) Transfer: Transfer the source text parse tree into a new parse tree for the target language.
3) Generate: Generate the target text from the new parse tree.
• Interlingua approach: This tries to find a universal language into which any language can be translated. This universal language aims to be independent of, and an intermediary between, the source and target texts. In essence, this approach's idea is to represent the semantic analysis of the source text in an abstract logical form.
• Statistical approach: This does not require prior linguistic knowledge, which is its most important advantage. Statistical Machine Translation (SMT) is a promising direction in the MT field; with the huge amount of parallel data (translated documents) available, statistical models can be trained efficiently to translate between any language pair. In this paper we focus on SMT among the other MT methods. SMT is an MT paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual texts known as parallel corpora. Despite the fact that SMT is more widely used than other paradigms, it has shortcomings, some of which are:
o Corpus creation can be costly for users with limited resources.
o The results can be unexpected; superficial fluency can be deceiving.
o The benefits are overemphasized for European languages.

STATISTICAL MACHINE TRANSLATION SYSTEM PIPELINE
To the best of our knowledge, there are currently no surveys that explore state-of-the-art linguistic ideas, tools and available datasets for building an En-Ar SMT system. While Ebrahim et al. in Ref. [15] surveyed different modifications that contribute to the enhancement of En-Ar SMT systems, they did not review the available free tools and datasets needed to conduct real experiments. The aim of this work is to present such a survey.
SMT systems usually fall under one of two categories: phrase-based models or tree-based models (hierarchical phrase-based and syntax-based). For either category, SMT has a pipeline which is identical for any pair of languages; in other words, it is a linguistically independent pipeline. Enhancing specific linguistic features has been shown to boost automatic evaluation scores. This survey lists linguistically motivated ideas for Arabic, the target language of this survey, in an En-Ar SMT system. This paper focuses on phrase-based SMT systems and the linguistic enhancements proposed for Arabic.
As illustrated in figure 1, an SMT system starts with a parallel text for the language pair. The parallel text (also known as a parallel corpus) should be aligned at the sentence level; each line in one of the two files representing the language pair has its translation at the corresponding line number in the other file. In addition to the parallel corpus, the pipeline needs a monolingual corpus for the target language to train the language model (in our case it will be for the Arabic language). The idea of phrase-based SMT is summarized in three steps: 1. Create a lexicon of parallel phrases.
2. Calculate (estimate) the score or the probability of each possible translation for each phrase.
3. For a new sentence, search for the translation that obtains the highest score. This process is called decoding.
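The three steps above can be sketched as a toy program. All phrases and probabilities below are invented for illustration (the Arabic side is given in Buckwalter-style transliteration); a real system learns the phrase table from the aligned parallel corpus.

```python
# Toy sketch of the three phrase-based SMT steps.

# Step 1: a lexicon (phrase table) of parallel phrases, here with
# invented translation probabilities.
phrase_table = {
    "the big": [("Alkbyr", 0.7), ("kbyr", 0.3)],
    "hand": [("yd", 0.9), ("Ayd", 0.1)],
}

# Step 2: score (rank) every possible translation of a phrase.
def translations(phrase):
    return sorted(phrase_table.get(phrase, []), key=lambda t: -t[1])

# Step 3 (decoding, greatly simplified): pick the highest-scoring
# option for each phrase of the input, left to right.
def greedy_decode(phrases):
    out = []
    for p in phrases:
        options = translations(p)
        if options:
            out.append(options[0][0])
    return " ".join(out)
```

Calling `greedy_decode(["the big", "hand"])` then returns the most probable option for each phrase; a real decoder would instead search over many segmentations and reorderings, as discussed below.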

Figure 1 SMT System Architecture
The first step in the pipeline is the alignment process. The parallel corpus is aligned using the historically well-known IBM alignment models Ref. [39]. The alignments are extracted from an intersection of bidirectional alignments (En-Ar and Ar-En), in addition to some union alignments of the two processes. At this point, it is easy to extract a maximum-likelihood lexical translation table. A detailed, easy-to-understand, step-by-step explanation of the alignment idea can be found in a workbook published on the website of the Information Sciences Institute (ISI) at the University of Southern California 1 .
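The core of the IBM alignment idea can be illustrated with a minimal IBM Model 1: expectation-maximization (EM) over a toy two-sentence parallel corpus (the corpus and words are invented for illustration). The lexical probabilities t(f|e) learned this way are exactly the maximum-likelihood lexical translation table mentioned above.

```python
from collections import defaultdict

# Minimal IBM Model 1: EM estimation of lexical translation
# probabilities t(f | e) from a tiny invented parallel corpus.
corpus = [
    (["the", "house"], ["Albyt"]),
    (["the", "book"],  ["AlktAb"]),
]

# Uniform initialization over all co-occurring word pairs.
t = defaultdict(lambda: 1.0)

for _ in range(10):                              # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sent, f_sent in corpus:                # E-step
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)   # normalization
            for e in e_sent:
                c = t[(f, e)] / z                # expected count
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():              # M-step
        t[(f, e)] = c / total[e]
```

After a few iterations, EM has learned that "house" (which co-occurs only with "Albyt") explains "Albyt" better than the ambiguous "the" does, even though no alignments were ever annotated by hand.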
A language model (LM) is also trained using the monolingual corpus of the target language (the Arabic language in this survey). The LM indicates the extent to which the resultant target sentence is actually a valid Arabic sentence, while the translation model (TM) is trained on the parallel En-Ar corpus. After training the LM from the monolingual corpus and the TM from the parallel corpus, new sentences are ready to be translated by the system. Both models, LM and TM, derive from the noisy-channel model. The LM can be a trigram model, a factored model, or another type.
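The simplest instance of such an n-gram LM is a maximum-likelihood bigram model estimated from counts. The sketch below (with invented transliterated tokens, and no smoothing, which any practical toolkit would add) shows the estimation idea:

```python
from collections import defaultdict

# Maximum-likelihood bigram language model (no smoothing), the
# simplest instance of the count-based n-gram LMs discussed above.
def train_bigram_lm(sentences):
    bigram = defaultdict(float)
    unigram = defaultdict(float)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]       # sentence boundaries
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
            unigram[w1] += 1
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

p = train_bigram_lm([["ktb", "AlTAlb"], ["ktb", "Aldrs"]])
```

Here `p("ktb", "AlTAlb")` is 0.5, since "ktb" is followed by "AlTAlb" in one of its two occurrences; a trigram model simply extends the context to two preceding words.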
The noisy-channel model uses Bayes' rule and is illustrated in the context of SMT in equation (3):

â = argmax_a P(a | e) = argmax_a P(e | a) · P(a)   (3)

Then, the decoder searches for the best translation for each sentence, which is a set of phrases. In other words, the decoder searches for the translation that has the highest probability. The decoder should (in an ideal case) examine every possible translation; it should take each word and search for its best match, then take the second word and find the possible routes that achieve the highest translation score. This ideal case is impossible with current computing resources, as the number of available phrases increases exponentially with respect to the matched entries in the phrase table. Instead, the decoder performs a heuristic search, discarding less promising hypotheses, which makes the search process feasible. The decoding task, or search task, aims to find the target sentence (Arabic) that has the highest probability after calculating the product of the LM score and the TM score. Mathematically, this is a search over the set of all Arabic sentences (equation 4). Hence, a potential translation score is the product of two scores: the LM score P(a), which gives a prior distribution over which sentences are likely to be valid Arabic, and the TM score P(e | a), which indicates how likely the English sentence is as a translation of the target sentence.
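The argmax in equation (3) can be sketched directly. The candidate sentences and all probabilities below are invented for illustration; a real decoder builds and prunes the candidate set itself rather than receiving it as input.

```python
import math

# Noisy-channel scoring: among candidate Arabic sentences a, pick
# the one maximizing P(a) * P(e | a), i.e. LM score times TM score.
# All probabilities here are invented for illustration.
lm = {"ktb AlTAlb": 0.04, "AlTAlb ktb": 0.01}          # P(a)
tm = {("the student books", "ktb AlTAlb"): 0.5,        # P(e | a)
      ("the student books", "AlTAlb ktb"): 0.5}

def decode(english, candidates):
    # Work in log space, as real decoders do, to avoid underflow.
    def score(a):
        return math.log(lm[a]) + math.log(tm[(english, a)])
    return max(candidates, key=score)

best = decode("the student books", ["ktb AlTAlb", "AlTAlb ktb"])
```

Note that the TM scores the two candidates equally here, so the LM prior alone breaks the tie in favor of the more fluent word order, which is exactly the division of labor between the two models described above.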
The rest of this paper is organized as follows: an exploration of Arabic linguistic challenges is presented in the next section; section three lists available corpora (monolingual and parallel) for training an SMT system; section four covers the language-independent tools and frameworks needed to implement the SMT pipeline. Finally, the last section, the core of this paper, covers the linguistic tools and techniques that have proved to enhance En-Ar SMT evaluation scores.

ARABIC CHALLENGES IN MT
Despite the fact that SMT systems have shown reasonable results for close language pairs, the same results have not been achieved for distant language pairs. The fact that English-Arabic is a distant pair has meant that SMT systems targeting it have achieved unsatisfactory results. Moreover, Arabic has its own challenges within the Natural Language Processing (NLP) field. In the following sub-sections, the most important of those challenges are presented.

A. Orthographic Challenges
Arabic has a complex orthography when it comes to computational linguistics; for example, in the Optical Character Recognition (OCR) field, scientists face the problem of connected Arabic letters, which is very different from English, where letters are not connected to each other. Moreover, a letter in Arabic can have three different forms depending on whether it appears at the beginning, in the middle or at the end of a word.
Having a complex orthography, it is difficult for the majority of writers to produce correct Arabic forms. For example, it is confusing for many to differentiate between ‫ى"(‬ َ ‫"عل‬ which means the preposition on) and ‫"علي"(‬ which means the proper name Ali). This issue increases the sparsity (i.e. having many forms of the same word) and ambiguity (i.e. a word having multiple readings) in SMT language model and translation model training. In particular, the various forms of Hamzated Alif " ‫آ،‬ ‫إ‬ ‫أ،‬ " appear in almost all Arabic scripts with no Hamza ‫."ا"‬ Two forms of Ya are also used incorrectly at the end of a word: the dotless Ya ( ‫ى‬ ), or Alif-Maqsura, and the dotted Ya ( ‫ي‬ ) Ref. [17].

B. Morphological Challenges
Arabic is a morphologically complex language; compared with English, it is morphologically much richer. Arabic words are inflected for number and gender, and can be attached to different clitics such as:
• conjunction (w+, meaning 'and').
As an example, the nominal phrase wbsyyArAtnA ‫َا‬ ‫,وبسيارتن‬ and the verbal phrase wsnkAtbhum ‫كاتبھم‬ ُ ‫وسن‬ are cliticized as follows:
1. w+ s+ n+ kAtb+ hum
   and+ will+ we+ write+ to them
2. w+ b+ syyAr+ At+ nA
   and+ with+ car+ PL+ our
The richness of Arabic morphology leads to many surface forms in the parallel corpus when compared to the English side, and the sparsity problem appears. El Kholy and Habash in Ref. [18] noted that while the number of (morphologically untokenized) Arabic words in a parallel corpus is 20% less than the number of corresponding English words, the number of unique Arabic word types is over twice the number of unique English word types over the same corpus size.

C. Syntactical Challenges
Arabic syntax is more complex than English syntax. Among the many syntactical issues the Arabic language poses in the NLP field, three appear in the MT field: adjectives, verb-subject order and the Idafa construct (equivalent to the English possessive, of-relationship, and compound nouns). The illustrative examples in this section are from Ebrahim et al. in Ref. [15].

1) Arabic Adjectives
The structure of the noun phrase in Arabic is different from English; the Arabic adjective that modifies a noun agrees with the noun in definiteness, thus a definite article is added to it if the noun is definite, and vice versa:
1. Alyd Alkbyra
   the hand the big
   En: The big hand.
2. yd kbyra
   hand big
   En: A big hand.

2) Verb-subject Order
Close language pairs tend to have a similar structural order. Since Arabic is distant from English, it has a different order: the Verb-Subject-Object (VSO) order, while English has the Subject-Verb-Object (SVO) order. The SVO order exists in Arabic, but with a lower frequency than the VSO order, and it is not preferred. Examples (1) and (2) illustrate the different orderings in Arabic. Example (3) illustrates the gender agreement between the verb and the subject in the VSO order, while example (4) illustrates the agreement in verb-subject gender and number in the SVO order. Ar: ‫الدروس‬ ‫كتبوا‬ ‫األوالد‬

3) Idafa Construct
The Idafa construct in Arabic is the equivalent of the English possessive, of-relationship, and compound nouns. The translation of all three structures is the Idafa construct, which contains one or more indefinite nouns followed by a definite noun. The English phrases (the student books, the student's books and the books of the student), for example, are all translated into one Arabic phrase, (ktb AlTAlb -‫الطالب‬ ‫ُب‬ ‫ت‬ ُ ‫ك‬ ).

AVAILABLE CORPORA
Third-world universities lack funding in many fields, including NLP and SMT. Dataset availability was crucial for us to be able to carry out practical experiments. While many such datasets are available for purchase, lack of funding has forced us to search for freely available datasets for En-Ar SMT, bearing in mind that an SMT system needs both a monolingual corpus and a parallel corpus to train the language model and the translation model. In the following section, we introduce the free datasets for En-Ar SMT that we were able to locate.

A. The United Nations Parallel Corpus
In 2009, Rafalovitch et al. in Ref. [46] published a parallel corpus for six languages extracted from United Nations (UN) documents; this corpus can be downloaded in different formats 4 . It was extracted from the translation memories of the UN by individual researchers and was not officially published by the UN.
Then, in 2010, Eisele et al. in Ref. [16] described the extraction process of the UN documents. They discussed the methods used for crawling and formatting documents, as well as for sentence alignment. Moreover, they provided a test set that can help in the evaluation of an SMT system. The paper is available for reading but, at the time of writing this paper, the corpus download page is encountering an error 5 .
In 2016, an official UN parallel corpus was released Ref. [53]. The new corpus was published in the six official UN languages and is sentence-aligned. Moreover, the authors provided the parallel corpus with development and test sets that can be used in any SMT system. The En-Ar parallel corpus contains 111,241 files with 18,539,207 lines. Most SMT systems targeting English-Arabic translation have employed a smaller number of lines for training. The size of the new UN corpus is promising for building future En-Ar SMT systems, because the quality of SMT systems often relies heavily on the size of the training corpora for both the language model and the translation model. The new UN corpus is also known as the UNv1.0 corpus and can be downloaded in different formats 6 .

B. The Linguistic Data Consortium Corpora
The Linguistic Data Consortium (LDC) is an institution that publishes a wide set of language resources periodically 7 . LDC is rich in parallel and monolingual corpora. Even though these corpora are not free, LDC accepts applications for a data scholarship that, once granted, allows its holder to access the datasets freely. The data-scholarship program is offered twice a year; the first round takes place in mid-September, while the second round is in mid-January 8 .
With respect to the task of En-Ar SMT, LDC has released the fifth edition of the Arabic Gigaword corpus, which is the most widely used monolingual corpus for training the language model in Arabic SMT research papers. The Arabic Gigaword corpus consists of Arabic news articles collected from nine online news sources (such as Asharq Al-Awsat, Agence France Presse and Al Hayat) 9 . The fourth edition of the same corpus has 848469 separated tokens in 2716995 documents 10 .
LDC has also released a number of Ar-En parallel corpora, examples of which are those with the following catalog numbers: LDC2014T03, LDC2014T08, LDC2014T19, LDC2014T22, LDC2014T05, LDC2014T10, LDC2014T14, LDC2013T14 and LDC2013T10 11 . In 2007, LDC released an automatically extracted parallel dataset, LDC2007T08 12 . If this corpus results in improved evaluation scores in any En-Ar or Ar-En SMT system, this will be huge progress towards MT, because it means that building an En-Ar SMT system will not need human translators in order to have a ready parallel corpus. However, this has yet to be investigated.
In order to evaluate an SMT system, a set of source texts and human translation references is required. LDC has published a number of evaluation sets such as: LDC2014T02, LDC2013T07, LDC2013T03, LDC2010T10, LDC2010T11, LDC2010T12, LDC2010T14, LDC2010T17, LDC2010T21, LDC2010T23 and LDC2010T01. These datasets are all made available by NIST OpenMT 13 . According to the official website, NIST OpenMT is an evaluation series that supports research in, and helps advance the state of the art of, machine translation (MT). Most of the NIST evaluation sets target evaluating Ar-En systems. To overcome this issue when evaluating an En-Ar SMT system, some researchers duplicate the Arabic translation four times in order to have more references and obtain an efficient automatic evaluation score.

C. Abu El-Khair Corpus
Ibrahim Abu El-Khair published in Ref. [19] the Abu El-Khair Corpus, a Modern Standard Arabic corpus. The Abu El-Khair corpus is reported to contain more than five million newspaper articles and a billion and a half words in total.

LANGUAGE INDEPENDENT TOOLS
A phrase-based statistical machine translation (PBSMT) system is trained using aligners and language-model creators. Then, a decoder uses the trained models in order to translate new sentences. Since SMT tends to be language-independent, most papers' experimental results are reported with automatic evaluation scores. In this section, we explore the language-independent tools for building a PBSMT system and the different automatic evaluation scores.

A. Translation Model Generators
GIZA++ is the most popular free parallel-corpus aligner Ref. [39] 14 . MGIZA is the multi-threaded version of GIZA++ and is kept up to date to work with most compilers 15 . In 2016, Cadigan et al. Ref. [8] released a distributed-computing version of GIZA++ implemented over Apache Spark; its speed is reported to be up to 5.6x that of GIZA++ and 2.6x that of the multi-threaded MGIZA.
Berkeley has an active research group in the area of NLP, with many of its projects and tools published on its official website 16 . One of those tools is the BerkeleyAligner, a word-alignment software package that implements novel algorithms for unsupervised word alignment 17 .
Anymalign is another aligner, described by Lardilleux et al. in Ref. [36]. One of its reported main advantages over similar tools is that it can align any number of languages simultaneously 18 . Finally, there is Chaski, a distributed PBMT training tool based on Hadoop 19 .

B. Language Model Generators
Heafield Ref. [30] described the implementation of KenLM, a language model toolkit which is reported to offer smaller and faster language model queries 20 . In benchmarking experiments done with the Europarl parallel corpus, KenLM was reported to outperform the speed of BerkeleyLM Ref. [45] by 4.49x 21 . KenLM was later integrated into Moses Ref. [34] (Moses is an SMT decoder and a comprehensive SMT system that will be discussed later in this section). KenLM is LGPL-licensed (i.e. available for commercial use). SRILM Ref. [49] is a toolkit for applying statistical LMs. SRILM is used in SMT and other NLP subfields. It has been under development in the SRI Speech Technology and Research Laboratory since 1995. SRILM is free to use in projects that do not receive external funding other than government research grants and contracts 22 . IRSTLM, like KenLM, is an LGPL-licensed toolkit for generating statistical LMs, and is available for commercial use 23 .
BerkeleyLM is a library for storing large n-gram LMs efficiently in memory. BerkeleyLM is described in Ref. [45], and is reported to be faster than SRILM and nearly as fast as KenLM, despite the results reported in the KenLM benchmark experiments 24 . RandLM Ref. [50] is an LM toolkit that uses randomized data structures, which distinguishes it from both SRILM and IRSTLM. The tool is recommended on the official Moses web page when a user wants to build the largest LMs possible (e.g. a 5-gram model on one hundred billion words). The result can be LMs ten times smaller than those of other LM toolkits. Talbot and Osborne described the technical details of RandLM in Ref. [50]. RandLM can be downloaded from SourceForge 25 .
In 2013, Vaswani et al. Ref. [52] published a Neural Probabilistic Language Model toolkit (NPLM) 26 . Then, in 2014, Paul et al. described in Ref. [44] a neural-network LM framework for machine translation which they called OxLM (Oxford LM). The framework can be downloaded from SourceForge 27 .

C. Decoders
Moses is an open-source toolkit for statistical machine translation Ref. [34]. An SMT framework is a more precise description of Moses, because most of the aforementioned LM and translation-model training toolkits are integrated within it. Moreover, Moses has been technically supported since 2005. It was built by researchers for research, has an active support mailing list, and its code is shared on GitHub. Different institutions have contributed to Moses's development, including the University of Edinburgh (UK), Fondazione Bruno Kessler (Italy), Charles University (Czech Republic), DFKI (Germany), RWTH Aachen (Germany) and others. Its website is a wide gateway towards understanding, implementing and enhancing SMT systems 28 . Documentation on the website illustrates the installation steps and usage of all the integrated tools. Moses is also available for commercial use because it is LGPL-licensed. In addition, Moses has an experiment management system, described in Ref. [35], which automates the whole SMT pipeline (illustrated in figure 2); the workflow is generated automatically (e.g. in figure 3). A transliteration model was recently integrated with Moses, described and implemented by Durrani et al. Ref. [13]. A transliteration tool is important in Arabic SMT because Arabic is not written in Roman characters. This increases the number of out-of-vocabulary words (OOVs), and transliteration can help reduce OOVs by transliterating Named Entities (NEs).
Thot is a toolkit for PBSMT. Ortiz et al. published the new interactive toolkit in Ref. [41]. It comes with a number of improvements, such as: integration with a set of pre/post-processing tools, increased portability (it compiles on many different platforms), improved checking for runtime errors, early detection of bugs using built-in checks, and translation that can be executed in parallel through either multi-threading or distributed-computing paradigms. It has a detailed and reviewed manual 29 .
Developed at Stanford, Phrasal is another phrase-based SMT decoder, written in Java. The details of the Phrasal decoder are described in Ref. [2]. Docent is a document-level phrase-based SMT decoder Ref. [3]. It is worth noting that the Docent team acknowledges the work on Moses and KenLM in producing the Docent system 30 . Dyer et al. Ref. [14] described cdec, a decoder, aligner and model optimizer for SMT based on context-free formalisms.
GREAT Ref. [23] Ref. [24] is a decoder based on stochastic finite-state transducers, and it includes a training toolkit. Gonzalez et al. described the latest enhancements to the GREAT decoder in Ref. [25]. The research lab that published GREAT describes other MT tools on its website 31 . Interactive GREAT (iGREAT) is available for download 32 .
Marie is an n-gram-based SMT decoder developed in 2006 by Josep M. Crego as part of his PhD thesis Ref. [37]. The decoder's details were published in the Computational Linguistics Journal under the title "N-gram-based machine translation", and the tools are available for research purposes 33 .
Phramer is an open-source statistical phrase-based machine translation decoder that was released in 2006 Ref. [40] and is available for download 34 .

Pharaoh is another machine translation decoder for phrase-based systems, released to the research community in 2004 Ref. [32]. It aimed to aid research in SMT. It was developed by Philipp Koehn as part of his PhD thesis at the University of Southern California and the Information Sciences Institute. It is worth noting that Pharaoh was the first phrase-based SMT decoder, and Philipp Koehn is the founder of, and a main contributor to, the Moses community mentioned earlier in this section.

D. Automatic Evaluation Metrics
Despite the fact that automatic evaluation for SMT is still a controversial topic, automatic evaluation scores are frequently used in reporting experimental results. BLEU (BiLingual Evaluation Understudy) was the first automatic evaluation score to be introduced Ref. [42]. It is a quick and language-independent score. It relies on a numerical metric of translation closeness, based on the belief that the closer an MT output is to a human translation, the better it is. In addition, it relies on good-quality human translations (called references) of the test corpus.
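The closeness metric behind BLEU can be sketched as follows. This is a simplified sentence-level variant (clipped n-gram precisions up to bigrams, a single reference, and a brevity penalty); the official corpus-level BLEU uses up to 4-grams, multiple references and aggregate counts.

```python
import math
from collections import Counter

# Simplified sentence-level BLEU: clipped n-gram precisions (n=1,2)
# combined geometrically, multiplied by a brevity penalty.
def bleu(candidate, reference, max_n=2):
    c, r = len(candidate), len(reference)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # "Clipped" counts: a candidate n-gram is credited at most as
        # often as it appears in the reference.
        clipped = sum(min(v, ref[g]) for g, v in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)
    return bp * math.exp(log_prec / max_n)
```

A candidate identical to the reference scores 1.0, and any mismatch lowers the score, which is the "closer to a human translation is better" intuition made numerical.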
There are other automatic evaluation metrics, such as the Translation Error Rate (TER), which measures the number of edits required to change a system output into one of the references 35 , and the METEOR 36 and RIBES 37 metrics.
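A simplified TER-style score can be computed with word-level edit distance. Note this is only an illustration of the "edits per reference word" idea: real TER additionally allows block shifts of word sequences as a single edit, which is omitted here.

```python
# Simplified TER-style score: word-level edit distance (insertions,
# deletions, substitutions) divided by the reference length.
# Real TER also counts block shifts, omitted here for brevity.
def ter(hyp, ref):
    m, n = len(hyp), len(ref)
    # Classic dynamic-programming (Levenshtein) table over words.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(n, 1)
```

Unlike BLEU, lower is better here: a perfect hypothesis scores 0.0, and one substitution in a two-word reference scores 0.5.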

E. Manual Evaluation Metrics
MT human evaluation is an important topic, as it is often considered the most reliable way of evaluating an MT system. However, manual evaluation is often very costly. This is the reason that motivated Chatzitheodorou Ref. [10] to release COSTA MT, an open-source Java evaluation tool that can be used to facilitate the manual evaluation of MT output. As reported, COSTA is simple to use and is designed to allow developers and users of MT systems to analyze their engines within a friendly environment. It ranks the quality of MT system output segment by segment for a particular language pair 38 . Appraise Ref. [21] is another open-source tool for manually evaluating MT output. The author of Appraise has described it as a tool that allows the collection of human evaluations of translation output, implementing annotation tasks and manual post-editing Ref. [21]. Appraise has also been used in the ACL WMT evaluation campaign.

TOOLS FOR LINGUISTICS ENHANCEMENTS
This paper focuses on SMT in the English-Arabic direction, even though SMT tends to be language-independent. The reason is that we are interested in revitalizing the Arabic language, as many studies have shown that learning in one's native language is more effective and enhances creativity.
In Ref. [15], Ebrahim et al. state that machine translation in the Ar-En direction has more funding institutions than the En-Ar direction. This statement is supported by Farghaly and Shaalan Ref. [20], who explained that the need to understand what is said and written in Arabic rose significantly after the events of September 11th, 2001. This applies to communication in airports and via text messages and telephone calls, where there was a lack of human translators. This fact was also stated by Koehn Ref. [33], who said that, due to the involvement of US funding agencies, most research groups focus on translation from Arabic into English and Chinese into English. Next to text-to-text translation, there is increasing interest in speech-to-text translation.
It is logical that any enhancement that improves the Ar-En direction should have an impact on En-Ar. This has proved true in many studies that aimed at improving Ar-En by processing the Arabic language on different levels (i.e. orthographically, morphologically and syntactically). There are different processing methods for the Arabic corpus: morphological tokenization/detokenization, orthographic normalization/denormalization, syntactic reordering and Part-of-Speech (POS) tagging. On the other side of the corpus, processing the English language has proved to help SMT efficiency, for example through POS tagging, down-casing, cleaning (e.g. adding spaces around punctuation) and Named Entity Recognition (NER). In this section, we explore the most recent studies that targeted processing for Arabic SMT and the tools used to perform them.

A. Orthographically Processing Techniques 1) Orthographic Normalization
Orthographic normalization is an Arabic text pre-processing step which normalizes some miswritten characters to one base form. It is sometimes referred to simply as "a normalization process" Ref. [6] Ref. [18]. Depending on the characters, there are two normalization forms: reduced normalization (RED) and enriched normalization (ENR). Reduced normalization converts all Hamzated Alif forms ( ‫إ‬ ، ‫أ‬ ، ‫آ‬ ) into bare Alif ( ‫ا‬ ), and turns Alif-Maqsura, the dotless Ya ( ‫ى‬ ), into a dotted Ya ( ‫ي‬ ), while enriched normalization chooses the appropriate form of Alif. The two forms were introduced in Ref. [18], where only the reduced form was used. Linguistically, the two forms change the meaning of some words and lead to incorrect Arabic text, but the enriched form of Arabic is more realistic and is the desirable one to evaluate against.
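Reduced normalization as described above amounts to a character substitution, which can be sketched directly over the Unicode code points of the characters involved:

```python
# Reduced normalization (RED) as a simple character substitution:
# map the Hamzated Alif forms to bare Alif, and Alif-Maqsura
# (the dotless Ya) to the dotted Ya, as described above.
RED_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alif with Hamza above -> bare Alif
    "\u0625": "\u0627",  # Alif with Hamza below -> bare Alif
    "\u0622": "\u0627",  # Alif with Madda       -> bare Alif
    "\u0649": "\u064A",  # Alif-Maqsura (dotless Ya) -> dotted Ya
})

def reduce_normalize(text):
    return text.translate(RED_MAP)
```

Enriched normalization, by contrast, cannot be written as a lookup table like this, since choosing the appropriate Alif form requires context, which is why it needs a machine-learned component (as discussed below for MADA).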

2) Orthographic De-normalization
Orthographic de-normalization is a post-processing technique to de-normalize normalized text. In order to produce correct Arabic script, a reduced, tokenized (a morphological process that will be discussed in the next subsection) output should be enriched and de-tokenized. Two methods were proposed by El Kholy and Habash in Ref. [17].
Normalizing text to the reduced form can be done through a simple character-substitution script, but converting it to the enriched form requires a machine learning algorithm. As stated above, the enriched form can be produced using the MADA toolkit Ref. [29]. There is a problem downloading MADA at the time of writing this paper, but MADA was merged with AMIRA Ref. [12], a toolkit for Arabic processing, in 2014, under the name MADAMIRA Ref. [43]. MADAMIRA is free to use for research purposes 39 .
Orthographic normalization is not the first step in pre-processing Arabic text; a cleaning step is advised as the best start. SPLIT is a unified preprocessing tool for SMT corpora; its goal is to standardize the preprocessing steps to avoid the drastic changes led to by various preprocessing techniques. SPLIT has normalization options besides the cleaning steps. SPLIT was developed by the Natural Language Processing research lab at George Washington University 40 . AlBadrashiny et al. described the details of SPLIT in Ref. [1].

B. Morphologically Processing Techniques 1) Morphological Tokenization
Morphological tokenization is a pre-processing technique that separates cliticized Arabic words into parts. Arabic words are highly cliticized; for example, the word wsnkAtbhum ‫وسنكاتبھم‬ (which means "and we will write to them") is cliticized as follows:
w+ s+ n+ kAtb+ hum
and+ will+ we+ write+ to them
Tokenization reduces sparsity on the Arabic side of the parallel corpus; without tokenization, Arabic will have more surface forms than the English side Ref. [6].
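The clitic-splitting idea can be sketched as a naive rule-based splitter over Buckwalter-style transliteration. The clitic lists and the minimum-stem heuristic below are invented for illustration; real tokenizers such as MADAMIRA and FARASA rely on full morphological analysis and disambiguation rather than string rules.

```python
# Naive rule-based clitic splitter for Buckwalter-transliterated
# Arabic: peel known proclitics off the front and one enclitic off
# the back, leaving at least `min_stem` characters of stem.
# The clitic inventories here are an illustrative subset only.
PROCLITICS = ["w", "s", "n", "b", "l", "Al"]
ENCLITICS = ["hum", "hA", "h", "nA", "k"]

def tokenize(word, min_stem=3):
    prefixes, suffixes = [], []
    changed = True
    while changed:                          # strip proclitics greedily
        changed = False
        for p in PROCLITICS:
            if word.startswith(p) and len(word) - len(p) >= min_stem:
                prefixes.append(p + "+")
                word = word[len(p):]
                changed = True
                break
    for s in ENCLITICS:                     # strip at most one enclitic
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffixes.insert(0, "+" + s)
            word = word[:-len(s)]
            break
    return prefixes + [word] + suffixes
```

On the running example, `tokenize("wsnkAtbhum")` yields `['w+', 's+', 'n+', 'kAtb', '+hum']`, matching the w+ s+ n+ kAtb+ hum analysis above; the point is only to show why splitting clitics collapses many surface forms onto one stem.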
The terms "morphological tokenization" and "segmentation" are often used interchangeably Ref. [6] Ref. [18], despite a claim that there is a difference between the two. El Kholy and Habash Ref. [18] illustrated the difference with an example: maktbthom (their library) -‫مكتبتھم‬ is segmented as (maktbt + hum -‫ھم‬ + ‫)مكتبت‬ , and maktbt is not the correct Arabic word (which is maktaba -‫مكتبة‬ ). Some of the adjustment rules in the tokenization process, according to El Kholy and Habash in Ref. [18], are illustrated in figure 4. The work presented in Ref. [6] reported that (T+R) was the best technique, while the work presented in Ref. [17] reported that (T+R+LM), one of the two techniques added later, was the best.
There are processing tools for Arabic morphological tokenization/detokenization, often called segmenters. The Stanford Arabic segmenter, released in 2014, implements the segmenter detailed in Ref. [38]; it is available for research purposes (http://nlp.stanford.edu/software/segmenter.shtml) and also performs orthographic normalization. In 2014, MADAMIRA was released Ref. [43]: a system for processing Arabic that includes a morphological analysis and disambiguation module in addition to a segmentation module. In 2016, QCRI (Qatar Computing Research Institute) released the FARASA segmenter Ref. [4]; FARASA also includes other modules for Arabic processing, including a POS tagger, a diacritizer, and a dependency parser.

C. Syntactic Reordering
To apply their syntactic reordering rules on the English source side, Badr et al. Ref. [7] parsed the English corpus with the Collins parser Ref. [11], after tagging it with the Stanford POS tagger and splitting the text into smaller sentences; the sentences were then tagged with a maximum entropy tagger Ref. [48]. A named entity recognition (NER) step was carried out with the Stanford NER for location, person, and organization entities. They discovered that the rule that replicates the definite article "the" before adjectives hurts the translation score.
On the other hand, Habash Ref. [28] experimented with similar rules in the opposite direction (i.e., Ar-En SMT). The results were less promising than in the En-Ar direction; Arabic parsers are of lower quality than English parsers, which might be the cause. However, a few Arabic parsers were released after that publication, such as the Stanford Arabic parser Ref. [26], and these still need to be tested on an SMT task.
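One of Badr et al.'s reordering rules (Ref. [7]) transforms prepositional phrases of the form "N1 of N2 ... of Nn" into "N1 N2 ... Nn", matching the Arabic Idafa construction. The real rule operates on parse-tree NP/PP nodes; the regex sketch below over plain text is only a rough illustration of its surface effect:

```python
import re

def reorder_pp(sentence):
    """Illustrative flattening of "N1 of N2 ... of Nn" into "N1 N2 ... Nn".

    Badr et al.'s rule is applied to parse-tree constituents; this plain
    regex over the surface string only approximates its effect and will
    over-apply to non-nominal uses of "of".
    """
    prev = None
    while prev != sentence:  # iterate to a fixed point for chained "of"s
        prev = sentence
        sentence = re.sub(r"\b(\w+) of (\w+)\b", r"\1 \2", sentence)
    return sentence
```

Running it on a chained genitive such as "minister of defense of Egypt" yields the Idafa-like "minister defense Egypt", which aligns more directly with the Arabic construct-state word order.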

D. Multi-word Expressions
In the scope of machine translation, multi-word expressions (MWEs) are phrases whose meaning differs from the literal meaning of their words taken separately. Detecting MWEs is an active area of research, and they commonly trouble junior human translators who are not fluent in the source language. A significant amount of research addresses detecting MWEs, but far less work concerns integrating them into MT systems.
Even though modeling MWEs in SMT is a hard task, Ghoneim and Diab Ref. [22] described three methods to integrate MWEs into the Moses SMT system, extending the work of Carpuat and Diab Ref. [9]. The study concentrated on how the integration is done, with less focus on the MWE extraction process itself. MWEs were extracted from lexical databases, the English WordNet 3.0, and named entity recognizers (NEs are a type of MWE). For detecting MWEs, the mwetoolkit can help: it is a framework for language-independent MWE identification from corpora Ref. [47]. Moreover, Attia et al. published an automatic technique to extract Arabic MWEs Ref. [5], and one of the co-authors of that work, an Arabic linguist, published a manually extracted list of Arabic MWEs on his personal website.
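The simplest form of dictionary-based MWE identification is greedy longest-match lookup against a fixed lexicon. This toy sketch shows only that lookup step; real systems such as the mwetoolkit (Ref. [47]) rely on statistical association measures over corpora, and the lexicon entries here are invented for illustration:

```python
def find_mwes(tokens, lexicon):
    """Greedy longest-match MWE identification against a fixed lexicon.

    A toy dictionary-lookup sketch; statistical MWE extraction (as in the
    mwetoolkit) is a separate, harder problem not attempted here.
    """
    matches = []
    i = 0
    max_len = max((len(m) for m in lexicon), default=1)
    while i < len(tokens):
        # Try the longest candidate span first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                matches.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1
    return matches
```

Once identified, a matched span can be glued into a single token (e.g. `kick_the_bucket`) before alignment, which is in the spirit of the "static" integration strategies discussed for Moses.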

CONCLUSIONS
SMT remains an active research area, and dedicated researchers, junior scientists, and institutions are needed to advance the En-Ar direction. Although En-Ar is under-represented in the SMT research community, the linguistic enhancements surveyed here promise further improvements. Recently, the Arabic SMT community has concentrated on dialectal Ar-En translation (i.e., Egyptian Arabic-En, Syrian Arabic-En, etc.), but we encourage working on Standard Arabic SMT instead of dialects, because news, books, and scientific documents need to be understandable by all Arabic speakers, not those of a single country.

Examples of Arabic word order and agreement:
1. ktb Alwld Aldrs (wrote the-boy the-lesson). En: The boy wrote the lesson. Ar: ‫كتب الولد الدرسَ‬
2. Alwld ktb Aldrs (the-boy wrote the-lesson). En: The boy wrote the lesson. Ar: ‫الولد كتب الدرس‬
3. ktb AlA'wlAd Aldrws (wrote the-boys the-lessons). En: The boys wrote the lessons. Ar: ‫كتب الأولاد الدروس‬
4. AlA'wlAd ktbw Aldrws (the-boys wrote the-lessons). En: The boys wrote the lessons.

Figure 5. A sentence in the various tokenization schemes. Source: Ref. [18]

2) Morphological De-tokenization
Morphological detokenization is a post-processing technique that converts tokenized Arabic output back to its original, uncliticized form; the terms "detokenization" and "recombination" are used interchangeably Ref. [17] Ref. [6]. Badr et al. introduced four detokenization techniques in Ref. [6]: (S) simple, (R) rule-based, (T) table-based, and (T+R) table+rule. El Kholy and Habash Ref. [17] added two more: (T+LM) table+language modeling, and (T+R+LM). The work presented in Ref. [6] reported that (T+R) was the best technique, while Ref. [17] reported that (T+R+LM), one of the two later additions, was the best.
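The table+rule (T+R) idea can be sketched as follows. The table entries below are hypothetical placeholders, not the mappings learned in Ref. [6], and the rule fallback is reduced to plain concatenation; a real rule component also handles spelling adjustments such as Ta-Marbuta changes:

```python
# Sketch of (T+R) detokenization: look the tokenized form up in a mapping
# table learned from the training corpus, and fall back to a simple rule
# (plain concatenation here) for unseen forms. The table entry below is
# an illustrative placeholder, not taken from Ref. [6].
DETOK_TABLE = {
    "mktbt +hm": "mktbthm",  # hypothetical entry learned from training data
}

def detokenize(tokens: str, table=DETOK_TABLE) -> str:
    """Recombine a clitic-tokenized word: table lookup first, rules second."""
    if tokens in table:  # (T) table-based lookup
        return table[tokens]
    # (R) rule fallback: strip the clitic markers and join the pieces.
    return tokens.replace("+ ", "").replace(" +", "").replace("+", "")
```

The table handles forms whose recombination involves a spelling change, while the rule fallback guarantees every unseen form still produces output; (T+LM) and (T+R+LM) additionally score alternative recombinations with a language model.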

3) Factored Models
Factored models are used to make training more reliable. In 2008, Badr et al. Ref. [6] experimented with factored models, with factors on both sides of the corpus (English and Arabic). The English factors were the surface form and the POS tag; the Arabic factors were the surface form, the stem, and the POS tag along with the segmented clitics. For example, the Arabic word wlAwlAdh ("and for his kids") has the factors AwlAd and w+l+N+P:3MS. In 2013, Khemakhem et al. Ref. [31] highlighted a problem in MT scoring: as they describe it, scoring relies on word history more than on other features of the words. For example, katab ("to write") and kotob ("books") share the same undiacritized surface form but have different meanings in context, so the diacritics of Arabic words should be important features when training on the parallel corpus. Khemakhem et al. proposed two factors for Arabic words: the word itself and its syntactic class (e.g., noun, verb, particle, proper noun).

Syntactic reordering is a process that aims at bridging the gap between Arabic and English syntax: English sentences are written in SVO order, while Arabic favors VSO. Badr et al. Ref. [7] applied a set of rules to the source language (English) for better alignment. Using a parse tree, the rules are:
1. NP (noun phrase): inverts the nouns, adjectives, and adverbs inside an NP.
2. PP (prepositional phrase): transforms prepositional phrases of the form N1 of N2 ... of Nn into N1 N2 ... Nn.
3. Definite article (the): replicates "the" before adjectives.
4. VP (verb phrase): converts SVO order into VSO.
These rules were applied with the Collins parser Ref. [11].
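In a factored corpus of the kind Badr et al. trained on, each token carries its factors joined by "|", which is the format Moses expects for factored models. A minimal helper to produce that representation (the factor values are the ones quoted for wlAwlAdh above; the function name is our own):

```python
def to_factored(surface_tokens, factor_lists):
    """Join each token with its factors using '|', as in a Moses-style
    factored corpus. Factor names and order are whatever the model was
    configured with; here they follow the example from Ref. [6]."""
    return " ".join(
        "|".join([tok] + list(factors))
        for tok, factors in zip(surface_tokens, factor_lists)
    )
```

So the word wlAwlAdh with its stem and POS-plus-clitics factors becomes a single pipe-delimited token, and the translation and generation steps of the factored model can each consume a different slice of those factors.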