Egyptian Dialect Stopword List Generation from Social Network Data

This paper proposes a methodology for generating a stopword list from online social network (OSN) corpora in Egyptian Dialect (ED). The aim of the paper is to investigate the effect of removing ED stopwords on the Sentiment Analysis (SA) task. Previously generated stopword lists target Modern Standard Arabic (MSA), which is not the language commonly used in OSN. We have generated a stopword list for the Egyptian dialect to be used with OSN corpora. We compare the efficiency of text classification when using the generated list, previously generated MSA lists, and a combination of the Egyptian dialect list with the MSA list. Text classification was performed using Naïve Bayes and Decision Tree classifiers and two feature selection approaches, unigram and bigram. The experiments show that removing ED stopwords gives better performance than using MSA stopword lists only.


INTRODUCTION
The web has recently become a very important source of information as it has turned into a read-write platform. The dramatic growth of OSN, video sharing sites, online news, online review sites, online forums and blogs has drawn considerable attention to user-generated content in the form of unstructured free text, because of its importance for many businesses. The web is no longer used by English speakers only; many OSN users write in languages other than English, e.g. Arabic.
Arabic is a Semitic language as in Ref. [1] and consists of many different regional dialects. These dialects are true native language forms that are used in informal daily communication, but they are not standardized or taught in schools as in Ref. [2]. Nevertheless, internet users, especially on OSN sites and on some blogs and review sites, use their own dialect to express their feelings. The only formal written standard for Arabic is MSA, which is commonly used in written media and education. There is a large degree of difference between MSA and most Arabic dialects, and MSA is not actually the native language of any Arab country as in Ref. [3]. SA systems that can analyze OSN content in Arabic dialects are therefore needed.
Sentiment Analysis is the computational study of people's opinions, attitudes, and emotions towards topics covered by reviews or news as in Ref. [4]. SA is also considered a classification process, namely the task of classifying a text as expressing a positive or a negative sentiment as in Ref. [5] - [7]. The classification process is usually formulated as a two-class problem: positive and negative. Since it is a text classification problem, any existing supervised learning method can be applied, e.g., the Naïve Bayes (NB) classifier.
There is a lack of language resources for Dialect Arabic (DA), as most resources are developed for MSA. Using DA in SA requires text processing techniques such as stopword removal or Part-of-Speech (POS) tagging. Some stopword lists and POS taggers are publicly available, but they all target MSA, not DA. This paper tackles the problem of removing ED stopwords. Egyptian users are the most frequent users of OSN among Arab countries.
Stopwords are common words that are used in many sentences and have no significant semantic relation to the context in which they occur. Some researchers have generated stopword lists, but to the best of our knowledge no one has generated a stopword list for ED. Reference [8] proposed an algorithm for removing stopwords based on a finite state machine, using a previously generated MSA stopword list. Reference [9] created a corpus-based list from newswire and query sets, and a general list using the same corpus, and then compared the effectiveness of these lists in information retrieval systems. These lists are also for MSA. Reference [10] generated an MSA stopword list from the most frequent meaningless words that appear in their corpus.
The aim of this paper is to investigate the effect of removing stopwords on SA for OSN Egyptian Dialect data. The data were collected from the OSN sites Facebook and Twitter, as in Ref. [11] - [20], on Egyptian movies. We also used an Arabic review site that allows users to write reviews of the movies (https://www.elcinema.com). The language used in the reviews is syntactically simple and includes many Egyptian dialect words. OSN data are noisy and unstructured. Abbreviations and smiley faces are frequently used in OSN and sometimes on the review site too. Many preprocessing and cleaning steps are needed to prepare these data for SA [21]. Arabic users write either in Arabic script or in Franco-arab (Arabic words written in English letters), e.g. the word "maloosh", which stands for "مالوش" and means "doesn't have". This is an Egyptian dialect word that is written in MSA as "ليس له". Sometimes users put an English word in the middle of an Arabic sentence, and it must be translated.
We tackle the problem of classifying reviews and OSN data about movies into two classes, positive and negative, as first presented in Ref. [5], but for the Arabic language. That work used unigrams and bigrams as Feature Selection (FS) techniques and showed that using unigrams as features gives the highest accuracy with NB. We use the same feature selection techniques, unigram and bigram, with NB and Decision Tree (DT) as classifiers.
We propose a methodology for generating stopword lists from the corpora. The methodology consists of three phases: calculating each word's frequency of occurrence, checking the validity of a word as a stopword, and adding possible prefixes and suffixes to the generated words.
The contributions of this paper are as follows. First, we propose a methodology for creating a stopword list for the Egyptian dialect that is suitable for OSN corpora. Second, we use a prepared corpus from Facebook, which has not been tackled in the literature before for ED. Third, tackling OSN data in ED is new, as it has not been investigated before. Finally, applying the DT classifier to these kinds of corpora is new, as it has not been investigated much in the literature.
The paper is organized as follows: Section 2 presents the methodology. The stopword list generation is tackled in Section 3. The experimental setup and results are presented in Section 4. A discussion of the results and an analysis of the corpora are presented in Section 5. Section 6 presents the conclusion and future work.

METHODOLOGY
The data were collected from Twitter, Facebook, and a review site on the same topic in ED. The topic was the movies recently shown in theatres for the last festival on the first of August 2014. The movies were: "الفيل الأزرق", which means "The blue elephant"; "صنع فى مصر", which means "Made in Egypt"; "الحرب العالمية التالتة", which means "The third world war"; and "جوازة ميري", which means "official marriage". The downloaded data were related tweets from Twitter, comments from some of the movies' Facebook pages, and users' reviews from the review site elcinema.com.

A. Corpora Preparation
The downloaded data are prepared so that they can be fed to the classifier, as shown in Fig. 1. This preparation was proposed in [21].

Figure 1: Arabic Corpora Preparation from Reviews, Facebook, and Twitter
After removing Facebook comments that contain only URLs or advertising links, 1459 comments remained. Removing comments expressed by photos only reduced them to 1415. Removing comments that mention friends with no other words reduced them to 1296. Finally, removing non-Arabic comments reduced them to 1261.
The number of tweets downloaded was 1787. Removing tweets that contain only URLs, advertising links, or links to watch the movie reduced them to 1069; some of these were links to certain scenes or related videos on YouTube. Because the Twitter search was done by movie name only, and the names can have other meanings, unrelated tweets also had to be removed, which reduced the tweets to 862. Removing non-Arabic tweets reduced them to 781.
The number of reviews downloaded from the review sites was 32. The reviews needed only two steps of preparation as shown in Fig. 1.
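The filtering steps described above can be approximated with a few regular-expression checks. The following is a minimal sketch under stated assumptions, not the code used for the paper: the function names, the patterns, and the sample items are hypothetical, friend mentions are assumed to look like `@name`, and "Arabic" is approximated by the presence of at least one character from the basic Arabic Unicode block. The manual steps (photo-only comments, unrelated tweets) are not covered, and Franco-arab items would have to be recognised or transliterated before such a script-based check is applied.

```python
import re

# Hypothetical patterns; real OSN data may need broader ones.
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
MENTION_RE = re.compile(r"@\w+")
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block


def strip_noise(text):
    """Remove URLs and @mentions, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def keep_item(text):
    """Keep an item only if words remain after noise removal
    and at least one Arabic-script character is present."""
    remaining = strip_noise(text)
    if not remaining:
        return False  # URL-only or mention-only item
    return bool(ARABIC_RE.search(remaining))


def filter_corpus(items):
    """Return the items that survive all automatic filters."""
    return [item for item in items if keep_item(item)]


sample = [
    "http://example.com/watch",          # URL only
    "@friend",                           # mention only
    "great movie",                       # non-Arabic
    "الفيلم جميل http://example.com",     # Arabic text plus a link
]
print(len(filter_corpus(sample)))
```

Each filter is kept as a separate check so that the number of items removed at each step can be counted, as reported in the text.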
After preprocessing, cleaning and filtering, the data must be annotated before they can be fed to the supervised classifiers. The first experiment describes the annotation method and gives the number of positive and negative items.

B. Text processing and Classification
After annotation, we applied the stopword removal text processing technique to the three corpora with different alternative stopword lists, which are:
-A general MSA list: this list contains a combination of three published lists. The first one is a project that generated stopwords with all possible suffixes and prefixes. The other two were published in (https://code.google.com/p/stop-words/source/browse/trunk/stop-words/stop-words/stop-words-arabic.txt) and (http://www.ranks.nl/stopwords/arabic) respectively.
-A generated Egyptian-dialect list: this list is generated from the most frequent words in the corpora that can be stopwords, in addition to the Egyptian dialect stopwords that appeared in the corpora.
-A combination of the Egyptian dialect list and the MSA list.
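The three alternatives can be pictured as three sets applied to the same token sequence. The sketch below is only an illustration: the tiny lists and the sample sentence are assumptions chosen for readability, whereas the actual lists contain hundreds of entries.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


# Toy lists for illustration only (assumed, not the published lists).
msa_list = {"من", "في", "على"}
ed_list = {"بس", "لازم", "مفيش"}
combined_list = msa_list | ed_list  # third alternative: union of both

# Hypothetical comment: "the movie is nice but you must see it in the cinema"
tokens = "الفيلم حلو بس لازم تشوفه في السينما".split()

for name, stopwords in [("MSA", msa_list), ("ED", ed_list), ("MSA+ED", combined_list)]:
    print(name, remove_stopwords(tokens, stopwords))
```

Using sets keeps the membership test cheap, and the combined list is simply the union of the two others.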
Text classification is applied to the three corpora using two feature selection techniques and two classifiers, as shown in Fig. 2. We tested two well-known supervised learning classifiers: Naïve Bayes (NB) as in Ref. [22] and Decision Tree (DT) as in Ref. [23]. There are many other kinds of supervised classifiers in the literature as in Ref. [24]. The two chosen classifiers represent two different families of classifiers. NB is a probabilistic classifier and is one of the simplest and most commonly used classifiers. DT, on the other hand, is a hierarchical decomposition of the data space and does not depend on calculating probabilities. The tests used two different feature selection (FS) techniques: unigrams, which depend on word presence, and bigrams as in Ref. [5].

STOPWORD LIST GENERATION
Stopwords are common words that generally do not contribute to the meaning of a sentence, specifically for the purposes of information retrieval and natural language processing. Common English words that do not affect the meaning of a sentence include "a", "the", and "of". Removing stopwords reduces the corpus size without losing important information. In some corpora, specific words may not contribute to the meaning, like the word "movie" in a movie reviews corpus, although the same word is meaningful in a news corpus. Such a word could be considered a stopword when analyzing the movie reviews corpus.
The common strategy for determining a stopword list is to calculate the frequency of appearance of each word in the document collection then to take the most frequent words. The selected terms are often hand-filtered for their semantic content relative to the domain of the documents being indexed, and marked as a stopword list.
The English stopword list is general and contains 127 words like (all, just, being…). Arabic is a lexically very rich language, so generating a stopword list for it takes several steps. First, we specify some general conditions for a word to be a stopword:
-They give no meaning if they are used alone.
-They appear frequently in the text.
-They are general words and not used specifically in a certain field.
The methodology for generating the stopword lists is shown in Fig. 3. It consists of three phases, as illustrated in the following subsections.

A. Calculating words frequency
The three corpora are tokenized into words. This phase was done fully automatically using Python code and the nltk 2.0 toolkit. The results are not entirely meaningful, as the tokenizer may treat a comma as a word if it is not used correctly, so some manual filtering is done after tokenization.
The reviews corpus gives 3781 unique words, the Facebook corpus gives 1451 unique words, and the Twitter corpus gives 1160 unique words. This shows that although the reviews are far fewer than the OSN items, they are lexically richer. After combining the three vocabularies and removing duplicates, the list of all words contains 4818 words. We then calculated the frequency of occurrence of each word in this list over the three corpora combined.
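The counting step can be sketched with the standard library alone. This is a simplified stand-in, not the nltk-based code used in the paper: the regular-expression tokenizer is an assumption (it drops punctuation such as commas automatically, and it would split words at Arabic diacritics, which OSN text rarely contains), and the three toy corpora are hypothetical.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def tokenize(text):
    """Split a document into word tokens, dropping punctuation."""
    return TOKEN_RE.findall(text)


def vocabulary(corpus):
    """Set of unique words in one corpus (a list of documents)."""
    return {tok for doc in corpus for tok in tokenize(doc)}


def word_frequencies(*corpora):
    """Frequency of every word over all given corpora combined."""
    counts = Counter()
    for corpus in corpora:
        for doc in corpus:
            counts.update(tokenize(doc))
    return counts


# Hypothetical toy corpora.
reviews = ["الفيلم حلو جدا", "الفيلم ده حلو"]
facebook = ["الفيلم تحفة"]
twitter = ["فيلم حلو، بس طويل"]

all_words = vocabulary(reviews) | vocabulary(facebook) | vocabulary(twitter)
freq = word_frequencies(reviews, facebook, twitter)
top_candidates = [word for word, _ in freq.most_common(3)]
print(len(all_words), top_candidates)
```

The union of the per-corpus vocabularies corresponds to "combining them and removing the duplicates", and `most_common(n)` gives the candidates for the next phase (n = 200 in the paper).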

B. The validity of words to be a stopword
To generate the corpus-based list, we took the 200 most frequent words. Not all of these words are general; some are domain specific, like the words "المشاهد" and "الفيلم", which mean (the spectator, the movie) respectively. This list contains words in both MSA and Egyptian dialect.

Figure 3: A methodology of generating ED stopword list
Diacritics can change the meaning of a word, e.g. the word "المشاهد" could mean (the spectator or the scenes). The difference can be told from the meaning of the sentence. OSN users use simple language without diacritics. Since the word belongs to the context of the corpora, it is likely to appear frequently with both meanings. A problem would occur if a word appeared frequently but outside the context of the corpora. This case did not happen here.
To generate a general list of Egyptian dialect stopwords, we took the 200 most frequent words and removed the semantically meaningful words, which are likely to be nouns and verbs. We then added every Egyptian dialect word in the corpora to the remaining frequent, semantically meaningless words. Each word is validated as follows. If the word is an MSA word, we check whether it exists in the MSA stopword lists; if it does not, we check whether its English equivalent exists in the English stopword list. For example, the word "من" exists in the MSA list, whereas the word "جداً" does not exist in the MSA list but has an equivalent in the English list, which is "very". If the word is in Egyptian dialect, we look for its MSA equivalent in the MSA list, and if it does not exist there we check its English equivalent in the English stopword list. For example, the MSA equivalent of the word "بس" is "فقط", and it also has an equivalent in the English stopword list, which is "only". In contrast, the MSA equivalent of the word "لازم", which is "لابد", is not in the MSA list, but its English equivalent "should" is in the English list. Therefore, it is considered a stopword. The final list of valid unique words contains 100 words. This phase was done semi-automatically and includes a manual check.
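The validation rule can be written as a short decision function. This is a sketch under stated assumptions: the function name, the lookup dictionaries (`ed_to_msa`, `to_english`) and the tiny stopword sets are hypothetical stand-ins for resources that were consulted manually or semi-automatically, and the entries simply mirror the examples given in the text.

```python
def is_stopword(word, dialect, msa_stopwords, english_stopwords, ed_to_msa, to_english):
    """Apply the validation rule described in the text.

    dialect is "MSA" or "ED". An MSA word is looked up directly in the
    MSA list; an ED word is looked up through its MSA equivalent. If that
    fails, the English equivalent is checked against the English list.
    """
    if dialect == "ED":
        if ed_to_msa.get(word) in msa_stopwords:
            return True
    elif word in msa_stopwords:
        return True
    return to_english.get(word) in english_stopwords


# Toy resources for illustration only (assumed, mirroring the text's examples).
msa_stopwords = {"من", "فقط"}
english_stopwords = {"very", "only", "should"}
ed_to_msa = {"بس": "فقط", "لازم": "لابد"}
to_english = {"جدا": "very", "بس": "only", "لازم": "should", "الفيلم": "movie"}

checks = [("من", "MSA"), ("جدا", "MSA"), ("بس", "ED"), ("لازم", "ED"), ("الفيلم", "MSA")]
for word, dialect in checks:
    print(word, is_stopword(word, dialect, msa_stopwords, english_stopwords, ed_to_msa, to_english))
```

In practice the two dictionaries are the manual part of the phase, which is why the text describes it as semi-automatic.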

C. Adding possible prefixes and suffixes to the words
Arabic is a lexically very rich language with a large number of prefixes and suffixes that can be added to a word to change its meaning. For example, the prefix "ال", which means "the", changes a word from indefinite to definite. The suffix "هم" gives the meaning of the pronoun "them". We added some frequently used prefixes to the words generated in both lists, which are (ال، و، ب، ف، ل، ك). Where necessary we added pronoun suffixes, which are ( نا ، ). We added these suffixes to possession words in Egyptian dialect like the word "بتاعى", which means (mine).
Some letters are also written in different forms, so for any word that contains such letters we write all of its possible forms, such as (ي، ى), (ة، ه), (ا، أ، إ). The last group depends on the word itself. The lists are manually revised to remove improper or meaningless words.
After adding the prefixes and suffixes, the final general Egyptian dialect list contains 730 words.
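The expansion phase can be sketched as generating spelling variants first and then attaching affixes. This is an illustrative sketch, not the authors' procedure: the suffix list is an assumption (only "نا" and the example "هم" are recoverable from the text), and restricting the (ي، ى) and (ة، ه) alternations to the last letter and the alef alternation to the first letter is a simplifying assumption. The output over-generates and still needs the manual revision mentioned above.

```python
PREFIXES = ["ال", "و", "ب", "ف", "ل", "ك"]   # prefixes listed in the text
SUFFIXES = ["نا", "هم"]                       # assumed, illustrative suffixes
FINAL_VARIANTS = [("ي", "ى"), ("ة", "ه")]     # assumed to apply to the last letter
INITIAL_VARIANTS = [("ا", "أ", "إ")]          # assumed to apply to the first letter


def spelling_variants(word):
    """Return the word with its first/last letter replaced by each alternative form."""
    if len(word) < 2:
        return {word}
    firsts = next((g for g in INITIAL_VARIANTS if word[0] in g), (word[0],))
    lasts = next((g for g in FINAL_VARIANTS if word[-1] in g), (word[-1],))
    return {first + word[1:-1] + last for first in firsts for last in lasts}


def expand(word, add_suffixes=False):
    """Generate candidate stopword forms: variants, optional suffixes, one prefix."""
    forms = set()
    for base in spelling_variants(word):
        stems = [base]
        if add_suffixes:
            stems += [base + suffix for suffix in SUFFIXES]
        for stem in stems:
            forms.add(stem)
            forms.update(prefix + stem for prefix in PREFIXES)
    return forms


print(sorted(expand("بس")))
print(len(expand("بتاع", add_suffixes=True)))
```

Suffixes are optional per word because the text adds them only where necessary, for example to possession words.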

EXPERIMENTAL SETUP AND RESULTS
We used an HP Pavilion desktop computer, model p6714me-m, with an Intel(R) Core(TM) i5-2300 CPU @ 2.80 GHz, 4 GB of RAM, and a 64-bit operating system. We measured the training time in seconds using a built-in Python function. All tests were performed using the Natural Language Toolkit (nltk 2.0), which is implemented in python 3.1 as in Ref. [25].
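The text does not name the timing function that was used. As one possible way to measure training time in seconds, the sketch below wraps any callable with `time.perf_counter`; the helper name `timed` is hypothetical, and older Python versions would need a different clock such as `time.time`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the experiments this would be the classifier training call.
result, seconds = timed(sum, range(1000))
print(result, seconds >= 0)
```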

A. Data Annotation
The reviews from the review site had already been rated on the site, on a scale from 1 to 10. Ratings greater than 5 are considered positive, ratings less than 5 are considered negative, and ratings equal to 5 are neutral. We annotated the reviews according to the site rating.
We manually annotated the OSN corpora. Manual annotation is more reliable, as human analysis of such data is so far better than machine analysis. Table I shows the number of positive, negative and neutral reviews, comments, and tweets resulting from the annotation.
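The rating rule maps directly to a small function. The function name is hypothetical; the thresholds are the ones stated above.

```python
def label_from_rating(rating):
    """Map a 1-10 site rating to a sentiment label."""
    if rating > 5:
        return "positive"
    if rating < 5:
        return "negative"
    return "neutral"


print([label_from_rating(r) for r in (1, 5, 8)])
```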

B. Classifiers Preparation
We trained Naïve Bayes and Decision Tree classifiers. The classifiers were implemented with the nltk 2.0 toolkit. Some of the parameters passed to the DT classifier can be tweaked to improve accuracy or decrease training time as in Ref. [25].
The parameters are:
-Entropy cutoff: used during the tree refinement process. If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff, then the tree is refined further. But if the entropy is lower than the entropy_cutoff, then tree refinement is halted. Entropy is the uncertainty of the outcome. As entropy approaches 1.0, uncertainty increases and vice versa. Higher values of entropy_cutoff will decrease both accuracy and training time. It was set to '0.8'.
-Depth cutoff: used during refinement to control the depth of the tree. The final decision tree will never be deeper than the depth_cutoff. Decreasing the depth_cutoff will decrease the training time and most likely decrease the accuracy as well. It was set to '5'.
-Support cutoff: controls how many labeled feature sets are required to refine the tree. When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree. Support_cutoff specifies the minimum number of instances that are required to make a decision about a feature. It was set to '30'.
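To make the three stopping rules concrete, the sketch below computes the label entropy of a tree node and decides whether the node would be refined. It is a simplified illustration of the rules as described above, not NLTK's implementation; the function names are hypothetical and the default values are the settings reported in the text. In NLTK itself these settings are passed as the `entropy_cutoff`, `depth_cutoff` and `support_cutoff` keyword arguments of `DecisionTreeClassifier.train`.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (base 2) of the label distribution at a node."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_refine(labels, depth_remaining, entropy_cutoff=0.8, support_cutoff=30):
    """Decide whether a node is split further under the three cutoffs."""
    if len(labels) <= support_cutoff:
        return False  # too few labeled feature sets
    if depth_remaining <= 0:
        return False  # depth_cutoff reached
    return label_entropy(labels) > entropy_cutoff


balanced = ["pos"] * 50 + ["neg"] * 50    # maximally uncertain two-class node
skewed = ["pos"] * 90 + ["neg"] * 10      # mostly one class
print(should_refine(balanced, depth_remaining=5))
print(should_refine(skewed, depth_remaining=5))
```

With two classes the entropy ranges from 0 (pure node) to 1.0 (evenly split node), so a cutoff of 0.8 stops refinement on all but fairly mixed nodes.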

C. Feature Selection
Two feature selection (FS) techniques are used in the tests:
-Unigram: treats a document as a group of words (Bag of Words (BOW)) and constructs a word-presence feature set from all the words of an instance.
-Bigram: the same as unigram, but uses pairs of adjacent words.
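Both feature sets can be represented as dictionaries that map a feature to `True`, which is the kind of presence-based feature set an NLTK classifier accepts. This is a minimal sketch with hypothetical function names; taking bigrams as pairs of adjacent tokens is an assumption about how the pairs were formed.

```python
def unigram_features(tokens):
    """Word-presence features: every word in the instance maps to True."""
    return {tok: True for tok in tokens}


def bigram_features(tokens):
    """Presence features over pairs of adjacent words."""
    return {(first, second): True for first, second in zip(tokens, tokens[1:])}


tokens = ["الفيلم", "حلو", "جدا"]
print(unigram_features(tokens))
print(bigram_features(tokens))
```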

D. Results
We ran many experiments to test the effect of removing stopwords using the different lists, in combination with the two FS techniques and the two classifiers, on the three corpora. The training set of each corpus, which represents 75% of the original data, was divided into three training folds, each one third of the training set. Each trained model was evaluated against the same test data, which represents 25% of the original data. Table II reports the results averaged across the folds for each condition. The standard accuracy and F-measure were used to evaluate the performance of each test. Accuracy is defined as the ratio of the number of correctly classified reviews, comments, and tweets to the total number of data items. F-measure combines precision and recall in the following way:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
where precision is the ratio of the number of items correctly assigned to category C to the total number of items classified as category C, and recall is the ratio of the number of items correctly assigned to category C to the total number of items actually in category C. F-measure is computed for each category separately; Table II reports it as F-P for the positive category and F-N for the negative category. Table II shows that using the ED list, alone or together with the MSA list, gives better accuracy and F-measure in most cases. DT gives better results than NB on the Facebook data, as this corpus is extremely unbalanced.
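The evaluation measures defined above can be computed from two parallel lists of gold and predicted labels. The sketch below uses hypothetical function names and a made-up four-item example; it returns 0.0 when a ratio is undefined, which is a convention chosen here rather than something stated in the text.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def f_measure(gold, predicted, category):
    """Harmonic mean of precision and recall for one category."""
    true_pos = sum(g == category and p == category for g, p in zip(gold, predicted))
    predicted_as = sum(p == category for p in predicted)
    actually_in = sum(g == category for g in gold)
    precision = true_pos / predicted_as if predicted_as else 0.0
    recall = true_pos / actually_in if actually_in else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration.
gold = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg"]
print(accuracy(gold, predicted))
print(f_measure(gold, predicted, "pos"), f_measure(gold, predicted, "neg"))
```

Computing F-measure per category, as done for F-P and F-N, matters on unbalanced data because a high accuracy can hide a poorly recognised minority class.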

DISCUSSION
A. Corpora Analysis
Neutral reviews from the review site represent 3% of the review data, which is a small share. We believe that people who write whole reviews on review sites mostly have a complete opinion about the movie and want to express it; they do not tend to be neutral. Positive reviews represent 78% of the data, while negative reviews represent 18%. The data are clearly unbalanced: since the movies were successful in this season, not many user reviews were negative.
Neutral comments on Facebook represent 68% of the Facebook data. These are not neutral opinions on the movie, and people who write in OSN are not neutral at all. The neutral comments are mainly objective sentences that do not contain any sentiment. Many comments were just debates between users. Some expressed personal feelings, and some used adjectives without specifying to whom or what they referred. Positive comments represent 29% of the data and negative comments represent 2%, which is an extremely small percentage. These data are also unbalanced. We believe that people who visit a movie's page already like the movie.
Neutral tweets represent 69% of the Twitter data. These are not neutral opinions on the movie either. The neutral tweets are mainly objective sentences that do not contain any sentiment. Many of the tweets repeated a line of dialogue from a movie without expressing any feelings. Others expressed the users' personal feelings, like being excited to see the movie. Positive tweets represent 20% of the data and negative tweets represent 9%, which is a small percentage. We believe that people who mention a movie in their tweets like it.
Abbreviations and smiley faces are very frequent in OSN, and some abbreviations were also used in the reviews. The meanings of these abbreviations and smiley faces were found from different sources on the web (Yahoo answers, Facebook emoticons sites) and translated to Arabic. The Arabic abbreviations were translated manually. Table III contains a sample of the abbreviations and smiley faces found in the three corpora.

B. Specializations of Arabic Language
Words with the same meaning can be written in different correct ways, like the words "هنروح، حنروح". Both give the future tense of the verb "نروح" and mean "we will go". Note that three words in English are written as a single word in Arabic with the same meaning. Pronouns in English are expressed in Arabic by adding a prefix letter that modifies the verb, especially when it is used in the middle of the sentence, like "اروح، نروح", which mean (I go, we go) respectively. Some prepositions and causal words are expressed in Arabic with one letter, like the words "انى، لانى", which mean (I am, because I am) respectively.
The many forms that Arabic words can take are a very common characteristic of MSA and make the language complicated to deal with. For DA, the situation is much worse: there is a special dialect for each Arab country, and there are different dialects within the same country. In the Egyptian dialect, many words have no resemblance to MSA, like the word "مفيش", which means (there is not). Its MSA equivalent is "لا يوجد", which consists of completely different words. Some other dialects appear in the OSN corpora, like the Moroccan word "بزاف", which means (too much), and the Syrian word "مليح", which means (good). Other dialects represent 1% of the Facebook corpus, which is a very small percentage, and 0.5% of the Twitter corpus, which is an extremely small percentage. There were no other dialects in the reviews corpus. The reviewers used a mix of MSA words and Egyptian dialect words, as these are user reviews and not formal reviews from critics.
Another phenomenon among Arab users is the use of Franco-arab, in which people use English letters to write Arabic words, like the word "de7k", which stands for "ضحك" and means (laugh). Franco-arab comments represent 18% of the Facebook corpus, which is not a big percentage, and Franco-arab tweets represent 3% of the Twitter corpus, which is a small percentage. However, the language has to be unified for the classifier to perform well. These are not even English words with meanings, so they must be rewritten in Arabic letters. We used the website (www.yamli.com), which offers variations for each word from which one has to be chosen. Sometimes the users do not even write correct words in Franco-arab; in this case the site transliterates the letters only, which gives nonsensical Arabic words. This transformation was manually revised.
Fig. 4 shows that using different stopword lists did not change the accuracy except when using NB with bigrams. Removing stopwords using the ED list, or the ED list combined with the MSA list, gives better accuracy than using the MSA list alone. It also shows that unigrams are a better FS than bigrams with NB. Fig. 5 shows that the accuracy of DT is much higher than that of NB because the data are extremely unbalanced. There is no significant difference between unigrams and bigrams with DT, but unigrams are much better than bigrams with NB. Using lists containing ED stopwords increases the accuracy, except with DT, where the accuracy does not change. Fig. 6 shows that the difference between the accuracies of NB and DT is not as big as for Facebook because the data are less unbalanced, but DT still gives better performance. Using different lists did not change the accuracy much, but the general lists give good performance too.

C. Results Analysis
The OSN corpora are not as lexically rich as the review corpus. The lists containing Egyptian dialect stopwords give better results than using MSA stopwords only. The difference in performance between the Facebook and Twitter data is due to the degree of imbalance: the nature of the data is the same, but the Facebook corpus is much more unbalanced than the Twitter corpus.
Decision Tree is a hierarchical decomposition of the data space and does not depend on calculating probabilities, whereas Naïve Bayes calculates probabilities over the whole data. Although NB usually gives higher accuracy than DT, this was not the case for these corpora. This is due to the imbalance of the data, as the positive class in these cases was much bigger than the negative class. NB calculates probabilities over the whole data, while DT builds a more specific hierarchical decomposition of the data. That is why DT is better for unbalanced data. DT still has a longer processing time than NB because it builds the hierarchical decomposition over the whole data, but the difference in time is small as the data size was not large. In the NB tests, the accuracy is better when using unigrams, which is similar to what Ref. [5] found. In the DT tests, unigrams and bigrams give nearly the same results.

CONCLUSION AND FUTURE WORK
In this paper, we have proposed a methodology for generating an Egyptian dialect stopword list from online social network (OSN) corpora and a review site. The methodology consists of three phases: calculating each word's frequency of occurrence, checking the validity of a word as a stopword, and adding possible prefixes and suffixes to the generated words. We compared the generated list with MSA lists. The lists used in the comparison were previously generated MSA lists, the generated general Egyptian dialect list, and a combination of the Egyptian dialect list with the MSA list.
Movie reviews were chosen as the topic, and data about movies were downloaded from three different sources (review site, Facebook, and Twitter). The data are extremely unbalanced, as the movies were successful and most of the OSN users and reviewers liked them. The data contain a lot of noise, such as advertising URLs, debates, abbreviations, and smiley faces, and needed many preprocessing and cleaning steps to be prepared for classification.
Removing stopwords with multiple lists shows that the general lists containing Egyptian dialect words give better performance than using MSA stopword lists only. The Decision Tree classifier gives better results than the Naïve Bayes classifier for these kinds of corpora. Using unigrams gives better results than using bigrams.
In the future we plan to try more text processing techniques, such as POS tagging, on ED OSN data, and to help fill the gap in resources for Arabic dialects in OSN data, as all existing resources are designed for MSA. We could also tackle dialects other than Egyptian.