Mubarak, H., Shaban, K., Forat, M. (2014). Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus. The Egyptian Journal of Language Engineering, 1(1), 24-41. doi: 10.21608/ejle.2014.59857
Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus
1Arabic NLP Research and Development Department, Sakhr Software Co.
2Arabic NLP Research and Development Department, Sakhr Software Co.
Abstract
Part-of-speech (POS) tagging is a basic component of many Natural Language Processing (NLP) applications. Building a manually tagged corpus makes it possible to study key statistics of a given language, which form the basis for POS tagging systems. In this paper, we present both lexical and morphological statistics for Arabic derived from Sakhr's manually POS-tagged corpus. The corpus covers 7 million words of text from a wide range of Arab countries and domains over the years 2002-2004. The derived statistics are used as heuristics and preferential rules within a statistical diacritizer that achieves high accuracy in stem diacritization and POS disambiguation. The statistics include information on sentence and word lengths, punctuation marks, and the distribution of Arabic letters and diacritics, in addition to lexical and morphological information on POS distribution, stems, prefixes, suffixes, roots, morphological patterns, and morphosyntactic features such as gender, number, person, and case ending. Modern Standard Arabic (MSA) is studied by analyzing the coverage of stems, roots, morphological patterns, prefixes, and suffixes. Comparisons with an arbitrary English corpus are shown where applicable.
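As an illustration of the simplest surface statistics listed above (letter and diacritic distributions and average word length), the following minimal Python sketch counts them over raw diacritized text. It is not the authors' tooling: the function name `corpus_stats`, the whitespace tokenization, and the choice of Unicode ranges (letters U+0621 to U+064A excluding tatweel, diacritics U+064B to U+0652) are assumptions made for this example.

```python
from collections import Counter

# Assumed character classes, based on the Unicode Arabic block.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # fathatan .. sukun
LETTERS = {chr(c) for c in range(0x0621, 0x064B)} - {"\u0640"}  # drop tatweel


def corpus_stats(text):
    """Return letter counts, diacritic counts, and average word length.

    Word length is measured in letters only, so diacritics and
    punctuation do not inflate it. Tokens with no Arabic letters
    (numbers, punctuation) are ignored when averaging.
    """
    letters = Counter(ch for ch in text if ch in LETTERS)
    diacritics = Counter(ch for ch in text if ch in DIACRITICS)

    # Naive whitespace tokenization; a real pipeline would also
    # split off punctuation and clitics.
    lengths = [sum(ch in LETTERS for ch in word) for word in text.split()]
    lengths = [n for n in lengths if n > 0]
    avg = sum(lengths) / len(lengths) if lengths else 0.0

    return {"letters": letters, "diacritics": diacritics, "avg_word_length": avg}


if __name__ == "__main__":
    # Two diacritized words: "kataba" and "al-waladu".
    sample = (
        "\u0643\u064E\u062A\u064E\u0628\u064E "
        "\u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
    )
    stats = corpus_stats(sample)
    print(stats["avg_word_length"])
    print(stats["diacritics"].most_common())
```

The morphological statistics in the paper (stems, roots, patterns, affixes, morphosyntactic features) depend on the manual POS annotation and cannot be recovered from character counts like these.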