Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus

Mubarak, Hamdy; Shaban, Kareem; Forat, Mohamed

doi:10.21608/ejle.2014.59857

Document Type: Original Article

Authors

¹ Arabic NLP Research and Development Department, Sakhr Software Co.

² Arabic NLP Research and Development Department, Sakhr Software Co.

Abstract

Part-of-speech (POS) tagging is a basic component of many natural language processing (NLP) applications. Building a manually tagged corpus helps in studying key statistics of a language, which form the basis for POS tagging systems. In this paper, we present lexical and morphological statistics for Arabic derived from Sakhr's manually POS-tagged corpus. The corpus covers 7 million words of text from a wide range of Arab countries and domains over the years 2002-2004. The derived statistics are used as heuristics and preferential rules within a statistical diacritizer, which achieves high accuracy in stem diacritization and POS disambiguation. The statistics include sentence and word lengths, punctuation marks, and the distribution of Arabic letters and diacritics, in addition to lexical and morphological information on POS distribution, stems, prefixes, suffixes, roots, morphological patterns, and morphosyntactic features such as gender, number, person, and case ending. Modern Standard Arabic (MSA) is studied by analyzing the coverage of stems, roots, morphological patterns, prefixes, and suffixes. Comparisons with an arbitrary English corpus are shown where applicable.
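
As a rough illustration of the simplest statistics listed above (POS distribution, word length, and letter and diacritic counts), the following minimal sketch computes them from a list of (word, tag) pairs. It is not the authors' tooling: the function name `corpus_stats`, the toy sample, and the tag labels are all hypothetical, and the diacritic set is assumed to be the Unicode range U+064B to U+0652.

```python
from collections import Counter

# Assumed diacritic set: tanween forms, fatha, damma, kasra, shadda, sukun
# (Unicode code points U+064B through U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def corpus_stats(tagged_tokens):
    """Return simple lexical statistics for a list of (word, pos_tag) pairs."""
    pos_counts = Counter(tag for _, tag in tagged_tokens)
    letter_counts = Counter()
    diacritic_counts = Counter()
    total_letters = 0

    for word, _ in tagged_tokens:
        for ch in word:
            if ch in DIACRITICS:
                diacritic_counts[ch] += 1
            elif ch.isalpha():
                # Word length is measured in letters, excluding diacritics.
                letter_counts[ch] += 1
                total_letters += 1

    n = len(tagged_tokens)
    return {
        "pos_distribution": {t: c / n for t, c in pos_counts.items()} if n else {},
        "mean_word_length": total_letters / n if n else 0.0,
        "letter_counts": letter_counts,
        "diacritic_counts": diacritic_counts,
    }


# Hypothetical toy sample; the tags are illustrative, not Sakhr's tag set.
sample = [("كَتَبَ", "VERB"), ("الوَلَدُ", "NOUN"), ("دَرْسًا", "NOUN")]
stats = corpus_stats(sample)
print(stats["pos_distribution"])
print(stats["mean_word_length"])
```

Counting letters separately from diacritics keeps word length comparable between diacritized and undiacritized text, which is one plausible reading of how such statistics would be reported.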

Keywords