The Egyptian Journal of Language Engineering
Mubarak, H., Shaban, K., Forat, M. (2014). Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus. The Egyptian Journal of Language Engineering, 1(1), 24-41. doi: 10.21608/ejle.2014.59857

Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus

Article 3, Volume 1, Issue 1, January 2014, Pages 24-41
Document Type: Original Article
DOI: 10.21608/ejle.2014.59857
Authors
Hamdy Mubarak1; Kareem Shaban1; Mohamed Forat2
1Arabic NLP Research and Development Department, Sakhr Software Co.
2Arabic NLP Research and Development Department, Sakhr Software Co.
Abstract
Part-of-speech (POS) tagging is a basic component of many Natural Language Processing (NLP) applications. Building a manually tagged corpus helps in studying key statistics of a given language, which form the basis for POS tagging systems. In this paper, we present both lexical and morphological statistics for Arabic derived from Sakhr's manually POS-tagged corpus. The corpus covers text (7 M words) from a wide range of Arab countries in different domains over the years 2002-2004. The derived statistics are used as heuristics and preferential rules within a statistical diacritizer that achieves high accuracy in stem diacritization and POS disambiguation. The statistics include information on sentence and word lengths, punctuation marks, and the distribution of Arabic letters and diacritics, in addition to lexical and morphological information on POS distribution, stems, prefixes, suffixes, roots, morphological patterns, and morphosyntactic features such as gender, number, person, and case ending. Modern Standard Arabic (MSA) is studied by analyzing the coverage of stems, roots, morphological patterns, prefixes, and suffixes. Comparisons with an arbitrary English corpus are shown where applicable.
Keywords
Corpus Statistics; Arabic NLP; POS Tagging; Diacritization; MSA
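The coverage analysis mentioned in the abstract (how much of the running text is accounted for by the most frequent stems, roots, or patterns) can be illustrated with a minimal sketch. This is not the authors' method or tooling; the function names `coverage_curve` and `types_needed` and the toy stem list are hypothetical, chosen only to show the idea of ranking types by frequency and accumulating their share of tokens.

```python
from collections import Counter


def coverage_curve(tokens):
    """Rank distinct items by frequency and return cumulative token coverage.

    Each entry is (rank, item, fraction of all tokens covered by the
    top-`rank` items). Returns an empty list for empty input.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    curve = []
    running = 0
    for rank, (item, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        curve.append((rank, item, running / total))
    return curve


def types_needed(tokens, target):
    """Smallest number of top-frequency items whose tokens reach `target` coverage.

    `target` is a fraction in (0, 1]. Returns 0 for empty input.
    """
    for rank, _, covered in coverage_curve(tokens):
        if covered >= target:
            return rank
    return 0


# Hypothetical toy data: transliterated stems, NOT drawn from the Sakhr corpus.
toy_stems = ["kitAb", "kitAb", "kitAb", "qalam", "qalam", "bayt"]

if __name__ == "__main__":
    for rank, stem, covered in coverage_curve(toy_stems):
        print(f"top {rank} stems (adding {stem!r}) cover {covered:.0%} of tokens")
```

The same two functions would apply unchanged to lists of roots, patterns, prefixes, or suffixes, assuming the tagged corpus has already been reduced to one such label per word.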