Modern Standard Arabic Grammar Automatic Extraction from Penn Arabic Treebank Using Natural Language Toolkit

Document Type : Original Article

Authors

1 Phonetics and Linguistics Department, Faculty of Arts, Alexandria University

2 Head of the Department of Phonetics and Linguistics, Faculty of Arts, Alexandria University

Abstract

This paper presents a methodology for a rule-based, bottom-up parsing technique for Modern Standard Arabic (MSA) in the
Context-Free Grammar (CFG) formalism with a Phrase Structure Grammar (PSG) representation, where the grammar is
automatically extracted from a syntactically annotated corpus. The extracted grammar is used to automatically build a lexicon and
a grammar rules module. The extracted CFG is then transformed into a Probabilistic Context-Free Grammar (PCFG),
likewise computed automatically, that could be used in a hybrid approach. The corpus used is the Penn Arabic
Treebank (PATB), and the algorithm is implemented with the Natural Language Toolkit (NLTK). The parser
showed that automatic grammar extraction improved the grammar-building phase in both coverage of structures and time
needed, but the extracted grammar still needs manual constraints to be added. Automatic grammar extraction can enhance rule-based
parsers and will enable a new paradigm of statistically directed symbolic parsing.
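
The extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library so that it runs anywhere, the two bracketed trees are hypothetical toy examples (not PATB data, and only loosely transliterated), and the helper names parse_bracketed, productions and estimate_pcfg are invented for illustration. It assumes the PCFG probabilities are relative-frequency estimates, which the abstract does not state. NLTK offers comparable functionality through Tree.productions() and nltk.induce_pcfg.

```python
from collections import Counter

# Hypothetical toy treebank in Penn-style bracketing (NOT real PATB data).
TOY_TREEBANK = [
    "(S (VP (VBD kataba) (NP (NN Alwaladu)) (NP (NN Aldarsa))))",
    "(S (VP (VBD nAma) (NP (NN Alwaladu))))",
]


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def productions(tree):
    """Yield one CFG rule (lhs, rhs) per internal node; words are quoted."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else f"'{c}'" for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


trees = [parse_bracketed(t) for t in TOY_TREEBANK]
pcfg = estimate_pcfg(trees)

# Print the induced grammar: syntactic rules and lexical rules together.
for (lhs, rhs), prob in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)} [{prob:.3f}]")
```

Rules whose right-hand side is a quoted word form the lexicon, while the remaining rules form the grammar rules module. Dropping the probabilities gives the plain CFG, so both grammars come from a single pass over the treebank.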

Keywords