Analysis of Experiments on Statistical and Neural Parsing for a Morphologically Rich and Free Word Order Language Urdu

Research output: Contribution to journal › Article › Scientific › peer-review

9 Citations (Scopus)

Abstract

This article presents an analysis of experiments with statistical and neural parsing techniques for Urdu, a widely spoken South Asian language. We demonstrate state-of-the-art constituency parsing results for an Urdu treebank. Urdu is morphologically rich and is characterized by free word order. Language representation (e.g., input type, lemmatization, word clusters), the part-of-speech tag set, phrase labels, and the size of the training corpus are crucial for parsing such languages. In this article, probabilistic context-free grammars, data-oriented parsing, and recursive neural network based models are tested with several linguistic features that improve the parsing results. The features include syntactic sub-categorization of POS tags, empirically learned horizontal and vertical markovizations, and lexical head words. These features enable dependency information for case markers and add phrasal and lexical context to the parse trees. With gold POS tags in the test set, the data-oriented parsing and recursive neural network models give an f-score of 87.1; on textual input, they achieve f-scores of 83.4 and 84.2, respectively. To overcome the data sparsity caused by morphological richness, lemmatization and unsupervised word clustering have been performed. A treebank should cover the most probable word orders of the language so that models can learn the various orders accurately. To analyze the order coverage of the treebank and the learning capability of the different parsers, a test set conditioned on different word orders has been prepared. This test set is evaluated with the best-performing parsing models: with gold POS tags, f-scores are above 90, and on textual input, the average f-score is 87.6.
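
Two ideas named in the abstract, vertical markovization and the bracket-based f-score, can be illustrated with a minimal sketch. It is not the article's implementation: the tuple tree encoding, the function names `parent_annotate`, `labeled_spans`, and `bracket_f1`, the `^` label separator, and the toy trees with placeholder tokens `w1` to `w3` are all assumptions made for illustration. The sketch shows only order-2 vertical markovization (parent annotation) and a simplified labeled-bracket F1 that ignores POS tags.

```python
from collections import Counter

# Assumed tree encoding (hypothetical): a leaf is a word (str);
# an internal node is a tuple (label, child1, child2, ...).

def is_preterminal(children):
    """A preterminal is a POS tag dominating exactly one word."""
    return len(children) == 1 and isinstance(children[0], str)

def parent_annotate(tree, parent=None):
    """Order-2 vertical markovization: append the parent label to phrasal labels."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    # In this sketch the root and POS tags are left unannotated.
    if parent is None or is_preterminal(children):
        new_label = label
    else:
        new_label = f"{label}^{parent}"
    return (new_label, *[parent_annotate(child, label) for child in children])

def labeled_spans(tree, start=0):
    """Return (spans, end), where spans lists (label, start, end) for phrasal nodes."""
    if isinstance(tree, str):
        return [], start + 1
    label, *children = tree
    spans, pos = [], start
    for child in children:
        child_spans, pos = labeled_spans(child, pos)
        spans.extend(child_spans)
    if not is_preterminal(children):
        spans.append((label, start, pos))
    return spans, pos

def bracket_f1(gold, pred):
    """Harmonic mean of labeled-bracket precision and recall."""
    gold_spans = Counter(labeled_spans(gold)[0])
    pred_spans = Counter(labeled_spans(pred)[0])
    matched = sum((gold_spans & pred_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)

# Toy trees with placeholder tokens (not taken from the Urdu treebank).
gold = ("S", ("NP", ("NN", "w1")), ("VP", ("NP", ("NN", "w2")), ("VB", "w3")))
pred = ("S", ("NP", ("NN", "w1")), ("NP", ("NN", "w2")), ("VB", "w3"))

annotated = parent_annotate(gold)
score = bracket_f1(gold, pred)
```

In the annotated tree, the object `NP` becomes `NP^VP` while the subject `NP` becomes `NP^S`, which is the kind of phrasal context that a grammar read off the annotated trees can exploit. The toy prediction omits the `VP` bracket, so its precision is perfect while its recall is three out of four.
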
Original language: English
Pages (from-to): 161776-161793
Number of pages: 18
Journal: IEEE Access
Volume: 7
Publication status: Published - 2019
MoE publication type: A1 Journal article-refereed

Keywords

  • Urdu
  • free word-order
  • morphological-richness
  • statistical parsing
  • treebank
