This is an Open Access publication published under CSC-OpenAccess Policy.
Joint Alignment of Segmentation and Labelling for Arabic Morphosyntactic Taggers
Abdulrahman Alosaimy, Eric Atwell
Pages: 1-12 | Revised: 31-01-2018 | Published: 30-04-2018
Volume: 9 | Issue: 1 | Publication Date: April 2018
Arabic, POS-Tagging, Segmentation, Tokenisation, Morphological Alignment.
We present and compare three methods for aligning the morphemes produced by four different Arabic POS-taggers, as well as one baseline method that uses only the provided labels. We combined four Arabic POS-taggers: MADAMIRA (MA), Stanford Tagger (ST), AMIRA (AM), and Farasa (FA); as the target output we used two Classical Arabic gold standards: the Quranic Arabic Corpus (QAC) and the SALMA Standard Arabic Linguistics Morphological Analysis (SAL). We justify our choice to align on labels rather than on word forms. The problem is not trivial, because it involves reconciling six different tokenisation and labelling standards. Supervised learning with a unigram model achieved the best segment alignment accuracy, correctly aligning 97% of morpheme segments. We then evaluated the alignment methods extrinsically, by their effect on the accuracy of ensemble POS-taggers that merge different combinations of the four Arabic POS-taggers. With the best approach to aligning the input POS-taggers, the ensemble tagger correctly segmented and tagged 88.09% of morphemes. We show that increasing the number of input taggers raises accuracy, suggesting that the input taggers make different errors.
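The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of what label-based alignment with a unigram model could look like. It assumes a standard global-alignment dynamic program (Needleman-Wunsch style) scored by P(target label | source label); the function names `train_unigram` and `align`, the gap penalty, and the toy labels (CONJ, NOUN, DET, N) are hypothetical and are not taken from the paper or from any tagger's actual tagset.

```python
from collections import Counter, defaultdict


def train_unigram(pairs):
    """Estimate P(target_label | source_label) from already-aligned label pairs."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }


def align(src, tgt, model, gap=-0.5):
    """Globally align two label sequences (hypothetical scoring scheme).

    Returns a list of (src_index, tgt_index) pairs; None marks a morpheme
    that has no counterpart in the other segmentation.
    """
    n, m = len(src), len(tgt)

    def sim(a, b):
        # Rescale the unigram probability from [0, 1] to [-1, 1].
        return 2.0 * model.get(a, {}).get(b, 0.0) - 1.0

    # Fill the dynamic-programming table.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover the alignment.
    i, j, out = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1]):
            out.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            out.append((i - 1, None))
            i -= 1
        else:
            out.append((None, j - 1))
            j -= 1
    return out[::-1]


# Toy data: the source tagger leaves the determiner attached to the noun,
# while the target standard segments it as a separate morpheme.
model = train_unigram([("CONJ", "CONJ"), ("NOUN", "N")])
alignment = align(["CONJ", "NOUN"], ["CONJ", "DET", "N"], model)
```

In this toy case the target-side DET morpheme is left unaligned, while CONJ and the noun are paired across the two segmentations. Scoring on labels alone means the two sequences can be aligned even when the taggers normalise or segment the surface word form differently.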
Mr. Abdulrahman Alosaimy
University of Leeds - United Kingdom
Professor Eric Atwell
- United Kingdom