Compression-Based Parts-of-Speech Tagger for The Arabic Language
Ibrahim S. Alkhazi, William J. Teahan
Pages - 1 - 15 | Revised - 31-03-2019 | Published - 30-04-2019
KEYWORDS
Natural Language Processing, Arabic Part-of-Speech Tagger, Hidden Markov Model, Statistical Language Model.
ABSTRACT
This paper explores the use of compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger. The first models were trained using silver-standard data from two different Arabic POS taggers. The second model utilised the BAAC corpus, a manually annotated Modern Standard Arabic (MSA) corpus of 50K terms, on which the PPM tagger achieved an accuracy of 93.07%. The tag-based models were also utilised to evaluate the performance of the new tagger, by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the silver-standard models reduced the quality of the tag-based compression by an average of 0.43%, whereas the gold-standard model increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
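As a rough illustration of how a compression model can score a tag sequence, the sketch below computes a codelength in bits under a small order-1 model that uses PPMC-style escape estimates. It is not the authors' tagger: the model order, the escape method, the absence of exclusions, the tag names, and the class `Order1PPMC` are all assumptions made for this example. A lower codelength means the model finds the sequence more predictable.

```python
import math
from collections import Counter, defaultdict


class Order1PPMC:
    """Hypothetical static order-1 model with PPMC-style escapes (no exclusions)."""

    def __init__(self, alphabet):
        self.alphabet = sorted(set(alphabet))
        self.ctx = defaultdict(Counter)  # counts of a tag given the previous tag
        self.uni = Counter()             # order-0 counts of each tag

    def train(self, seq):
        prev = None
        for s in seq:
            if prev is not None:
                self.ctx[prev][s] += 1
            self.uni[s] += 1
            prev = s

    @staticmethod
    def _estimate(counts, s):
        # PPMC: p(s) = c(s) / (n + t), p(escape) = t / (n + t),
        # where n is the total count and t the number of distinct symbols seen.
        n, t = sum(counts.values()), len(counts)
        if n == 0:
            return None, 1.0          # empty context: always escape
        if s in counts:
            return counts[s] / (n + t), None
        return None, t / (n + t)

    def prob(self, prev, s):
        p = 1.0
        # Try order 1, then order 0, multiplying in the escape probabilities.
        for counts in (self.ctx.get(prev, Counter()), self.uni):
            hit, esc = self._estimate(counts, s)
            if hit is not None:
                return p * hit
            p *= esc
        # Order -1 fallback: uniform over the known tag alphabet.
        return p / len(self.alphabet)

    def codelength(self, seq):
        bits, prev = 0.0, None
        for s in seq:
            bits += -math.log2(self.prob(prev, s))
            prev = s
        return bits


# Toy tag inventory and training sequence, invented for illustration only.
model = Order1PPMC(["DET", "NOUN", "VERB", "ADJ"])
model.train(["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "DET", "NOUN"])

plausible = model.codelength(["DET", "NOUN", "VERB"])
implausible = model.codelength(["VERB", "VERB", "DET"])
print(f"plausible: {plausible:.2f} bits, implausible: {implausible:.2f} bits")
```

Because exclusions are omitted, the probabilities sum to at most one, so the codelengths are slightly pessimistic but still comparable between candidate sequences.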
Mr. Ibrahim S. Alkhazi
College of Computers & Information Technology
Tabuk University
Tabuk, Saudi Arabia
i.alkhazi@ut.edu.sa
Dr. William J. Teahan
School of Computer Science and Electronic Engineering
Bangor University
United Kingdom