Arabic Dialect Identification of Twitter Text Using PPM Compression
Mohammed Altamimi, William J. Teahan
Pages - 47 - 59     |    Revised - 31-10-2019     |    Published - 01-12-2019
Volume - 10   |   Issue - 4   |   Publication Date - December 2019
Arabic Dialect Identification, Data Compression, Machine Learning, Natural Language Processing.
This paper explores the use of the Prediction by Partial Matching (PPM) compression scheme for Arabic dialect identification of Twitter text. The PPMD variant of the scheme was applied with different model orders to perform the categorisation. We present experimental results for identifying the dialect of single tweets and of multiple author tweets from five major Arabic dialect regions (Gulf, Egyptian, Levantine, Maghrebi, and Iraqi), in addition to Modern Standard Arabic (MSA) and Classical Arabic (CA). We used the Bangor Twitter Arabic Corpus (BTAC), which we built for dialect research. For comparison, we also applied other machine learning algorithms, namely Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of the Support Vector Machine (LIBSVM), using several N-gram features. PPMD produced significantly better results than these algorithms, achieving 74.1% accuracy for single-tweet dialect identification and 87.1% for multiple-tweet dialect identification.
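The idea behind compression-based categorisation can be illustrated with a minimal Python sketch: train one character model per class, then assign a text to the class whose model would encode it in the fewest bits. This is not the authors' implementation. It assumes a static order-k model with PPMD-style escape probabilities, no symbol exclusions, no arithmetic coder (only the code length is estimated), and a fixed alphabet size for the lowest-order fallback. The class name `SimplePPMD`, the function `classify`, and the toy training strings are hypothetical.

```python
import math
from collections import defaultdict


class SimplePPMD:
    """Static order-k character model with PPMD-style escapes.

    Simplifications (assumptions of this sketch): no exclusions,
    no model updates while scoring, no actual arithmetic coding.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        # Assumed alphabet size for the order -1 (uniform) fallback.
        self.alphabet_size = alphabet_size
        # counts[context][symbol] -> frequency; contexts have length 0..order.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            for k in range(min(self.order, i) + 1):
                self.counts[text[i - k:i]][sym] += 1

    def symbol_bits(self, history, sym):
        """Estimated bits to encode sym after history, longest context first."""
        bits = 0.0
        for k in range(min(self.order, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue  # unseen context: fall back at no cost
            n = sum(seen.values())
            if sym in seen:
                # Method D: p(sym) = (2c - 1) / (2n)
                return bits - math.log2((2 * seen[sym] - 1) / (2 * n))
            # Method D: p(escape) = t / (2n), t = number of distinct symbols
            bits -= math.log2(len(seen) / (2 * n))
        # Order -1: uniform over the assumed alphabet.
        return bits + math.log2(self.alphabet_size)

    def cross_entropy(self, text):
        """Average bits per character needed to encode text."""
        total = 0.0
        for i, sym in enumerate(text):
            history = text[max(0, i - self.order):i]
            total += self.symbol_bits(history, sym)
        return total / max(len(text), 1)


def classify(text, models):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy data only; real use would train on labelled tweets per dialect.
models = {"class_a": SimplePPMD(order=2), "class_b": SimplePPMD(order=2)}
models["class_a"].train("abababababab")
models["class_b"].train("cdcdcdcdcd")
predicted = classify("ababab", models)
```

In this sketch a text containing characters or sequences a model has never seen incurs escape costs at each order, so its cross-entropy under that model rises and the better-matching class wins. Varying `order` corresponds to the "different orders" mentioned in the abstract.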
Mr. Mohammed Altamimi
College of Computer Science and Engineering, University of Hail, Saudi Arabia
Professor William J. Teahan
School of Computer Science and Electronic Engineering, University of Bangor, Bangor, United Kingdom