Arabic Dialect Identification of Twitter Text Using PPM Compression
Mohammed Altamimi, William J. Teahan
Pages - 47 - 59     |    Revised - 31-10-2019     |    Published - 01-12-2019
Volume - 10   Issue - 4    |    Publication Date - December 2019
Arabic Dialect Identification, Data Compression, Machine Learning, Natural Language Processing.
This paper explores the use of the Prediction by Partial Matching (PPM) compression scheme for Arabic dialect identification of Twitter text. The PPMD variant of the scheme was used with different model orders to perform the categorisation. We present experimental results for identifying single tweets and multiple author tweets from five major Arabic dialect regions (Gulf, Egyptian, Levantine, Maghrebi, and Iraqi), in addition to Modern Standard Arabic (MSA) and Classical Arabic (CA). We used the Bangor Twitter Arabic corpus (BTAC), which we built for dialect research. We also applied other machine learning algorithms, such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), using several N-gram features. PPMD shows significantly better results than the other machine learning algorithms, achieving 74.1% and 87.1% accuracy for single-tweet and multiple-tweet dialect identification, respectively.
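The compression-based categorisation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a character-level model with PPMD-style escape estimation (symbol probability (2c - 1)/2n, escape probability d/2n), omits the symbol exclusions a full PPM coder applies, and keeps each model static once trained. The class `SimplePPMD`, the function `classify`, and the default `alphabet_size` are hypothetical names and values chosen for illustration only. One model is trained per dialect, and a text is assigned to the label whose model encodes it in the fewest bits.

```python
import math
from collections import defaultdict


class SimplePPMD:
    """Simplified static character model with PPMD-style escapes.

    Assumptions (not from the paper): no symbol exclusions, no model
    updates during scoring, and a uniform fallback over `alphabet_size`
    symbols for characters never seen in training.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] -> how often symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, ch in enumerate(text):
            # Record the symbol under every context length from 0 to order.
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][ch] += 1

    def _bits(self, history, ch):
        bits = 0.0
        # Try the longest available context first, then back off.
        for k in range(min(self.order, len(history)), -1, -1):
            context = history[len(history) - k:]
            seen = self.counts.get(context)
            if not seen:
                continue  # context never seen in training: back off at no cost
            n = sum(seen.values())
            c = seen.get(ch, 0)
            if c:
                # Symbol found: probability (2c - 1) / 2n
                return bits - math.log2((2 * c - 1) / (2 * n))
            # Symbol not found: pay for an escape, probability d / 2n
            bits -= math.log2(len(seen) / (2 * n))
        # Fallback: uniform distribution over the assumed alphabet
        return bits + math.log2(self.alphabet_size)

    def cross_entropy_bits(self, text):
        """Total number of bits needed to encode `text` under this model."""
        return sum(
            self._bits(text[max(0, i - self.order):i], ch)
            for i, ch in enumerate(text)
        )


def classify(text, models):
    """Return the label whose model encodes `text` in the fewest bits."""
    return min(models, key=lambda label: models[label].cross_entropy_bits(text))


if __name__ == "__main__":
    # Toy training strings with placeholder labels, not real dialect data.
    models = {"label_a": SimplePPMD(order=2), "label_b": SimplePPMD(order=2)}
    models["label_a"].train("abababababab")
    models["label_b"].train("cdcdcdcdcd")
    print(classify("abab", models))
```

In this sketch the model order plays the role of the "different orders" mentioned in the abstract: raising `order` lets the model condition on longer character contexts, at the cost of sparser counts.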
Mr. Mohammed Altamimi
College of Computer Science and Engineering, University of Hail, Saudi Arabia
Professor William J. Teahan
School of Computer Science and Electronic Engineering, University of Bangor, Bangor, United Kingdom