Exploring Twitter as a Source of an Arabic Dialect Corpus
Areej Odah Alshutayri, Eric Atwell
Pages - 37 - 44     |    Revised - 30-04-2017     |    Published - 01-06-2017
Volume - 8   Issue - 2    |    Publication Date - June 2017
Dialectal Arabic, Phonological Variations, Social Media, Multi Dialect, Twitter, Tweet.
Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so social media text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
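As a rough illustration of labelling tweets by the sender's geographic location, the following minimal Python sketch maps a free-text location string to one of the five dialect groups named above. It is not the authors' implementation: the country lists, the `DIALECT_BY_COUNTRY` table, and the `label_by_location` function are all hypothetical, and the paper's own extraction and WEKA classification steps are not reproduced here.

```python
from typing import Optional

# Assumed, illustrative grouping of countries into the five dialect groups
# named in the abstract. These country lists are NOT taken from the paper.
DIALECT_BY_COUNTRY = {
    "saudi arabia": "Gulf",
    "kuwait": "Gulf",
    "qatar": "Gulf",
    "iraq": "Iraqi",
    "egypt": "Egyptian",
    "syria": "Levantine",
    "lebanon": "Levantine",
    "jordan": "Levantine",
    "morocco": "North African",
    "algeria": "North African",
    "tunisia": "North African",
}


def label_by_location(location: str) -> Optional[str]:
    """Return the dialect group for a free-text location, or None if unknown.

    Hypothetical helper: matches a known country name as a substring of the
    lower-cased location text.
    """
    text = location.strip().lower()
    for country, dialect in DIALECT_BY_COUNTRY.items():
        if country in text:
            return dialect
    return None


if __name__ == "__main__":
    for place in ["Jeddah, Saudi Arabia", "Cairo, Egypt", "Leeds, UK"]:
        print(place, "->", label_by_location(place))
```

A location-based label of this kind is only a proxy for dialect, since a sender's stated location need not match the dialect they write in.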
Mrs. Areej Odah Alshutayri
Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia, and School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom
Associate Professor Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom