This is an Open Access publication published under CSC-OpenAccess Policy.
Exploring Twitter as a Source of an Arabic Dialect Corpus
Areej Odah Alshutayri, Eric Atwell
Pages - 37 - 44     |    Revised - 30-04-2017     |    Published - 01-06-2017
Volume - 8   Issue - 2    |    Publication Date - June 2017
KEYWORDS
Dialectal Arabic, Phonological Variations, Social Media, Multi Dialect, Twitter, Tweet.
ABSTRACT
Given the lack of Arabic dialect text corpora compared with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so social media text is now an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
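As a rough illustration of labelling tweets by the sender's geographic location, the following Python sketch groups tweets into the five dialect groups named in the abstract. The country-to-dialect table, the function names, and the (text, location) input format are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical mapping from a sender's country to a dialect group.
# The five group names come from the abstract; the country lists are
# an illustrative assumption, not taken from the paper.
COUNTRY_TO_DIALECT = {
    "saudi arabia": "Gulf",
    "kuwait": "Gulf",
    "qatar": "Gulf",
    "iraq": "Iraqi",
    "egypt": "Egyptian",
    "syria": "Levantine",
    "lebanon": "Levantine",
    "jordan": "Levantine",
    "morocco": "North African",
    "algeria": "North African",
    "tunisia": "North African",
}


def dialect_for_location(location):
    """Return the dialect group for a comma-separated location string, or None."""
    # Match whole comma-separated parts so that, for example, "Romania"
    # is never confused with a country whose name it happens to contain.
    for part in location.split(","):
        dialect = COUNTRY_TO_DIALECT.get(part.strip().lower())
        if dialect is not None:
            return dialect
    return None


def group_by_dialect(tweets):
    """Group (text, location) pairs by dialect; skip unrecognised locations."""
    groups = defaultdict(list)
    for text, location in tweets:
        dialect = dialect_for_location(location)
        if dialect is not None:
            groups[dialect].append(text)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder tweets, not data from the corpus.
    sample = [
        ("tweet a", "Jeddah, Saudi Arabia"),
        ("tweet b", "Cairo, Egypt"),
        ("tweet c", "Leeds, United Kingdom"),
    ]
    print(group_by_dialect(sample))
```

Location-based labels of this kind are only a proxy for dialect, since a sender may write in a dialect other than that of their stated location; such labelled text would then serve as training input for a classifier such as those available in WEKA.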
Mrs. Areej Odah Alshutayri
Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia, and School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom
aalshetary@kau.edu.sa
Associate Professor Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom