A Comparative Evaluation of POS Tagging and N-Gram Measures in Arabic Corpus Resources and Tools
Sultan Almujaiwel
Pages - 1 - 17     |    Revised - 31-01-2020     |    Published - 29-02-2020
Volume - 11   Issue - 1    |    Publication Date - February 2020
KEYWORDS
Arabic Corpus Resources, Arabic Corpus Analysis Tools, Corpus Linguistics, Confusion Matrices, Association Algorithms.
ABSTRACT
The purpose of this evaluation is twofold: to survey the extent to which the large-scale Arabic corpus resources examined meet the criteria of parts-of-speech tagging in the corpus design of linguistic data, and to evaluate Arabic corpus analysis tools in terms of natural language processing statistics. The confusion matrix method shows that some Arabic monitor corpora need further development, whereas the International Corpus of Arabic scores highly on the confusion matrices. Nine Arabic corpus analysis tools are evaluated, and the statistical outcomes attested as reliable are retrieved by means of statistical algorithms for association measures. This is done by relying on one million empirically designated clean Arabic data to evaluate the association measures across the nine tools. The results presented at the end of this article indicate that the limitations could be tackled by evaluating the Arabic monitor corpus resources rather than trusting them, and by implementing new forms of programming rather than depending on the already-built Arabic natural language resources and tools.
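The two kinds of statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names, the tag labels, and the toy token list are assumptions made for illustration, and pointwise mutual information (PMI) stands in here as one common association measure, since the abstract does not say which measures were used. The first function derives precision, recall, and F1 for one POS tag from the confusion-matrix cells; the second computes PMI for an adjacent word pair.

```python
import math
from collections import Counter


def confusion_metrics(gold, predicted, tag):
    """Precision, recall and F1 for one POS tag (hypothetical helper).

    gold and predicted are equal-length lists of tag labels.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)  # true positives
    fp = sum(1 for g, p in pairs if g != tag and p == tag)  # false positives
    fn = sum(1 for g, p in pairs if g == tag and p != tag)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2).

    Uses the token count N as the normaliser for both unigram and
    bigram probabilities, a common simplification.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    if observed == 0:
        return float("-inf")  # the pair never occurs together
    expected = unigrams[w1] * unigrams[w2] / len(tokens)
    return math.log2(observed / expected)


# Toy data only; the labels and tokens are placeholders, not corpus output.
gold_tags = ["NOUN", "VERB", "NOUN", "PREP"]
tool_tags = ["NOUN", "NOUN", "NOUN", "PREP"]
noun_scores = confusion_metrics(gold_tags, tool_tags, "NOUN")

toy_tokens = ["a", "b", "a", "b", "c", "c"]
ab_score = pmi(toy_tokens, "a", "b")
```

Comparing a tool's reported association score with a value computed independently in this way is one simple check on whether the tool's statistics can be reproduced.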
Associate Professor Sultan Almujaiwel
College of Arts, Arabic Language Department, King Saud University, Riyadh, Saudi Arabia
salmujaiwel@ksu.edu.sa