This is an Open Access publication published under CSC-OpenAccess Policy.
A Comparative Evaluation of POS Tagging and N-Gram Measures in Arabic Corpus Resources and Tools
Sultan Almujaiwel
Pages - 1 - 17     |    Revised - 31-01-2020     |    Published - 29-02-2020
Volume - 11   Issue - 1    |    Publication Date - February 2020
KEYWORDS
Arabic Corpus Resources, Arabic Corpus Analysis Tools, Corpus Linguistics, Confusion Matrices, Association Algorithms.
ABSTRACT
The purpose of this evaluation is twofold: to assess the extent to which the large-scale Arabic corpus resources examined meet the criteria of part-of-speech (POS) tagging in the corpus design of linguistic data, and to evaluate Arabic corpus analysis tools in terms of natural language processing statistics. The confusion matrix statistical method shows that some Arabic monitor corpora need further development, whereas the International Corpus of Arabic scores highly on confusion matrices. Nine Arabic corpus analysis tools are evaluated, and the statistical outcomes attested as reliable are retrieved in terms of statistical algorithms for association measures. The association measures of the nine tools are evaluated by relying on one million empirically designated clean Arabic data. The results presented at the end of this article indicate that the limitations could be tackled by evaluating the Arabic monitor corpus resources rather than trusting them, and by implementing new forms of programming rather than depending on the already-built Arabic natural language resources and tools.
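The two kinds of statistic named in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation. The function names, the tag labels, and the choice of pointwise mutual information (PMI) as the association measure are all assumptions made here for illustration, since the abstract does not say which measures were used. The first part scores a POS tagger against gold-standard tags with a confusion matrix; the second computes an association score for a bigram.

```python
import math
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count (gold_tag, predicted_tag) pairs over two aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must be aligned token by token")
    return Counter(zip(gold, predicted))


def precision_recall(matrix, tag):
    """Per-tag precision and recall derived from the confusion matrix."""
    tp = matrix[(tag, tag)]
    # Predicted as `tag` although the gold tag differs.
    fp = sum(c for (g, p), c in matrix.items() if p == tag and g != tag)
    # Gold tag is `tag` but something else was predicted.
    fn = sum(c for (g, p), c in matrix.items() if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2).

    PMI is only one possible association measure; it is an assumption here.
    Returns -inf when the pair never occurs in `tokens`.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")
    p_joint = joint / (len(tokens) - 1)  # number of adjacent pairs
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    return math.log2(p_joint / (p_w1 * p_w2))


# Hypothetical toy data, not taken from any of the corpora evaluated.
gold_tags = ["NOUN", "VERB", "NOUN", "PREP", "NOUN"]
tool_tags = ["NOUN", "NOUN", "NOUN", "PREP", "VERB"]
matrix = confusion_matrix(gold_tags, tool_tags)
noun_precision, noun_recall = precision_recall(matrix, "NOUN")

toy_tokens = ["a", "b", "a", "b", "c"]
ab_score = pmi(toy_tokens, "a", "b")
```

Running the same gold-standard tags and the same token list through each tool's output in this way gives figures that can be compared across tools, which is the general shape of the comparison the abstract describes.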
Associate Professor Sultan Almujaiwel
College of Arts/Arabic Language Department, King Saud University, Riyadh, Saudi Arabia
salmujaiwel@ksu.edu.sa