This is an Open Access publication published under CSC-OpenAccess Policy.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Jonathan Samuel Sumamo, Solomon Teferra
Pages - 41 - 54 | Revised - 31-07-2018 | Published - 01-10-2018
Volume - 9 Issue - 2 | Publication Date - June 2018
KEYWORDS
Kambaata Stemmer, Rule-Based Stemmer, Stemming Algorithm, Kambaata Language.
ABSTRACT
Stemming is the process of reducing the inflectional and derivational variants of a word to a common stem. It is important in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is single-pass, context-sensitive, and longest-match, and was designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology whose word formation relies mostly on suffixation, although it also involves infixation, compounding, and reduplication.
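
The three properties named above (single pass, context sensitivity, longest match) can be illustrated with a minimal sketch. The rule table below uses English toy suffixes and a hypothetical minimum-stem-length condition purely for illustration; it is not the rule set of the Kambaata stemmer, and the names `RULES` and `stem` are assumptions introduced here.

```python
# Hypothetical rule table: (suffix, minimum length of the remaining stem).
# These are English toy suffixes for illustration, NOT Kambaata rules.
RULES = [("ations", 3), ("ation", 3), ("ing", 3), ("s", 3)]


def stem(word, rules=RULES):
    """Strip at most one suffix: the longest whose context condition holds."""
    # Longest match: try suffixes from longest to shortest.
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    for suffix, min_stem_len in ordered:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)]
            # Context sensitivity: accept the removal only if the
            # remaining stem satisfies the rule's condition.
            if len(candidate) >= min_stem_len:
                # Single pass: return after the first successful removal.
                return candidate
    # No rule applied; the word is returned unchanged.
    return word


print(stem("walking"))  # walk
print(stem("sing"))     # sing (removing "ing" would leave too short a stem)
print(stem("nations"))  # nation ("ations" is rejected, "s" is accepted)
```

In this sketch the context condition is only a length check; a real rule set would typically attach richer conditions to each suffix.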

The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, the error-counting method was applied to a test set of 2425 distinct words. Of these, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over-stemmed, and 13 words (0.54%) were under-stemmed. In addition, a dictionary reduction of 65.86% was achieved during evaluation.
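
The reported percentages follow directly from the counts, as the short sketch below shows. The `dictionary_reduction` function assumes the common definition (distinct words minus distinct stems, divided by distinct words); the abstract does not state the formula used, so that definition and the toy word lists are assumptions.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Counts taken from the evaluation reported in the abstract.
TOTAL_WORDS = 2425
counts = {"correct": 2349, "over_stemmed": 63, "under_stemmed": 13}
assert sum(counts.values()) == TOTAL_WORDS

rates = {label: percent(n, TOTAL_WORDS) for label, n in counts.items()}
print(rates)  # {'correct': 96.87, 'over_stemmed': 2.6, 'under_stemmed': 0.54}


def dictionary_reduction(words, stems):
    """Percentage drop in vocabulary size after stemming.

    Assumed definition: (distinct words - distinct stems) / distinct words.
    """
    n_words = len(set(words))
    n_stems = len(set(stems))
    return percent(n_words - n_stems, n_words)


# Toy data (hypothetical): four distinct words collapse to two stems.
print(dictionary_reduction(["a1", "a2", "b1", "b2"], ["a", "a", "b", "b"]))  # 50.0
```

The three rounded rates sum to 100.01% rather than 100% only because each is rounded independently.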

The main source of stemming errors for Kambaata words is the language's rich and complex morphology. A number of these errors can be corrected by adding more rules. However, it is difficult to eliminate errors completely, because the morphology makes use of concatenated suffixes and of irregularities arising from infixation, compounding, blending, and reduplication of affixes.
Mr. Jonathan Samuel Sumamo
Telecom Excellence Academy Ethio Telecom's Corporate University - Ethiopia
jimmyelove@gmail.com
Dr. Solomon Teferra
Faculty of Informatics/Department of Information Science, Addis Ababa University Addis Ababa - Ethiopia