Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Jonathan Samuel Sumamo, Solomon Teferra
Pages - 41 - 54 | Revised - 31-07-2018 | Published - 01-10-2018
Volume - 9 | Issue - 2 | Publication Date - June 2018
KEYWORDS
Kambaata Stemmer, Rule-Based Stemmer, Stemming Algorithm, Kambaata Language.
ABSTRACT
Stemming is the process of reducing the inflectional and derivational variants of a word to a common stem. It is important in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is single-pass, context-sensitive, and longest-match, and was designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology, whose word formation relies mostly on suffixation, although it also involves infixation, compounding and reduplication.
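The three properties named above (single-pass, context-sensitive, longest-match) can be illustrated with a minimal sketch. It is not the authors' rule set: the `Rule` type, the `stem` function, the minimum-stem-length condition, and the English placeholder suffixes are all assumptions chosen only to show the control flow, and no actual Kambaata suffixes are used.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """A hypothetical suffix-stripping rule with one context condition."""
    suffix: str        # suffix to strip
    min_stem_len: int  # context condition: shortest stem allowed after stripping


# Placeholder English suffixes for illustration only; these are NOT
# Kambaata rules and are not taken from the paper.
RULES: List[Rule] = [
    Rule("ingly", 3),
    Rule("ing", 3),
    Rule("s", 3),
]


def stem(word: str, rules: List[Rule] = RULES) -> str:
    """Strip at most one suffix from `word`.

    Longest-match: candidate suffixes are tried from longest to shortest.
    Context-sensitive: a suffix is stripped only if the remaining stem
    satisfies the rule's condition (here, a minimum length).
    Single-pass: the function returns after the first rule that applies.
    """
    for rule in sorted(rules, key=lambda r: len(r.suffix), reverse=True):
        if word.endswith(rule.suffix):
            candidate = word[: len(word) - len(rule.suffix)]
            if len(candidate) >= rule.min_stem_len:
                return candidate
    return word  # no rule applied; the word is left unchanged


if __name__ == "__main__":
    for w in ["seemingly", "walking", "cats", "sing"]:
        print(w, "->", stem(w))
```

In this sketch "sing" is left unchanged because stripping "ing" would leave a one-letter stem, which is the kind of check a context-sensitive rule uses to limit overstemming.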

The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, the error-counting method was applied to a test set of 2425 distinct words. Of these, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were overstemmed and 13 words (0.54%) were understemmed. A dictionary reduction of 65.86% was also achieved during evaluation.
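The reported percentages follow directly from the counts, as the short sketch below shows. The `dictionary_reduction` function assumes the common definition (distinct words minus distinct stems, divided by distinct words); the abstract does not state the formula the authors used, and the word-to-stem mapping in the example is a toy illustration, not data from the paper.

```python
from typing import Dict


def percentage(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Counts reported in the abstract.
TOTAL_WORDS = 2425
COUNTS = {"correct": 2349, "overstemmed": 63, "understemmed": 13}

RATES = {label: percentage(n, TOTAL_WORDS) for label, n in COUNTS.items()}


def dictionary_reduction(word_to_stem: Dict[str, str]) -> float:
    """Percentage reduction in vocabulary size after stemming.

    Assumed definition: 100 * (distinct words - distinct stems) / distinct words.
    """
    n_words = len(word_to_stem)
    n_stems = len(set(word_to_stem.values()))
    return 100.0 * (n_words - n_stems) / n_words


if __name__ == "__main__":
    # The three categories account for every word in the test set.
    assert sum(COUNTS.values()) == TOTAL_WORDS
    print(RATES)

    # Toy mapping (hypothetical): five distinct words conflate to two stems.
    toy = {"walk": "walk", "walks": "walk", "walking": "walk",
           "sing": "sing", "sings": "sing"}
    print(dictionary_reduction(toy))
```

The three rounded rates sum to 100.01% rather than 100% because each is rounded separately.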

The main source of errors in stemming Kambaata words is the language's rich and complex morphology. A number of errors could therefore be corrected by adding more rules. However, it is difficult to avoid errors completely, because the morphology makes use of concatenated suffixes and of irregularities arising from infixation, compounding, blending, and reduplication of affixes.
Mr. Jonathan Samuel Sumamo
Telecom Excellence Academy Ethio Telecom's Corporate University - Ethiopia
jimmyelove@gmail.com
Dr. Solomon Teferra
Faculty of Informatics/Department of Information Science, Addis Ababa University Addis Ababa - Ethiopia