This is an Open Access publication published under CSC-OpenAccess Policy.
Semantic Based Model for Text Document Clustering with Idioms
B. Drakshayani, E. V. Prasad
Pages - 1 - 13     |    Revised - 15-01-2013     |    Published - 28-02-2013
Volume - 4   Issue - 1    |    Publication Date - January / February 2013
Document Clustering, Idiom, POS Tagging, Semantic Weight, Semantic Grammar, Hierarchical Clustering Algorithm, Chameleon, Natural Language Processing
Text document clustering has become increasingly important in recent years because of the large volume of unstructured data available on the web, in social networks, and in other information networks. Clustering is a powerful data mining technique for organizing this information. Traditional document clustering methods do not consider the semantic structure of a document. This paper develops an effective and efficient method that captures the semantic structure of text documents. The method performs four steps: it tags the documents for parsing, replaces idioms with their original meanings, calculates semantic weights for the document words, and applies a semantic grammar. A similarity measure is then computed between the documents, and the documents are clustered with a hierarchical clustering algorithm. The method is evaluated on several data sets with standard performance measures, and the results indicate that it produces meaningful clusters.
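The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the idiom dictionary `IDIOMS` is hypothetical, plain term frequencies stand in for the paper's semantic weights, POS tagging and the semantic grammar are omitted, and average-link agglomerative clustering over cosine similarity is assumed as the hierarchical step.

```python
import math
import re
from collections import Counter

# Hypothetical idiom dictionary; the paper's actual idiom list is not reproduced here.
IDIOMS = {
    "kick the bucket": "die",
    "piece of cake": "easy",
    "under the weather": "ill",
}

def replace_idioms(text):
    """Replace each known idiom with its meaning (case-insensitive match)."""
    for idiom, meaning in IDIOMS.items():
        pattern = r"\b" + re.escape(idiom) + r"\b"
        text = re.sub(pattern, meaning, text, flags=re.IGNORECASE)
    return text

def term_vector(text):
    """Term frequencies after idiom replacement (a stand-in for semantic weights)."""
    return Counter(re.findall(r"[a-z]+", replace_idioms(text).lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, k):
    """Average-link agglomerative clustering; returns k lists of document indices."""
    vectors = [term_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters
```

Calling `cluster(docs, 2)` on a small list of sentences groups an idiomatic sentence with its plain paraphrase, because the idiom is rewritten before the vectors are built. Without the replacement step, "a piece of cake" and "easy" would share no terms and contribute nothing to the similarity.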
CITED BY (4)  
1 Brar, S., Mathur, D., Sharma, N., & Phagwara, P. Enhancement in Semantic based Model for Text Document Clustering.
2 Drakshayani, B., & Prasad, E. V. Hybrid Clustering Model for Text Documents with Semantic Based Document Representation.
3 Drakshayani, B., & Prasad, E. V. Metaphor based Document Representation Model for Text Document Clustering. In IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (pp. 74-78).
4 Suneetha, S., & Rani, M. U. Status Quo of Semantic-based Text Document Clustering: A Review.
Mr. B. Drakshayani
Lecturer in CME Govt. Polytechnic Nalgonda, 508001 - India
Dr. E. V. Prasad
Rector, JNTUK Kakinada - India