
This is an Open Access publication published under CSC-OpenAccess Policy.


Farthest Neighbor Approach for Finding Initial Centroids in K-Means

N. Sandhya, K. Anuradha, V. Sowmya, Ch. Vidyadhari

Pages - 1 - 13 | Revised - 10-08-2014 | Published - 15-09-2014

Published in International Journal of Data Engineering (IJDE)


KEYWORDS

Text Clustering, Partitional Approach, Initial Centroids, Similarity Measures, Cluster Accuracy.

ABSTRACT

Text document clustering is gaining popularity in the knowledge discovery field as a way to navigate, browse, and organize large amounts of textual information by grouping it into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data, and clustering is one of its most widely studied problems. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. Partitioning-based clustering algorithms traditionally choose their initial centroids at random, yet both the accuracy of the resulting clusters and the efficiency of the algorithm depend on that choice. In this work we propose a variant method that chooses the initial centroids using farthest neighbors. In our experiments, the k-means algorithm is run with initial centroids chosen in this way. The experimental results show that both the accuracy of the clusters and the efficiency of k-means improve compared with the traditional random choice of initial centroids.
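The abstract does not spell out the exact selection procedure, so the following is only a minimal sketch of one common reading of "farthest neighbors", namely a farthest-first traversal. Several choices here are assumptions and not taken from the paper: the function names `farthest_neighbor_centroids` and `kmeans` are hypothetical, the first centroid is taken to be the point farthest from the data mean, and Euclidean distance is used, whereas a text clustering setup may well use a different similarity measure such as cosine similarity over term vectors.

```python
import numpy as np


def farthest_neighbor_centroids(X, k):
    """Pick k rows of X as initial centroids by farthest-first traversal.

    Assumption (not from the paper): the first centroid is the point
    farthest from the data mean, and distances are Euclidean.
    """
    X = np.asarray(X, dtype=float)
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # Distance from every point to its nearest already-chosen centroid.
    nearest = np.linalg.norm(X - X[first], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(nearest))  # farthest from all chosen so far
        chosen.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]


def assign(X, centroids):
    """Return the index of the nearest centroid for each row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, init_centroids, max_iter=100):
    """Plain Lloyd-style k-means started from the given centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iter):
        labels = assign(X, centroids)
        updated = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(X, centroids), centroids
```

Calling `kmeans(X, farthest_neighbor_centroids(X, k))` replaces the random starting points with spread-out ones. The selection step is deterministic for a given data set, so repeated runs produce the same clustering, which is one practical difference from random initialization.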


Professor N. Sandhya

VNRVJIET - India

sandhyanadela@gmail.com

Professor K. Anuradha

Professor, CSE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad 500 090, India

Associate Professor V. Sowmya

Associate Professor, CSE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad 500 090, India

Associate Professor Ch. Vidyadhari

Asst. Prof., CSE, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad 500 090, India