Combining Approximate String Matching Algorithms and Term Frequency In The Detection of Plagiarism
Zina Balani, Cihan Varol
Pages - 97 - 105     |    Revised - 25-07-2021     |    Published - 31-08-2021
Volume - 15   Issue - 4    |    Publication Date - August 2021
Approximate, Hybrid, Plagiarism, Similarity, TFIDF.
One of the key factors behind plagiarism is the availability of a large amount of data and information on the internet that can be accessed rapidly. This increases the risk of academic fraud and intellectual property theft. As concern over plagiarism grows, more attention has been drawn towards automatic plagiarism detection. Hybrid algorithms are regarded as one of the most promising ways to detect similarity in natural language text or source code written by a student. This study investigates the applicability and success of combining the Levenshtein edit distance approximate string matching algorithm with term frequency-inverse document frequency (TF-IDF), thereby boosting the similarity rate measured using cosine similarity. The proposed hybrid algorithm is also able to detect plagiarism in natural language, source code, exact words, and disguised words. The developed algorithm can detect rearranged words, inter-textual similarity arising from insertion or deletion, and grammatical changes. In this research, three different datasets are used for testing: machine-generated paragraphs, mistyped words, and Java source code. Overall, the system proved to detect plagiarism better than the TF-IDF approach alone.
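The combination described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the two techniques are wired together, so this sketch assumes one plausible arrangement in which Levenshtein distance first maps near-miss tokens in the suspect text onto terms of the source text, and TF-IDF with cosine similarity is then computed on the normalized tokens. The function names, the `max_dist` threshold, the whitespace tokenizer, and the smoothed IDF formula are all illustrative assumptions.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(tokens, vocabulary, max_dist=2):
    """Assumed hybrid step: replace each unknown token with the closest
    vocabulary term if it lies within max_dist edits (hypothetical threshold)."""
    out = []
    for tok in tokens:
        if tok in vocabulary or not vocabulary:
            out.append(tok)
            continue
        best = min(vocabulary, key=lambda term: levenshtein(tok, term))
        out.append(best if levenshtein(tok, best) <= max_dist else tok)
    return out


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a list of tokenized documents.
    Uses a smoothed IDF so terms shared by all documents keep a nonzero weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(source: str, suspect: str, hybrid: bool = True) -> float:
    """Cosine similarity of TF-IDF vectors, optionally after Levenshtein
    normalization of the suspect tokens against the source vocabulary."""
    src = source.lower().split()
    sus = suspect.lower().split()
    if hybrid:
        sus = normalize(sus, set(src))
    u, v = tfidf_vectors([src, sus])
    return cosine(u, v)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    mistyped = "the quick brwn fox jumsp over the lazy dog"
    print("TF-IDF only:", similarity(original, mistyped, hybrid=False))
    print("Hybrid:     ", similarity(original, mistyped, hybrid=True))
```

In this toy example the two mistyped words are treated as unrelated terms by TF-IDF alone, which lowers the cosine score, whereas the Levenshtein step maps them back to "brown" and "jumps" so the hybrid score rises. A fixed edit threshold of 2 is crude for short words; a length-relative threshold may be a better choice in practice.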
Miss Zina Balani
Department of Software Engineering, Koya University, Koy sinjaq, 44023 - Iraq
Associate Professor Cihan Varol
Department of Computer Science, Sam Houston State University, Huntsville, TX 77341 - United States of America