This is an Open Access publication published under CSC-OpenAccess Policy.
Language Identifier for Languages of Pakistan Including Arabic and Persian
Qaiser Abbas, M. S. Ahmad, Sadia Niazi
Pages - 27 - 35     |    Revised - 30-11-2010     |    Published - 20-12-2010
Volume - 1   Issue - 3    |    Publication Date - December 2010
KEYWORDS
Identifier, Probabilistic, HAIL, Digrams, LIJ
ABSTRACT
A language recognizer (also called an identifier or guesser) is a basic application used to identify the language of a text document. It takes a file as input and, after processing its text, decides the language of the document with precision using LIJ-I, LIJ-II and LIJ-III. LIJ-I gives poor accuracy; LIJ-II improves on it, and LIJ-III raises accuracy further. The recognizer also calculates the probability of digrams and the average percentages of accuracy. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between them. A Java-based language recognizer is developed and presented in detail in this paper.
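The contrast between scoring against complete character sets and scoring against only the differing characters can be illustrated with a minimal Java sketch. This is not the paper's implementation: the class name LijSketch, its methods, and the tiny toy alphabets are all hypothetical, chosen only to show why characters shared by Arabic, Persian and Urdu add little evidence. The digram method uses plain relative frequency as one simple way to estimate digram probability, which is an assumption as well.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LijSketch {

    // Hypothetical toy data: a few letters common to all three languages.
    // These are small illustrative subsets, not the paper's character sets.
    static final String SHARED = "\u0627\u0628\u0631\u0633\u0645\u0646";

    static Map<String, Set<Character>> toyAlphabets() {
        Map<String, Set<Character>> m = new LinkedHashMap<>();
        m.put("Arabic", toSet(SHARED + "\u0629\u0643\u064A"));
        m.put("Persian", toSet(SHARED + "\u067E\u0686\u06AF\u06A9\u06CC"));
        m.put("Urdu", toSet(SHARED + "\u067E\u0686\u06AF\u06A9\u06CC\u0679\u0688\u0691\u06D2"));
        return m;
    }

    static Set<Character> toSet(String chars) {
        Set<Character> set = new HashSet<>();
        for (char c : chars.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Count how many characters of the text fall inside each language's set.
    static Map<String, Integer> score(String text, Map<String, Set<Character>> sets) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Character>> e : sets.entrySet()) {
            int hits = 0;
            for (char c : text.toCharArray()) {
                if (e.getValue().contains(c)) {
                    hits++;
                }
            }
            scores.put(e.getKey(), hits);
        }
        return scores;
    }

    // Drop the characters common to every language, keeping only the differences.
    static Map<String, Set<Character>> differencesOnly(Map<String, Set<Character>> sets) {
        Set<Character> common = null;
        for (Set<Character> s : sets.values()) {
            if (common == null) {
                common = new HashSet<>(s);
            } else {
                common.retainAll(s);
            }
        }
        Map<String, Set<Character>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Character>> e : sets.entrySet()) {
            Set<Character> diff = new HashSet<>(e.getValue());
            if (common != null) {
                diff.removeAll(common);
            }
            out.put(e.getKey(), diff);
        }
        return out;
    }

    // Relative frequency of each digram (pair of adjacent characters) in the text.
    static Map<String, Double> digramProbabilities(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = Math.max(text.length() - 1, 0);
        for (int i = 0; i + 1 < text.length(); i++) {
            counts.merge(text.substring(i, i + 2), 1, Integer::sum);
        }
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) total);
        }
        return probs;
    }

    public static void main(String[] args) {
        Map<String, Set<Character>> full = toyAlphabets();
        String text = "\u0628\u0627\u0631\u0679"; // three shared letters plus one Urdu-only letter
        System.out.println("full sets:   " + score(text, full));
        System.out.println("differences: " + score(text, differencesOnly(full)));
    }
}
```

With the full sets, every language scores on the three shared letters, so the scores sit close together. With only the differing characters, the shared letters contribute nothing and the single Urdu-only letter decides the result in this toy case.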
CITED BY (5)  
1 Abbas, Q. (2014, August). Semi-semantic part of speech annotation and evaluation. In Proceedings of ACL 8th Linguistic Annotation Workshop held in conjunction with COLING, Association for Computational Linguistics (pp. 75-81).
2 Abbas, Q. (2014). Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser (Doctoral dissertation).
3 Abbas, Q. (2014). A Stochastic Prediction Interface for Urdu. International Journal of Intelligent Systems and Applications (IJISA), 7(1), 94.
4 Khanam, M. H. experiments in probabilistic context free grammar for urdu language.
5 Abbas, Q. (2012). Building a hierarchical annotated corpus of Urdu: the URDU.KON-TB treebank. In Computational Linguistics and Intelligent Text Processing (pp. 66-79). Springer Berlin Heidelberg.
Mr. Qaiser Abbas
University of Sargodha - Pakistan
qaiser.abbas@uos.edu.pk
Mr. M. S. Ahmad
University of Sargodha - Pakistan
Mr. Sadia Niazi
University of Sargodha - Pakistan