Language Identifier for Languages of Pakistan Including Arabic and Persian

Qaiser Abbas, M. S. Ahmad, Sadia Niazi

Pages - 27 - 35 | Revised - 30-11-2010 | Published - 20-12-2010

Published in International Journal of Computational Linguistics (IJCL)

Volume - 1 | Issue - 3 | Publication Date - December 2010

KEYWORDS

Identifier, Probabilistic, HAIL, Digrams, LIJ

ABSTRACT

A language recognizer (also called an identifier or guesser) is a basic application that people use to identify the language of a text document. It takes a file as input and, after processing its text, decides the language of the document with precision using LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; accuracy improves with LIJ-II and is boosted to a still higher level with LIJ-III. The recognizer also helps in calculating the probability of digrams and the average percentages of accuracy. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between the character sets. A Java-based language recognizer is developed and presented in detail in this paper.
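
The abstract does not say how LIJ-I, LIJ-II, LIJ-III or the digram probabilities are actually computed, so the following is only a minimal Java sketch of the two ideas it names: scoring a text against the complete character set of each language, and scoring it against only the characters that differ between two languages. The class name, the method names and the toy Latin "alphabets" are all hypothetical and chosen for illustration; the relative-frequency estimate of digram probability is likewise an assumption, not the authors' stated formula.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not the authors' LIJ implementation.
public class CharSetLanguageSketch {

    // Collect the characters of a string into a set.
    static Set<Character> charSet(String chars) {
        Set<Character> set = new HashSet<>();
        for (char c : chars.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Fraction of the letters in the text that belong to the given character set.
    static double coverage(String text, Set<Character> alphabet) {
        int letters = 0;
        int hits = 0;
        for (char c : text.toCharArray()) {
            if (!Character.isLetter(c)) {
                continue; // ignore digits, spaces and punctuation
            }
            letters++;
            if (alphabet.contains(c)) {
                hits++;
            }
        }
        return letters == 0 ? 0.0 : (double) hits / letters;
    }

    // Characters that occur in set a but not in set b.
    static Set<Character> difference(Set<Character> a, Set<Character> b) {
        Set<Character> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Relative frequency of each pair of adjacent characters (digram) in the text.
    static Map<String, Double> digramProbabilities(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 1 < text.length(); i++) {
            counts.merge(text.substring(i, i + 2), 1, Integer::sum);
            total++;
        }
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            probabilities.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Toy alphabets that share most characters, as related scripts do.
        Set<Character> langA = charSet("abcd");
        Set<Character> langB = charSet("abce");
        String text = "abad";

        // Complete character sets: both languages cover most of the text.
        System.out.println("A (full set): " + coverage(text, langA));
        System.out.println("B (full set): " + coverage(text, langB));

        // Difference only: just the characters unique to each language count.
        System.out.println("A (difference): " + coverage(text, difference(langA, langB)));
        System.out.println("B (difference): " + coverage(text, difference(langB, langA)));

        System.out.println("Digrams: " + digramProbabilities(text));
    }
}
```

With overlapping alphabets, full-set coverage tends to be high for several languages at once, which is one plausible reading of why a method based on complete character sets would be less accurate than one based on their differences.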

CITED BY (5)

1	Abbas, Q. (2014, August). Semi-semantic part of speech annotation and evaluation. In Proceedings of ACL 8th Linguistic Annotation Workshop held in conjunction with COLING, Association of Computational Linguistics (pp. 75-81).

2	Abbas, Q. (2014). Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser (Doctoral dissertation).

3	Abbas, Q. (2014). A Stochastic Prediction Interface for Urdu. International Journal of Intelligent Systems and Applications (IJISA), 7(1), 94.

4	Khanam, M. H. Experiments in probabilistic context free grammar for Urdu language.

5	Abbas, Q. (2012). Building a hierarchical annotated corpus of Urdu: the URDU.KON-TB treebank. In Computational Linguistics and Intelligent Text Processing (pp. 66-79). Springer Berlin Heidelberg.

ABSTRACTING & INDEXING

1	Google Scholar

2	CiteSeerX

3	refSeek

4	Scribd

5	SlideShare

6	PdfSR

MANUSCRIPT AUTHORS

Mr. Qaiser Abbas

University of Sargodha - Pakistan

qaiser.abbas@uos.edu.pk

Mr. M. S. Ahmad

University of Sargodha - Pakistan

Mr. Sadia Niazi

University of Sargodha - Pakistan
