Call for Papers - Ongoing round of submission, notification and publication.
    
  
Home    |    Login or Register    |    Contact CSC
By Title/Keywords/Abstract   By Author
Browse CSC-OpenAccess Library.
  • HOME
  • LIST OF JOURNALS
  • AUTHORS
  • EDITORS & REVIEWERS
  • LIBRARIANS & BOOK SELLERS
  • PARTNERSHIP & COLLABORATION
Home   >   CSC-OpenAccess Library   >    Manuscript Information
Full Text Available
(no registration required)

(1005.09KB)


-- CSC-OpenAccess Policy
-- Creative Commons Attribution NonCommercial 4.0 International License
>> COMPLETE LIST OF JOURNALS

EXPLORE PUBLICATIONS BY COUNTRIES

EUROPE
MIDDLE EAST
ASIA
AFRICA
.............................
United States of America
United Kingdom
Canada
Australia
Italy
France
Brazil
Germany
Malaysia
Turkey
China
Taiwan
Japan
Saudi Arabia
Jordan
Egypt
United Arab Emirates
India
Nigeria
Improved Count Suffix Trees for Natural Language Data
Guido Sautter, Klemens Böhm
Pages - 1 - 27     |    Revised - 15-01-2012     |    Published - 21-02-2012
Published in International Journal of Data Engineering (IJDE)
Volume - 3   Issue - 1    |    Publication Date - February 2012  Table of Contents
MORE INFORMATION
References   |   Cited By (1)   |   Abstracting & Indexing
KEYWORDS
Selectivity Estimation, Count Suffix Tree, Query Optimization
ABSTRACT
With more and more text data stored in databases, the problem of handling natural language query predicates becomes highly important. Closely related to query optimization for these predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms before query execution based on small summary statistics. The Count Suffix Trees (CST) is the data structure commonly used to address this problem. While selectivity estimates based on CST tend to be good, they are computationally expensive to build and require a large amount of memory for storage. To fit CST into the data dictionary of database systems, they have to be pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that reduce the building cost and the size of CST over natural-language texts. The core idea is to exploit the features of the natural language data over which the CST is built. In particular, we aim at regarding only those suffixes that are useful in a linguistic sense. We use (wellknown) IR techniques to identify them. The most important innovations are as follows: (a) We propose and use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix and prefix stripping procedure that is more aggressive than conventional stemming techniques, which are commonly used to reduce the size of indices. (c) We observe that misspellings and other language anomalies like foreign words incur an over-proportional growth of the CST. We apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection mechanism to filter out such substrings. – Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80%, already during construction, and at the same time increase the accuracy of selectivity estimates computed from the final CST by up to 70%.
CITED BY (1)  
1 Tahir, d. n. international journal of data engineering (ijde).
ABSTRACTING & INDEXING
1 Google Scholar 
2 CiteSeerX 
3 Scribd 
4 SlideShare 
5 PdfSR 
REFERENCES
D. D. Lewis. Reuters-21578 [online] Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/.
D. Graff. “The Aquaint corpus of english news text”. Linguistic Data Consortium, Philadelphia, 2002.
D. W. Cummings. “American English spelling: an informal description”. Johns Hopkins University Press, 1988.
E. M. McCreight. ”A space-economical suffix tree construction algorithm”. J. Assoc. Comput. Mach., 23(2): 262–272, 1976.
E. M. Zamora, J. Pollock, and A. Zamora. “The use of trigram analysis for spelling error detection”. Information of Processing and Management, 17: 305–316, 1981.
E. Ukkonen. “On-line construction of suffix trees”. Algorithmica, 14(3): 249–260, 1995.
F. M. Lian. “Word hy-phen-a-tion by com-put-er”. PhD thesis, Stanford University, Stanford, August 1983.
G. Sautter, C. Abba, K. Böhm. “Improved Count Suffix Trees for Natural Language Data”. In Proceedings of IDEAS 2008, Coimbra, Portugal, 2008
H. Jagadish, O. Kapitskaia, and D. Srivastava. “One-dimensional and multi-dimensional substring selectivity estimation”. The International Journal on Very Large Data Bases, 9(3): 214–230, 2000.
J. B. Lovins. “Development of a stemming algorithm”. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.
J. Bae and S. Lee. “Substring count estimation in extremely long strings”. IEICE – Transactions in Information Systems, E89-D(3):1148–1156, 2006.
K. Kukich. “Techniques for automatically correcting words in text”. ACM Computer Surveys, 24:379–439, 1992.
M. F. Porter. ”An algorithm for suffix stripping”. Program, 14(3):130–137, 1980.
M. Lennon, D. Pierce, B. Tarry, and P. Willett. “An evaluation of some conflation algorithms for information retrieval”. Journal of Information Science, 3(177–183), 1981.
P. Krishnan, J. S. Vitter, and B. Iyer. “Estimating alphanumeric selectivity in the presence of wildcards”. In ACM SIGMOD International Conference on Management of Data, pages 12–13. ACM, 1996.
P. Weiner. “Linear pattern matching algorithms”. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.
Personal communication with Torsten Grabs, Microsoft SQL Server development team.
R. Baeza-Yates and B. Ribeiro-Neto. “Modern information retrieval”. Addison-Wesley Longman, 1. print. edition, 1999.
R. Giegerich, S. Kurtz, J. Stoye. “Efficient Implementation of Lazy Suffix Trees”. Software: Practice and Experience, Volume 33, No 11, John Wiley & Sons Ltd., 2003
R. Krovetz. “Viewing morphology as an inference process”. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, 1993.
S. Bressan and R. Irawan. “Morphologic non-word error detection”. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA ’04), pages 31–35, 2004.
S. Chaudhuri, V. Ganti, and L. Gravano. “Selectivity estimation for string predicates: Overcoming the underestimation problem”. In Proceedings of ICDE 2004, Boston, MA, USA, 2004.
Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. “Practical methods for constructing suffix trees”. The International Journal on Very Large Data Bases, 14(3): 281–299, 2005.
Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. “Selectivity Estimation for Boolean Queries”. In Proceedings of PODS 2000, Dallas, TX, USA, 2000
MANUSCRIPT AUTHORS
Dr. Guido Sautter
KIT - Germany
sautter@ipd.uka.de
Professor Klemens Böhm
Karlsruhe Institute of Technology - Germany


CREATE AUTHOR ACCOUNT
 
LAUNCH YOUR SPECIAL ISSUE
View all special issues >>
 
PUBLICATION VIDEOS
 
You can contact us anytime since we have 24 x 7 support.
Join Us|List of Journals|
    
Copyrights © 2025 Computer Science Journals (CSC Journals). All rights reserved. Privacy Policy | Terms of Conditions