Home   >   CSC-OpenAccess Library   >    Manuscript Information
Improved Count Suffix Trees for Natural Language Data
Guido Sautter, Klemens Böhm
Pages - 1 - 27     |    Revised - 15-01-2012     |    Published - 21-02-2012
Volume - 3   Issue - 1    |    Publication Date - February 2012  Table of Contents
MORE INFORMATION
KEYWORDS
Selectivity Estimation, Count Suffix Tree, Query Optimization
ABSTRACT
With more and more text data stored in databases, the problem of handling natural language query predicates becomes highly important. Closely related to query optimization for these predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms before query execution based on small summary statistics. The Count Suffix Trees (CST) is the data structure commonly used to address this problem. While selectivity estimates based on CST tend to be good, they are computationally expensive to build and require a large amount of memory for storage. To fit CST into the data dictionary of database systems, they have to be pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that reduce the building cost and the size of CST over natural-language texts. The core idea is to exploit the features of the natural language data over which the CST is built. In particular, we aim at regarding only those suffixes that are useful in a linguistic sense. We use (wellknown) IR techniques to identify them. The most important innovations are as follows: (a) We propose and use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix and prefix stripping procedure that is more aggressive than conventional stemming techniques, which are commonly used to reduce the size of indices. (c) We observe that misspellings and other language anomalies like foreign words incur an over-proportional growth of the CST. We apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection mechanism to filter out such substrings. – Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80%, already during construction, and at the same time increase the accuracy of selectivity estimates computed from the final CST by up to 70%.
CITED BY (1)  
1 Tahir, d. n. international journal of data engineering (ijde).
1 Google Scholar 
2 CiteSeerX 
3 Scribd 
4 SlideShare 
5 PdfSR 
D. D. Lewis. Reuters-21578 [online] Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/.
D. Graff. “The Aquaint corpus of english news text”. Linguistic Data Consortium, Philadelphia, 2002.
D. W. Cummings. “American English spelling: an informal description”. Johns Hopkins University Press, 1988.
E. M. McCreight. ”A space-economical suffix tree construction algorithm”. J. Assoc. Comput. Mach., 23(2): 262–272, 1976.
E. M. Zamora, J. Pollock, and A. Zamora. “The use of trigram analysis for spelling error detection”. Information of Processing and Management, 17: 305–316, 1981.
E. Ukkonen. “On-line construction of suffix trees”. Algorithmica, 14(3): 249–260, 1995.
F. M. Lian. “Word hy-phen-a-tion by com-put-er”. PhD thesis, Stanford University, Stanford, August 1983.
G. Sautter, C. Abba, K. Böhm. “Improved Count Suffix Trees for Natural Language Data”. In Proceedings of IDEAS 2008, Coimbra, Portugal, 2008
H. Jagadish, O. Kapitskaia, and D. Srivastava. “One-dimensional and multi-dimensional substring selectivity estimation”. The International Journal on Very Large Data Bases, 9(3): 214–230, 2000.
J. B. Lovins. “Development of a stemming algorithm”. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.
J. Bae and S. Lee. “Substring count estimation in extremely long strings”. IEICE – Transactions in Information Systems, E89-D(3):1148–1156, 2006.
K. Kukich. “Techniques for automatically correcting words in text”. ACM Computer Surveys, 24:379–439, 1992.
M. F. Porter. ”An algorithm for suffix stripping”. Program, 14(3):130–137, 1980.
M. Lennon, D. Pierce, B. Tarry, and P. Willett. “An evaluation of some conflation algorithms for information retrieval”. Journal of Information Science, 3(177–183), 1981.
P. Krishnan, J. S. Vitter, and B. Iyer. “Estimating alphanumeric selectivity in the presence of wildcards”. In ACM SIGMOD International Conference on Management of Data, pages 12–13. ACM, 1996.
P. Weiner. “Linear pattern matching algorithms”. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.
Personal communication with Torsten Grabs, Microsoft SQL Server development team.
R. Baeza-Yates and B. Ribeiro-Neto. “Modern information retrieval”. Addison-Wesley Longman, 1. print. edition, 1999.
R. Giegerich, S. Kurtz, J. Stoye. “Efficient Implementation of Lazy Suffix Trees”. Software: Practice and Experience, Volume 33, No 11, John Wiley & Sons Ltd., 2003
R. Krovetz. “Viewing morphology as an inference process”. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, 1993.
S. Bressan and R. Irawan. “Morphologic non-word error detection”. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA ’04), pages 31–35, 2004.
S. Chaudhuri, V. Ganti, and L. Gravano. “Selectivity estimation for string predicates: Overcoming the underestimation problem”. In Proceedings of ICDE 2004, Boston, MA, USA, 2004.
Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. “Practical methods for constructing suffix trees”. The International Journal on Very Large Data Bases, 14(3): 281–299, 2005.
Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. “Selectivity Estimation for Boolean Queries”. In Proceedings of PODS 2000, Dallas, TX, USA, 2000
Dr. Guido Sautter
KIT - Germany
sautter@ipd.uka.de
Professor Klemens Böhm
Karlsruhe Institute of Technology - Germany