Home   >   CSC-OpenAccess Library   >    Manuscript Information
Full Text Available

This is an Open Access publication published under CSC-OpenAccess Policy.
Improved Count Suffix Trees for Natural Language Data
Guido Sautter, Klemens Böhm
Pages - 1 - 27     |    Revised - 15-01-2012     |    Published - 21-02-2012
Volume - 3   Issue - 1    |    Publication Date - February 2012  Table of Contents
Selectivity Estimation, Count Suffix Tree, Query Optimization
With more and more text data stored in databases, the problem of handling natural language query predicates becomes highly important. Closely related to query optimization for these predicates is the (sub)string estimation problem, i.e., estimating the selectivity of query terms before query execution based on small summary statistics. The Count Suffix Trees (CST) is the data structure commonly used to address this problem. While selectivity estimates based on CST tend to be good, they are computationally expensive to build and require a large amount of memory for storage. To fit CST into the data dictionary of database systems, they have to be pruned severely. Pruning techniques proposed so far are based on term (suffix) frequency or on the tree depth of nodes. In this paper, we propose new filtering and pruning techniques that reduce the building cost and the size of CST over natural-language texts. The core idea is to exploit the features of the natural language data over which the CST is built. In particular, we aim at regarding only those suffixes that are useful in a linguistic sense. We use (wellknown) IR techniques to identify them. The most important innovations are as follows: (a) We propose and use a new optimistic syllabification technique to filter out suffixes. (b) We introduce a new affix and prefix stripping procedure that is more aggressive than conventional stemming techniques, which are commonly used to reduce the size of indices. (c) We observe that misspellings and other language anomalies like foreign words incur an over-proportional growth of the CST. We apply state-of-the-art trigram techniques as well as a new syllable-based non-word detection mechanism to filter out such substrings. – Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80%, already during construction, and at the same time increase the accuracy of selectivity estimates computed from the final CST by up to 70%.
CITED BY (1)  
1 Tahir, d. n. international journal of data engineering (ijde).
1 Google Scholar
2 CiteSeerX
3 Scribd
4 SlideShare
5 PdfSR
1 J. Bae and S. Lee. “Substring count estimation in extremely long strings”. IEICE – Transactions in Information Systems, E89-D(3):1148–1156, 2006.
2 R. Baeza-Yates and B. Ribeiro-Neto. “Modern information retrieval”. Addison-Wesley Longman, 1. print. edition, 1999.
3 S. Bressan and R. Irawan. “Morphologic non-word error detection”. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA ’04), pages 31–35, 2004.
4 S. Chaudhuri, V. Ganti, and L. Gravano. “Selectivity estimation for string predicates: Overcoming the underestimation problem”. In Proceedings of ICDE 2004, Boston, MA, USA, 2004.
5 D. W. Cummings. “American English spelling: an informal description”. Johns Hopkins University Press, 1988.
6 D. Graff. “The Aquaint corpus of english news text”. Linguistic Data Consortium, Philadelphia, 2002.
7 H. Jagadish, O. Kapitskaia, and D. Srivastava. “One-dimensional and multi-dimensional substring selectivity estimation”. The International Journal on Very Large Data Bases, 9(3): 214–230, 2000.
8 P. Krishnan, J. S. Vitter, and B. Iyer. “Estimating alphanumeric selectivity in the presence of wildcards”. In ACM SIGMOD International Conference on Management of Data, pages 12–13. ACM, 1996.
9 R. Krovetz. “Viewing morphology as an inference process”. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, 1993.
10 K. Kukich. “Techniques for automatically correcting words in text”. ACM Computer Surveys, 24:379–439, 1992.
11 M. Lennon, D. Pierce, B. Tarry, and P. Willett. “An evaluation of some conflation algorithms for information retrieval”. Journal of Information Science, 3(177–183), 1981.
12 D. D. Lewis. Reuters-21578 [online] Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/.
13 F. M. Lian. “Word hy-phen-a-tion by com-put-er”. PhD thesis, Stanford University, Stanford, August 1983.
14 J. B. Lovins. “Development of a stemming algorithm”. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.
15 E. M. McCreight. ”A space-economical suffix tree construction algorithm”. J. Assoc. Comput. Mach., 23(2): 262–272, 1976.
16 M. F. Porter. ”An algorithm for suffix stripping”. Program, 14(3):130–137, 1980.
17 Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. “Practical methods for constructing suffix trees”. The International Journal on Very Large Data Bases, 14(3): 281–299, 2005.
18 E. Ukkonen. “On-line construction of suffix trees”. Algorithmica, 14(3): 249–260, 1995.
19 P. Weiner. “Linear pattern matching algorithms”. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.
20 E. M. Zamora, J. Pollock, and A. Zamora. “The use of trigram analysis for spelling error detection”. Information of Processing and Management, 17: 305–316, 1981.
21 Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. “Selectivity Estimation for Boolean Queries”. In Proceedings of PODS 2000, Dallas, TX, USA, 2000
22 R. Giegerich, S. Kurtz, J. Stoye. “Efficient Implementation of Lazy Suffix Trees”. Software: Practice and Experience, Volume 33, No 11, John Wiley & Sons Ltd., 2003
23 G. Sautter, C. Abba, K. Böhm. “Improved Count Suffix Trees for Natural Language Data”. In Proceedings of IDEAS 2008, Coimbra, Portugal, 2008
24 Personal communication with Torsten Grabs, Microsoft SQL Server development team.
Dr. Guido Sautter
KIT - Germany
Professor Klemens Böhm
Karlsruhe Institute of Technology - Germany