Home   >   CSC-OpenAccess Library   >    Manuscript Information
Full Text Available

(156.5KB)
This is an Open Access publication published under CSC-OpenAccess Policy.
Publications from CSC-OpenAccess Library are being accessed from over 74 countries worldwide.
Indexing for Large DNA Database sequences
Samer Wohoush, Mahmoud Saheb
Pages - 202 - 215     |    Revised - 01-09-2011     |    Published - 05-10-2011
Volume - 5   Issue - 4    |    Publication Date - September / October 2011  Table of Contents
MORE INFORMATION
KEYWORDS
Large Database, DNA Sequence, Index Structure, Sequence Transformation, Wavelet Transformation, RDMS Indexing
ABSTRACT
Bioinformatics data consists of a huge amount of information due to the large number of sequences, the very high sequences lengths and the daily new additions. This data need to be efficiently accessed for many needs. What makes one DNA data item distinct from another is its DNA sequence. DNA sequence consists of a combination of four characters which are A, C, G, T and have different lengths. Use a suitable representation of DNA sequences, and a suitable index structure to hold this representation at main memory will lead to have efficient processing by accessing the DNA sequences through indexing, and will reduce number of disk I/O accesses. I/O operations needed at the end, to avoid false hits, we reduce the number of candidate DNA sequences that need to be checked by pruning, so no need to search the whole database. We need to have a suitable index for searching DNA sequences efficiently, with suitable index size and searching time. The suitable selection of relation fields, where index is build upon has a big effect on index size and search time. Our experiments use the n-gram wavelet transformation upon one field and multi-fields index structure under the relational DBMS environment. Results show the need to consider index size and search time while using indexing carefully. Increasing window size decreases the amount of I/O reference. The use of a single field and multiple fields indexing is highly affected by window size value. Increasing window size value lead to better searching time with special type index using single filed indexing. While the search time is almost good and the same with most index types when using multiple field indexing. Storage space needed for RDMS indexing types are almost the same or greater than the actual data.
1 Google Scholar 
2 Academic Journals Database 
3 CiteSeerX 
4 refSeek 
5 Bielefeld Academic Search Engine (BASE) 
6 Scribd 
7 SlideShare 
8 PdfSR 
1 Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases. Ozgur Ozturk Hakan Ferhatosmanoglu bibe, pp.359, Third IEEE Symposium on BioInformatics and BioEngineering (BIBE'03), 2003.
2 An efficient similarity search based on indexing in large DNA databases, In-Seon Jeong, Kyoung-Wook Park, Seung-Ho Kang, Hyeong-Seok Lim, 2010.
3 An Efficient Index Structure for String Databases. Tamer Kahveci Ambuj K. Singh Department of Computer Science, University of California Santa Barbara, CA 93106 {amer,ambuj}cs.ucsb.edu, 2001.
4 Fast Dynamic Programming Based Sequence Alignment Algorithm. Nur'Aini Abdul Rashid', Rosni Abdullah, Abdullah Zawawi Haji Talib, Zalila Ali, IEEE, 2006.
5 MAP: Searching Large Genome Databases. T. Kahveci, A. Singh Pacific Symposium on Biocomputing 8:303-314(2003).
6 Indexing and retrieval for genomic database.Hugh E. Williams, Member, IEEE, and Justin Zobel, Member, IEEE Computer Society, IEEE, 2002.
7 S. Muthukrishnan and S. C. Sahinalp. Approximate nearest neighbor and sequence comparison with block operations, 2000.
8 CoMRI: A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries. Hong Sun1, Ozgur Ozturk1, Hakan Ferhatosmano glu, IEEE, 2003.
9 E. Giladi et al., SST: An Algorithm for Finding Near-Exact Sequence Matches in Time Proportional to the Logarithm of the Database Size. Bioinformatics 18, 873877, 2002.
10 An Efficient Approach for Building Compressed Full-text Index for structured Data: Jun Liang, Lin Xiao, Di Zhang IEEE, 2009.
11 Efficient Maintenance Schema of Inverted Index for Large-scale Full-Text Retrieval, Xiaozhu Liu, State Key Lab of Software Engineering Wuhan University Wuhan 430072, China , School of Automation Wuhan University of Technology IEEE, 2010.
12 Mathematical Extension of Full Text Search Engine, Jozef Misutka, Leo Galambos, Department of Software Engineering, Charles University in Prague, Ke Karlovu 3, 121 16 Prague, Czech Republic, 2008.
13 Experimental Simulation on Incremental Three-gram Index for Two-gram Full-Text Search Systems, Hiroshi Yamamoto Seishiro Ohmi Hiroshi Tsuji IEEE, 2003.
14 A Compact Memory Space of Dynamic Full-Text Search using Bi-Gram Index, El-Sayed Atlam, El-Marhomy Ghada, Masao Fuketa, Kazuhiro Morita and Jun-ichi Aoe, Department of Information Science and Intelligent Systems, University of Tokushima Tokushima,770-8506, Japan 2004.
15 Breaking a Time-and-Space Barrier in Constructing Full-Text Indices, Wing-Kai Hon, Kunihiko Sadakane_ Wing-Kin Sung IEEE, 2003.
16 Parallel Selection Query Processing Involving Index in Parallel Database Systems. J. Wenny Rahayu David Taniar, IEEE, 2002.
17 An Architecture for Parallel Search of Large, Full-text Databases, Nassrin Tavakoli and Hassan Modaress-Razavi, Department of Computer Science, The University of North Carolina at Charlotte, Charlotte, NC 28223 IEEE, 1990.
18 An Ontology Enhanced Development Kit for Full Text Search, Su Jian, Weng Wenyong, Wang Zebing, Lab of Digital City & Electronic Service, Zhejiang University City College, Hangzhou 310015, China IEEE, 2009.
19 Alexander Rubin, Senior Consultant, MySQL AB, Full Text Search in MySQL 5.1 New Features and HowTo, http://www.mysqlfulltextsearch.com/full_text.pdf, 2006.
20 Moshe Shadmon, The ScaleDB Storage Engine, http://www.scaledb.com/pdfs/ScaleDB_MySQL_Preso2009.ppt, 2009.
21 A Hybird Method for Efficient Indexing of XML Documents. Sun Wei, Da-xin Lui, IEEE, 2005.
22 The SBCTree: An Index for RunLength Compressed Sequences, Mohamed Y. Eltabakh ,Wing-Kai Hon, Rahul Shah, Walid G. Aref, Jeffrey S. Vitter Purdue University, 2008, 2008
23 Efficient Filtration of Sequence Similarity Search Through Singular Value Decomposition. S. Alireza Aghili Ozgur D. Sahin Divyakant Agrawal Amr El Abbadi, IEEE 2004.
Mr. Samer Wohoush
Palestine Polytechnic University - Palestinian Occupied Territori
samer_wh@ppu.edu
Associate Professor Mahmoud Saheb
Palestine Polytechnic University - Palestinian Occupied Territori