Analysis of n-Gram based promoter recognition methods and application to whole genome promoter prediction

Date
2009-07-08
Authors
Rani, T. Sobha
Bapi, Raju S.
Abstract
Promoter prediction is an important and complex problem. Pattern recognition algorithms typically require features that can capture this complexity. Promoter sequences may be biased towards certain combinations of base pairs. To determine these biases, n-grams are usually extracted and analyzed. An n-gram is a run of n contiguous characters from a given character stream, here a DNA sequence segment. This work presents a systematic study of the efficacy of n-grams for n = 2, 3, 4, 5 in promoter prediction, using n-grams as features for a neural network classifier on E. coli and Drosophila promoters. n = 3 for E. coli and n = 4 for Drosophila seem to give optimal prediction values. Using the 3-gram features, promoters are predicted in the E. coli genome sequence. The results are encouraging in terms of positive identification of promoters in the genome compared to software packages such as BPROM, NNPP, and SAK. Whole genome promoter prediction was also performed for the Drosophila genome, using 4-gram features. © 2009 IOS Press. All rights reserved.
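To make the n-gram definition concrete, the following is a minimal Python sketch of how n-gram frequency features might be extracted from a DNA segment. It is an illustration only, not the authors' implementation: the function name `ngram_features`, the four-letter alphabet, the overlapping sliding window, and the normalization to relative frequencies are all assumptions that the abstract does not specify.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; windows with other symbols (e.g. N) are ignored


def ngram_features(sequence, n):
    """Return relative frequencies of all 4**n n-grams in lexicographic order.

    Hypothetical helper: overlapping windows of length n are counted and
    normalized, giving a fixed-length vector (16, 64, 256, 1024 entries
    for n = 2, 3, 4, 5) that could serve as input to a classifier.
    """
    seq = sequence.upper()
    # Count every overlapping window of n contiguous characters.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    # Normalize over valid n-grams only, so the vector sums to 1.
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


features = ngram_features("ACGTACGT", 3)
print(len(features))
```

For n = 3 the vector has 64 entries, one per possible 3-gram. A fixed-length representation like this is convenient because sequences of different lengths map to inputs of the same size.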
Keywords
Binary classification, Biological data sets, Cascaded classifiers, In silico method for promoter prediction, Machine learning method, Neural networks
Citation
In Silico Biology. v.9(1-2)