Finding optimal degenerate patterns in DNA sequences

Daisuke Shinozaki, Tatsuya Akutsu, Osamu Maruyama

Research output: Contribution to journal › Article

14 Citations (Scopus)

Abstract

Motivation: Finding transcription factor binding sites in the upstream regions of given genes is an algorithmically interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Σ is a sequence of subsets of Σ. A string over the IUPAC nucleic acid codes is thus a degenerate pattern over Σ = {A, C, G, T}, and such strings are among the major patterns used to model transcription factor binding sites in the upstream regions of genes. However, the problem of finding a degenerate pattern consistent with both positive and negative string sets is known to be NP-complete in general. Our aim is to devise a heuristic algorithm that finds a degenerate pattern which is optimal for positive and negative string sets with respect to a given score function. Results: We propose an enumerative algorithm called SUPERPOSITION, which uses a pruning technique to find optimal degenerate patterns and works with almost all reasonable score functions. The performance score of the algorithm was compared with those of other popular motif-finding algorithms, YMF, MEME and AlignACE, on various sets of co-regulated yeast genes. In the computational experiments, SUPERPOSITION outperformed the others on several gene sets. Availability: The Python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.
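
The definition of a degenerate pattern in the abstract can be made concrete with a short sketch. The code below is illustrative only and is not the SUPERPOSITION algorithm: it expands an IUPAC string into a sequence of subsets of {A, C, G, T} and counts how many positive and negative strings contain a match. The function names (`to_subsets`, `occurs_in`, `hit_counts`) and the example sequences are hypothetical; the IUPAC code table is the standard one.

```python
# Illustrative sketch only; NOT the SUPERPOSITION algorithm from the paper.
# Standard IUPAC nucleotide codes and the subsets of {A, C, G, T} they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_subsets(pattern):
    """Expand an IUPAC string into a degenerate pattern (a list of subsets)."""
    return [frozenset(IUPAC[code]) for code in pattern.upper()]


def occurs_in(pattern, sequence):
    """Return True if the degenerate pattern matches some substring of sequence."""
    subsets = to_subsets(pattern)
    width = len(subsets)
    sequence = sequence.upper()
    # Slide a window of the pattern's width over the sequence and check
    # that every character falls in the subset at its position.
    return any(
        all(sequence[i + j] in subsets[j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    )


def hit_counts(pattern, positives, negatives):
    """Count how many positive and negative strings contain the pattern."""
    pos = sum(occurs_in(pattern, s) for s in positives)
    neg = sum(occurs_in(pattern, s) for s in negatives)
    return pos, neg


# Hypothetical example data (not taken from the paper).
positives = ["TTCACGTGAA", "GGCACATGCC"]
negatives = ["TTTTTTTTTT", "GGCACCTGCC"]
print(hit_counts("CACRTG", positives, negatives))
```

Here `R` stands for the subset {A, G}, so `CACRTG` matches both `CACGTG` and `CACATG` but not `CACCTG`. A pattern that occurs in every positive string and in no negative string is consistent with both sets in the sense used above; a score function would typically be computed from counts like these.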

Original language: English
Pages (from-to): ii206-ii214
Journal: Bioinformatics
Volume: 19
Issue number: SUPPL. 2
DOI: 10.1093/bioinformatics/btg1079
Publication status: Published - Dec 1 2003

Fingerprint

DNA sequences
DNA Sequence
Genes
Transcription factors
Binding sites
Gene
Transcription Factors
Score Function
Strings
Binding Sites
Boidae
Transcription Factor
Nucleic acids
Heuristic algorithms
Computational Biology
Yeast
Nucleic Acids
Python
Software
Yeasts

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Finding optimal degenerate patterns in DNA sequences. / Shinozaki, Daisuke; Akutsu, Tatsuya; Maruyama, Osamu.

In: Bioinformatics, Vol. 19, No. SUPPL. 2, 01.12.2003, p. ii206-ii214.

Shinozaki, Daisuke ; Akutsu, Tatsuya ; Maruyama, Osamu. / Finding optimal degenerate patterns in DNA sequences. In: Bioinformatics. 2003 ; Vol. 19, No. SUPPL. 2. pp. ii206-ii214.
@article{dee4bce0f80f42f084cd8b0b2ac43f30,
title = "Finding optimal degenerate patterns in DNA sequences",
abstract = "Motivation: Finding transcription factor binding sites in the upstream regions of given genes is an algorithmically interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Σ is a sequence of subsets of Σ. A string over the IUPAC nucleic acid codes is thus a degenerate pattern over Σ = {A, C, G, T}, and such strings are among the major patterns used to model transcription factor binding sites in the upstream regions of genes. However, the problem of finding a degenerate pattern consistent with both positive and negative string sets is known to be NP-complete in general. Our aim is to devise a heuristic algorithm that finds a degenerate pattern which is optimal for positive and negative string sets with respect to a given score function. Results: We propose an enumerative algorithm called SUPERPOSITION, which uses a pruning technique to find optimal degenerate patterns and works with almost all reasonable score functions. The performance score of the algorithm was compared with those of other popular motif-finding algorithms, YMF, MEME and AlignACE, on various sets of co-regulated yeast genes. In the computational experiments, SUPERPOSITION outperformed the others on several gene sets. Availability: The Python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.",
author = "Daisuke Shinozaki and Tatsuya Akutsu and Osamu Maruyama",
year = "2003",
month = "12",
day = "1",
doi = "10.1093/bioinformatics/btg1079",
language = "English",
volume = "19",
pages = "ii206--ii214",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "SUPPL. 2",
}

TY - JOUR

T1 - Finding optimal degenerate patterns in DNA sequences

AU - Shinozaki, Daisuke

AU - Akutsu, Tatsuya

AU - Maruyama, Osamu

PY - 2003/12/1

Y1 - 2003/12/1

N2 - Motivation: Finding transcription factor binding sites in the upstream regions of given genes is an algorithmically interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Σ is a sequence of subsets of Σ. A string over the IUPAC nucleic acid codes is thus a degenerate pattern over Σ = {A, C, G, T}, and such strings are among the major patterns used to model transcription factor binding sites in the upstream regions of genes. However, the problem of finding a degenerate pattern consistent with both positive and negative string sets is known to be NP-complete in general. Our aim is to devise a heuristic algorithm that finds a degenerate pattern which is optimal for positive and negative string sets with respect to a given score function. Results: We propose an enumerative algorithm called SUPERPOSITION, which uses a pruning technique to find optimal degenerate patterns and works with almost all reasonable score functions. The performance score of the algorithm was compared with those of other popular motif-finding algorithms, YMF, MEME and AlignACE, on various sets of co-regulated yeast genes. In the computational experiments, SUPERPOSITION outperformed the others on several gene sets. Availability: The Python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.

UR - http://www.scopus.com/inward/record.url?scp=14744270169&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=14744270169&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg1079

DO - 10.1093/bioinformatics/btg1079

M3 - Article

C2 - 14534191

AN - SCOPUS:14744270169

VL - 19

SP - ii206

EP - ii214

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - SUPPL. 2

ER -