### Abstract

Motivation: Finding transcription factor binding sites in the upstream regions of given genes is an algorithmically interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Σ is a sequence of subsets of Σ. A string over the IUPAC nucleic acid codes is a degenerate pattern over Σ = {A, C, G, T}, and such strings are one of the major pattern classes used to model transcription factor binding sites in the upstream regions of genes. However, the problem of finding a degenerate pattern consistent with both positive and negative string sets is known to be NP-complete in general. Our aim is to devise a heuristic algorithm that finds a degenerate pattern which is optimal for positive and negative string sets with respect to a given score function.

Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with almost all reasonable score functions. The performance score of the algorithm has been compared with those of the popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION outperformed the others on several gene sets.

Availability: The Python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html.
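To make the definition of a degenerate pattern concrete, the following is a minimal Python sketch of how an IUPAC string can be read as a sequence of subsets of {A, C, G, T} and matched against positive and negative string sets. It is not the SUPERPOSITION algorithm: there is no enumeration or pruning here. The function names, the toy sequences, and the `difference_score` function are assumptions made for illustration only; the abstract does not specify which score functions were used.

```python
# IUPAC nucleic acid codes, each standing for a subset of {A, C, G, T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_subsets(pattern):
    """Turn an IUPAC string into a degenerate pattern (a list of subsets)."""
    return [frozenset(IUPAC[code]) for code in pattern]


def occurs_in(pattern, sequence):
    """Return True if the degenerate pattern matches some window of sequence."""
    subsets = to_subsets(pattern)
    m = len(subsets)
    return any(
        all(sequence[i + j] in subsets[j] for j in range(m))
        for i in range(len(sequence) - m + 1)
    )


def count_matches(pattern, sequences):
    """Count how many sequences contain at least one occurrence of pattern."""
    return sum(occurs_in(pattern, s) for s in sequences)


def difference_score(pattern, positives, negatives):
    """Hypothetical score: matched fraction of positives minus that of negatives."""
    return (count_matches(pattern, positives) / len(positives)
            - count_matches(pattern, negatives) / len(negatives))


# Toy data, invented for this example.
positives = ["TTACGTAA", "GGACATCC", "ACGTGACG"]
negatives = ["TTTTTTTT", "GGGGCCCC"]

for candidate in ("ACGT", "ACRT"):
    print(candidate, difference_score(candidate, positives, negatives))
```

On this toy data the degenerate pattern `ACRT` (with R standing for A or G) matches all three positive strings and no negative string, whereas the exact string `ACGT` misses one positive. That gap is the kind of difference a score function over positive and negative sets is meant to capture.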

Original language | English |
---|---|
Pages (from-to) | ii206-ii214 |
Journal | Bioinformatics |
Volume | 19 |
Issue number | SUPPL. 2 |
DOIs | https://doi.org/10.1093/bioinformatics/btg1079 |
Publication status | Published - Dec 1 2003 |

### All Science Journal Classification (ASJC) codes

- Statistics and Probability
- Biochemistry
- Molecular Biology
- Computer Science Applications
- Computational Theory and Mathematics
- Computational Mathematics

### Cite this

Shinozaki, D., Akutsu, T., & Maruyama, O. (2003). Finding optimal degenerate patterns in DNA sequences. *Bioinformatics*, *19*(SUPPL. 2), ii206-ii214. https://doi.org/10.1093/bioinformatics/btg1079

Research output: Contribution to journal › Article

TY - JOUR

T1 - Finding optimal degenerate patterns in DNA sequences

AU - Shinozaki, Daisuke

AU - Akutsu, Tatsuya

AU - Maruyama, Osamu

PY - 2003/12/1

Y1 - 2003/12/1

UR - http://www.scopus.com/inward/record.url?scp=14744270169&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=14744270169&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btg1079

DO - 10.1093/bioinformatics/btg1079

M3 - Article

C2 - 14534191

AN - SCOPUS:14744270169

VL - 19

SP - ii206

EP - ii214

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - SUPPL. 2

ER -