Extracting Best Consensus Motifs from Positive and Negative Examples

Erika Tateishi, Osamu Maruyama, Satoru Miyano

Research output: Contribution to journal (Article)

Abstract

We define the best consensus motif (BCM) problem, motivated by the problem of extracting motifs from nucleic acid and amino acid sequences. A type over an alphabet Σ is a family Ω of subsets of Σ. A motif π of type Ω is a string π = π_1…π_n of motif components, each of which stands for an element of Ω. The BCM problem for Ω is, given a yes-no sample S = {(α^(1), β^(1)), ..., (α^(m), β^(m))} of pairs of strings over Σ with α^(i) ≠ β^(i) for 1 ≤ i ≤ m, to find a motif π of type Ω that maximizes the number of good pairs in S, where (α^(i), β^(i)) is good for π if π accepts α^(i) and rejects β^(i). We prove that the BCM problem is NP-complete even for a very simple type Ω_1 = {z | ∅ ≠ z ⊆ Σ}, which is used in practice for describing protein motifs in the PROSITE database. We also show that the problem remains NP-complete for the type Ω_∞ = Ω_1 ∪ {Σ^+} ∪ {Σ^(i, j) | 1 ≤ i ≤ j}, where Σ^(i, j) is the set of strings over Σ of length between i and j. Furthermore, for the BCM problem for Ω_1, we provide a polynomial-time greedy algorithm based on the probabilistic method. Its performance analysis gives an explicit approximation ratio for the algorithm.
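
The objective in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only evaluates a given motif of type Ω_1 against a yes-no sample. The abstract does not spell out what "accepts" means, so the sketch assumes that a motif π_1…π_n accepts a string exactly when the string has length n and its k-th character lies in π_k. The names `accepts` and `count_good_pairs`, and the DNA-like sample, are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

# A motif of type Omega_1: a sequence of non-empty subsets of the alphabet.
Motif = Sequence[FrozenSet[str]]


def accepts(motif: Motif, s: str) -> bool:
    """Assumed semantics: same length, and each character is in its component."""
    return len(s) == len(motif) and all(ch in comp for ch, comp in zip(s, motif))


def count_good_pairs(motif: Motif, sample: Sequence[Tuple[str, str]]) -> int:
    """Count pairs (alpha, beta) where the motif accepts alpha and rejects beta."""
    return sum(
        1 for alpha, beta in sample
        if accepts(motif, alpha) and not accepts(motif, beta)
    )


# Hypothetical example over the alphabet {A, C, G, T}.
motif = [frozenset("AG"), frozenset("C"), frozenset("ACGT")]
sample = [("ACT", "TCT"), ("GCA", "GCG"), ("ACG", "CCC")]
print(count_good_pairs(motif, sample))  # 2
```

Here the second pair is not good because the motif accepts both strings. The BCM problem asks for a motif that maximizes this count, which is the part the abstract shows to be NP-complete.
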
Original language: English
Pages (from-to): 55-64
Number of pages: 10
Journal: IEICE technical report. Theoretical foundations of Computing
Volume: 95
Issue number: 344
Publication status: Published - Oct 27 1995

Fingerprint

Strings
NP-completeness
Probabilistic Methods
Amino Acid Sequence
Greedy Algorithm
Polynomial-time Algorithm
Performance Analysis
NP-complete problem
Maximise
Protein
Subset
Approximation
Family

Cite this

Extracting Best Consensus Motifs from Positive and Negative Examples. / Tateishi, Erika; Maruyama, Osamu; Miyano, Satoru.

In: IEICE technical report. Theoretical foundations of Computing, Vol. 95, No. 344, 27.10.1995, p. 55-64.

@article{8b45be7ca709419ea5ac58b26177a805,
title = "Extracting Best Consensus Motifs from Positive and Negative Examples",
abstract = "We define the best consensus motif (BCM) problem, motivated by the problem of extracting motifs from nucleic acid and amino acid sequences. A type over an alphabet Σ is a family Ω of subsets of Σ. A motif π of type Ω is a string π = π_1…π_n of motif components, each of which stands for an element of Ω. The BCM problem for Ω is, given a yes-no sample S = {(α^(1), β^(1)), ..., (α^(m), β^(m))} of pairs of strings over Σ with α^(i) ≠ β^(i) for 1 ≤ i ≤ m, to find a motif π of type Ω that maximizes the number of good pairs in S, where (α^(i), β^(i)) is good for π if π accepts α^(i) and rejects β^(i). We prove that the BCM problem is NP-complete even for a very simple type Ω_1 = {z | ∅ ≠ z ⊆ Σ}, which is used in practice for describing protein motifs in the PROSITE database. We also show that the problem remains NP-complete for the type Ω_∞ = Ω_1 ∪ {Σ^+} ∪ {Σ^(i, j) | 1 ≤ i ≤ j}, where Σ^(i, j) is the set of strings over Σ of length between i and j. Furthermore, for the BCM problem for Ω_1, we provide a polynomial-time greedy algorithm based on the probabilistic method. Its performance analysis gives an explicit approximation ratio for the algorithm.",
author = "Erika Tateishi and Osamu Maruyama and Satoru Miyano",
year = "1995",
month = "10",
day = "27",
language = "English",
volume = "95",
pages = "55--64",
journal = "IEICE technical report. Theoretical foundations of Computing",
publisher = "The Institute of Electronics, Information and Communication Engineers (IEICE)",
number = "344",

}

TY - JOUR

T1 - Extracting Best Consensus Motifs from Positive and Negative Examples

AU - Tateishi, Erika

AU - Maruyama, Osamu

AU - Miyano, Satoru

PY - 1995/10/27

Y1 - 1995/10/27

AB - We define the best consensus motif (BCM) problem, motivated by the problem of extracting motifs from nucleic acid and amino acid sequences. A type over an alphabet Σ is a family Ω of subsets of Σ. A motif π of type Ω is a string π = π_1…π_n of motif components, each of which stands for an element of Ω. The BCM problem for Ω is, given a yes-no sample S = {(α^(1), β^(1)), ..., (α^(m), β^(m))} of pairs of strings over Σ with α^(i) ≠ β^(i) for 1 ≤ i ≤ m, to find a motif π of type Ω that maximizes the number of good pairs in S, where (α^(i), β^(i)) is good for π if π accepts α^(i) and rejects β^(i). We prove that the BCM problem is NP-complete even for a very simple type Ω_1 = {z | ∅ ≠ z ⊆ Σ}, which is used in practice for describing protein motifs in the PROSITE database. We also show that the problem remains NP-complete for the type Ω_∞ = Ω_1 ∪ {Σ^+} ∪ {Σ^(i, j) | 1 ≤ i ≤ j}, where Σ^(i, j) is the set of strings over Σ of length between i and j. Furthermore, for the BCM problem for Ω_1, we provide a polynomial-time greedy algorithm based on the probabilistic method. Its performance analysis gives an explicit approximation ratio for the algorithm.

UR - http://ci.nii.ac.jp/naid/110003191466

M3 - Article

VL - 95

SP - 55

EP - 64

JO - IEICE technical report. Theoretical foundations of Computing

JF - IEICE technical report. Theoretical foundations of Computing

IS - 344

ER -