### Abstract

We define the best consensus motif (BCM) problem, motivated by the problem of extracting motifs from nucleic acid and amino acid sequences. A type over an alphabet Σ is a family Ω of subsets of Σ*. A motif π of type Ω is a string π = π_{1} ... π_{n} of motif components, each of which stands for an element of Ω. The BCM problem for Ω is, given a yes-no sample S = {(α^{(1)}, β^{(1)}), ..., (α^{(m)}, β^{(m)})} of pairs of strings in Σ* with α^{(i)} ≠ β^{(i)} for 1 ≤ i ≤ m, to find a motif π of type Ω that maximizes the number of good pairs in S, where (α^{(i)}, β^{(i)}) is good for π if π accepts α^{(i)} and rejects β^{(i)}. We prove that the BCM problem is NP-complete even for a very simple type (Formula presented), which is used in practice for describing protein motifs in the PROSITE database. We also show that the problem remains NP-complete for the type Ω_{∞} = Ω_{1} ∪ {Σ^{+}} ∪ {Σ^{[i,j]} | 1 ≤ i ≤ j}, where Σ^{[i,j]} is the set of strings over Σ of length between i and j. Furthermore, for the BCM problem for Ω_{1}, we provide a polynomial-time greedy algorithm based on the probabilistic method. Its performance analysis yields an explicit approximation ratio for the algorithm.
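To make the objective concrete, the following is a minimal sketch of how the number of good pairs could be counted for a fixed motif. It rests on assumptions not stated in the abstract: each motif component is modelled as a nonempty set of single characters, and a motif accepts a string exactly when the string has the same length as the motif and each character belongs to the corresponding component. The names `accepts` and `count_good_pairs` and the sample data are hypothetical, and the sketch only evaluates a given motif; it is not the greedy algorithm of the paper.

```python
from typing import List, Set, Tuple

# Assumption: a motif is a sequence of character classes (sets of symbols).
Motif = List[Set[str]]


def accepts(motif: Motif, s: str) -> bool:
    """Return True if s matches the motif position by position."""
    return len(s) == len(motif) and all(
        ch in component for ch, component in zip(s, motif)
    )


def count_good_pairs(motif: Motif, sample: List[Tuple[str, str]]) -> int:
    """Count pairs (alpha, beta) where the motif accepts alpha and rejects beta."""
    return sum(
        1 for alpha, beta in sample
        if accepts(motif, alpha) and not accepts(motif, beta)
    )


# Hypothetical motif over the DNA alphabet and a small yes-no sample.
motif = [{"A", "G"}, {"C"}, {"A", "C", "G", "T"}]
sample = [("ACT", "TCT"), ("GCA", "GGA"), ("TCA", "ACA")]
score = count_good_pairs(motif, sample)
```

Under these assumptions, the BCM problem asks for the motif that maximizes `count_good_pairs` over all motifs of the given type.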

