Best fitting fixed-length substring patterns for a set of strings

Hirotaka Ono, Yen Kaow Ng

Research output: Contribution to journal › Conference article

3 Citations (Scopus)

Abstract

Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of "over-generalization", the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. Two of these we show to be solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.
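
The abstract does not state the paper's formal definitions, so the following is only an illustrative sketch of how membership and counting can combine to penalize over-generalization. It assumes, hypothetically, that the language of a set of fixed-length patterns is the set of strings containing at least one pattern as a substring, and that among pattern sets covering the sample, the one admitting the fewest strings of a given length fits best. The function names, the brute-force counting, and the selection rule are assumptions made for illustration, not the authors' algorithm or their density measure.

```python
from itertools import product

def is_member(s, patterns):
    """Membership (assumed semantics): s contains at least one pattern as a substring."""
    return any(p in s for p in patterns)

def count_members(n, patterns, alphabet):
    """Counting: number of length-n strings over `alphabet` in the language.

    Brute force over all |alphabet|**n strings; suitable for tiny examples only.
    """
    return sum(
        1
        for chars in product(alphabet, repeat=n)
        if is_member("".join(chars), patterns)
    )

def tightest(sample, candidates, alphabet, n):
    """Hypothetical selection rule: among candidate pattern sets that cover
    every sample string, return the one whose language has the fewest
    length-n strings (the least over-general one). Returns None if no
    candidate covers the sample."""
    covering = [c for c in candidates if all(is_member(s, c) for s in sample)]
    if not covering:
        return None
    return min(covering, key=lambda c: count_members(n, c, alphabet))

if __name__ == "__main__":
    sample = ["aab", "abb"]
    candidates = [("ab",), ("ab", "ba")]  # all patterns have fixed length 2
    for c in candidates:
        print(c, count_members(3, c, "ab"))
    print("tightest:", tightest(sample, candidates, "ab", 3))
```

In this toy case both candidates cover the sample, but the single pattern "ab" admits fewer length-3 strings than the union of "ab" and "ba", so the sketch prefers it.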

Original language: English
Pages (from-to): 240-250
Number of pages: 11
Journal: Lecture Notes in Computer Science
Volume: 3595
Publication status: Published - Oct 24 2005
Event: 11th Annual International Conference on Computing and Combinatorics, COCOON 2005 - Kunming, China
Duration: Aug 16 2005 → Aug 29 2005

Fingerprint

Molecular biology
Data mining
Strings
Polynomials
Counting Problems
Three Solutions
Molecular Biology
Knowledge Discovery
Polynomial time
Union
NP-complete problem

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Best fitting fixed-length substring patterns for a set of strings. / Ono, Hirotaka; Ng, Yen Kaow.

In: Lecture Notes in Computer Science, Vol. 3595, 24.10.2005, p. 240-250.

Ono, Hirotaka ; Ng, Yen Kaow. / Best fitting fixed-length substring patterns for a set of strings. In: Lecture Notes in Computer Science. 2005 ; Vol. 3595. pp. 240-250.
@article{d1d3ee8660be4df1ace140c9d05cc103,
title = "Best fitting fixed-length substring patterns for a set of strings",
abstract = "Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of {"}over-generalization{"}, the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. Two of these we show to be solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.",
author = "Hirotaka Ono and Ng, {Yen Kaow}",
year = "2005",
month = "10",
day = "24",
language = "English",
volume = "3595",
pages = "240--250",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Best fitting fixed-length substring patterns for a set of strings

AU - Ono, Hirotaka

AU - Ng, Yen Kaow

PY - 2005/10/24

Y1 - 2005/10/24

N2 - Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in the context of Knowledge Discovery as applied in Molecular Biology. Our main objective is to address the problem of "over-generalization", the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. Two of these we show to be solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.

UR - http://www.scopus.com/inward/record.url?scp=26844516081&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=26844516081&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:26844516081

VL - 3595

SP - 240

EP - 250

JO - Lecture Notes in Computer Science

JF - Lecture Notes in Computer Science

SN - 0302-9743

ER -