A string pattern regression algorithm and its application to pattern discovery in long introns.

Hideo Bannai, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Satoru Miyano

Research output: Contribution to journal (article)

17 citations (Scopus)

Abstract

We present a new approach to pattern discovery called string pattern regression, in which we are given a data set consisting of a string attribute and an objective numerical attribute. The problem is to find the string pattern that best divides the data set, so that the distribution of the numerical attribute values over the subset whose string attribute matches the pattern is most distinct, with respect to some appropriate measure, from the distribution over the subset whose string attribute does not match. Solving this problem simultaneously yields a subset of the data whose objective numerical attributes differ significantly from the rest of the data and a splitting rule, in the form of a string pattern, that is conserved in that subset. Although the problem can be solved in linear time for the substring pattern class, it is NP-hard in the general case (i.e., for more complex patterns), and we present an exact but efficient branch-and-bound algorithm that is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications such as the analysis of microarray gene expression data.
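The problem statement in the abstract can be illustrated with a small sketch for the substring pattern class. This is not the authors' linear-time method or their branch-and-bound algorithm: it is a brute-force search written only to make the objective concrete. The abstract leaves the measure open ("some appropriate measure"), so the score used here, the reduction in the sum of squared errors achieved by the split, is an assumption, as are the function names `split_by_pattern`, `score`, and `best_substring_pattern` and the toy sequences.

```python
from statistics import mean


def split_by_pattern(data, pattern):
    """Partition the numeric values by whether `pattern` occurs in the string."""
    matched = [y for s, y in data if pattern in s]
    unmatched = [y for s, y in data if pattern not in s]
    return matched, unmatched


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def score(matched, unmatched):
    """Assumed measure: how much the split reduces the sum of squared errors.

    A split that leaves one side empty separates nothing, so it scores 0.
    """
    if not matched or not unmatched:
        return 0.0
    return sse(matched + unmatched) - sse(matched) - sse(unmatched)


def best_substring_pattern(data, max_len=None):
    """Brute-force search over every distinct substring of the input strings.

    `data` is a list of (string, number) pairs. Returns (pattern, score),
    or (None, 0.0) when no substring produces a non-trivial split.
    """
    candidates = set()
    for s, _ in data:
        for i in range(len(s)):
            end = len(s) if max_len is None else min(len(s), i + max_len)
            for j in range(i + 1, end + 1):
                candidates.add(s[i:j])

    best, best_score = None, 0.0
    for p in sorted(candidates):  # sorted so that ties break deterministically
        sc = score(*split_by_pattern(data, p))
        if sc > best_score:
            best, best_score = p, sc
    return best, best_score


if __name__ == "__main__":
    # Hypothetical toy data: (sequence, numerical attribute).
    toy = [("ACGTGA", 10.0), ("TTGTGC", 9.0), ("ACCCTA", 1.0), ("TTACCA", 2.0)]
    print(best_substring_pattern(toy))
```

Enumerating all substrings is quadratic in the length of each string and is only workable for toy inputs. The abstract's linear-time result for substring patterns, and the branch-and-bound search for richer pattern classes, address exactly this cost.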

Original language: English
Pages (from-to): 3-11
Number of pages: 9
Journal: Genome informatics. International Conference on Genome Informatics
Volume: 13
Publication status: Published - 1 Jan 2002
Externally published: Yes

Fingerprint

Introns
Zebrafish
Diptera
Gene Expression
Datasets

All Science Journal Classification (ASJC) codes

  • Medicine(all)

Cite this

@article{e604121fc4ee4666a125ce204cdc53bf,
title = "A string pattern regression algorithm and its application to pattern discovery in long introns.",
abstract = "We present a new approach to pattern discovery called string pattern regression, in which we are given a data set consisting of a string attribute and an objective numerical attribute. The problem is to find the string pattern that best divides the data set, so that the distribution of the numerical attribute values over the subset whose string attribute matches the pattern is most distinct, with respect to some appropriate measure, from the distribution over the subset whose string attribute does not match. Solving this problem simultaneously yields a subset of the data whose objective numerical attributes differ significantly from the rest of the data and a splitting rule, in the form of a string pattern, that is conserved in that subset. Although the problem can be solved in linear time for the substring pattern class, it is NP-hard in the general case (i.e., for more complex patterns), and we present an exact but efficient branch-and-bound algorithm that is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications such as the analysis of microarray gene expression data.",
author = "Hideo Bannai and Shunsuke Inenaga and Ayumi Shinohara and Masayuki Takeda and Satoru Miyano",
year = "2002",
month = "1",
day = "1",
language = "English",
volume = "13",
pages = "3--11",
journal = "Genome informatics. International Conference on Genome Informatics",
issn = "0919-9454",
publisher = "Universal Academy Press",

}

TY - JOUR

T1 - A string pattern regression algorithm and its application to pattern discovery in long introns.

AU - Bannai, Hideo

AU - Inenaga, Shunsuke

AU - Shinohara, Ayumi

AU - Takeda, Masayuki

AU - Miyano, Satoru

PY - 2002/1/1

Y1 - 2002/1/1

N2 - We present a new approach to pattern discovery called string pattern regression, in which we are given a data set consisting of a string attribute and an objective numerical attribute. The problem is to find the string pattern that best divides the data set, so that the distribution of the numerical attribute values over the subset whose string attribute matches the pattern is most distinct, with respect to some appropriate measure, from the distribution over the subset whose string attribute does not match. Solving this problem simultaneously yields a subset of the data whose objective numerical attributes differ significantly from the rest of the data and a splitting rule, in the form of a string pattern, that is conserved in that subset. Although the problem can be solved in linear time for the substring pattern class, it is NP-hard in the general case (i.e., for more complex patterns), and we present an exact but efficient branch-and-bound algorithm that is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications such as the analysis of microarray gene expression data.

UR - http://www.scopus.com/inward/record.url?scp=0642313001&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0642313001&partnerID=8YFLogxK

M3 - Article

VL - 13

SP - 3

EP - 11

JO - Genome informatics. International Conference on Genome Informatics

JF - Genome informatics. International Conference on Genome Informatics

SN - 0919-9454

ER -