TY - GEN

T1 - Characteristic sets of strings common to semi-structured documents

AU - Ikeda, Daisuke

PY - 1999/1/1

Y1 - 1999/1/1

N2 - We consider Maximum Agreement Problem which is, given positive and negative documents, to find a characteristic set that matches many of positive documents but rejects many of negative ones. A characteristic set is a sequence (x1., xd) of strings such that each xi is a suffix of xi+1 and all xi’s appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user’s defined macros. For example, (“set”,“characteristic set”,“ characteristic set”) is a characteristic set extracted from an HTML file. But, an algorithm that solves Maximum Agreement Problem does not output useless characteristic sets, such as those made of only tags of HTML, since such characteristic sets may match most of positive doc- uments but also match most of negative ones. We present an algorithm that, given an integer d which is the number of strings in a characteristic set, solves Maximum Agreement Problem in O(n2 hd) time, where n is the total length of documents and h is the height of the suffix tree of the documents.

AB - We consider Maximum Agreement Problem which is, given positive and negative documents, to find a characteristic set that matches many of positive documents but rejects many of negative ones. A characteristic set is a sequence (x1., xd) of strings such that each xi is a suffix of xi+1 and all xi’s appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user’s defined macros. For example, (“set”,“characteristic set”,“ characteristic set”) is a characteristic set extracted from an HTML file. But, an algorithm that solves Maximum Agreement Problem does not output useless characteristic sets, such as those made of only tags of HTML, since such characteristic sets may match most of positive doc- uments but also match most of negative ones. We present an algorithm that, given an integer d which is the number of strings in a characteristic set, solves Maximum Agreement Problem in O(n2 hd) time, where n is the total length of documents and h is the height of the suffix tree of the documents.

UR - http://www.scopus.com/inward/record.url?scp=84919495567&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84919495567&partnerID=8YFLogxK

U2 - 10.1007/3-540-46846-3_13

DO - 10.1007/3-540-46846-3_13

M3 - Conference contribution

AN - SCOPUS:84919495567

SN - 354066713X

SN - 9783540667131

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 139

EP - 147

BT - Discovery Science - 2nd International Conference, DS 1999, Proceedings

A2 - Arikawa, Setsuo

A2 - Furukawa, Koichi

PB - Springer Verlag

T2 - 2nd International Conference on Discovery Science, DS 1999

Y2 - 6 December 1999 through 8 December 1999

ER -