Collage system: A unifying framework for compressed pattern matching

Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, Setsuo Arikawa

Research output: Contribution to journal › Article

42 Citations (Scopus)

Abstract

We introduce a general framework which is suitable to capture the essence of compressed pattern matching according to various dictionary-based compressions. It is a formal system to represent a string by a pair of dictionary D and sequence S of phrases in D. The basic operations are concatenation, truncation, and repetition. We also propose a compressed pattern matching algorithm for the framework. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), RE-PAIR, SEQUITUR, and the static dictionary-based method. The proposed algorithm runs in O((||D||+|S|)·height(D)+m^2+r) time with O(||D||+m^2) space, where ||D|| is the size of D, |S| is the number of tokens in S, height(D) is the maximum dependency of tokens in D, m is the pattern length, and r is the number of pattern occurrences. For a subclass of the framework that contains no truncation, the time complexity is O(||D||+|S|+m^2+r).
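
To make the representation concrete, the following is a minimal Python sketch of a collage system as a dictionary D of rules plus a sequence S of phrase names. The class names (`Literal`, `Concat`, `Truncate`, `Repeat`), the rule names `X1` to `X5`, the reading of truncation as dropping a fixed number of leading or trailing characters, and the convention that a single-character rule has height 1 are all assumptions made for illustration; they are not taken from the paper. The sketch decodes the text by full expansion, which is exactly what the paper's matching algorithm avoids, so it illustrates only the data structure and not the proposed algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Literal:
    char: str  # a single character


@dataclass(frozen=True)
class Concat:
    left: str  # name of an earlier rule
    right: str  # name of an earlier rule


@dataclass(frozen=True)
class Truncate:
    source: str  # name of an earlier rule
    count: int  # number of characters to drop (assumed semantics)
    from_prefix: bool  # True: drop leading chars; False: drop trailing chars


@dataclass(frozen=True)
class Repeat:
    source: str  # name of an earlier rule
    times: int


Rule = Union[Literal, Concat, Truncate, Repeat]


def expand(dictionary: Dict[str, Rule], name: str, cache: Dict[str, str]) -> str:
    """Return the string a rule stands for, memoising intermediate results."""
    if name in cache:
        return cache[name]
    rule = dictionary[name]
    if isinstance(rule, Literal):
        text = rule.char
    elif isinstance(rule, Concat):
        text = expand(dictionary, rule.left, cache) + expand(dictionary, rule.right, cache)
    elif isinstance(rule, Truncate):
        whole = expand(dictionary, rule.source, cache)
        text = whole[rule.count:] if rule.from_prefix else whole[: len(whole) - rule.count]
    else:  # Repeat
        text = expand(dictionary, rule.source, cache) * rule.times
    cache[name] = text
    return text


def decode(dictionary: Dict[str, Rule], sequence: List[str]) -> str:
    """Naively decompress the pair (D, S) into the text it represents."""
    cache: Dict[str, str] = {}
    return "".join(expand(dictionary, name, cache) for name in sequence)


def height(dictionary: Dict[str, Rule], name: str) -> int:
    """Depth of the dependency chain below a rule (Literal counted as 1 here)."""
    rule = dictionary[name]
    if isinstance(rule, Literal):
        return 1
    if isinstance(rule, Concat):
        return 1 + max(height(dictionary, rule.left), height(dictionary, rule.right))
    return 1 + height(dictionary, rule.source)


# Hypothetical example dictionary and phrase sequence.
D: Dict[str, Rule] = {
    "X1": Literal("a"),
    "X2": Literal("b"),
    "X3": Concat("X1", "X2"),  # "ab"
    "X4": Repeat("X3", 3),  # "ababab"
    "X5": Truncate("X4", 1, from_prefix=True),  # "babab"
}
S: List[str] = ["X4", "X5", "X1"]

text = decode(D, S)
max_height = max(height(D, name) for name in D)
```

In this toy instance the truncation rule `X5` sits on top of a repetition, which is the kind of dependency chain that the height(D) factor in the stated time bound accounts for.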

Original language: English
Pages (from-to): 253-272
Number of pages: 20
Journal: Theoretical Computer Science
Volume: 298
Issue number: 1
DOIs: https://doi.org/10.1016/S0304-3975(02)00426-7
Publication status: Published - Apr 4 2003

Fingerprint

Pattern matching
Glossaries
Truncation
Compression
String Matching
D-space
Concatenation
Matching Algorithm
Time Complexity
Strings
Framework
Dictionary

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Collage system: A unifying framework for compressed pattern matching. / Kida, Takuya; Matsumoto, Tetsuya; Shibata, Yusuke; Takeda, Masayuki; Shinohara, Ayumi; Arikawa, Setsuo.

In: Theoretical Computer Science, Vol. 298, No. 1, 04.04.2003, p. 253-272.

Kida, T, Matsumoto, T, Shibata, Y, Takeda, M, Shinohara, A & Arikawa, S 2003, 'Collage system: A unifying framework for compressed pattern matching', Theoretical Computer Science, vol. 298, no. 1, pp. 253-272. https://doi.org/10.1016/S0304-3975(02)00426-7
Kida, Takuya; Matsumoto, Tetsuya; Shibata, Yusuke; Takeda, Masayuki; Shinohara, Ayumi; Arikawa, Setsuo. / Collage system: A unifying framework for compressed pattern matching. In: Theoretical Computer Science. 2003; Vol. 298, No. 1. pp. 253-272.
@article{f8f71a9e11ad4df1a3519d0542c6419c,
title = "Collage system: A unifying framework for compressed pattern matching",
abstract = "We introduce a general framework which is suitable to capture the essence of compressed pattern matching according to various dictionary-based compressions. It is a formal system to represent a string by a pair of dictionary D and sequence S of phrases in D. The basic operations are concatenation, truncation, and repetition. We also propose a compressed pattern matching algorithm for the framework. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), RE-PAIR, SEQUITUR, and the static dictionary-based method. The proposed algorithm runs in O((||D||+|S|)·height(D)+$m^2$+r) time with O(||D||+$m^2$) space, where ||D|| is the size of D, |S| is the number of tokens in S, height(D) is the maximum dependency of tokens in D, m is the pattern length, and r is the number of pattern occurrences. For a subclass of the framework that contains no truncation, the time complexity is O(||D||+|S|+$m^2$+r).",
author = "Takuya Kida and Tetsuya Matsumoto and Yusuke Shibata and Masayuki Takeda and Ayumi Shinohara and Setsuo Arikawa",
year = "2003",
month = "4",
day = "4",
doi = "10.1016/S0304-3975(02)00426-7",
language = "English",
volume = "298",
pages = "253--272",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",
number = "1",

}

TY - JOUR

T1 - Collage system

T2 - A unifying framework for compressed pattern matching

AU - Kida, Takuya

AU - Matsumoto, Tetsuya

AU - Shibata, Yusuke

AU - Takeda, Masayuki

AU - Shinohara, Ayumi

AU - Arikawa, Setsuo

PY - 2003/4/4

Y1 - 2003/4/4

AB - We introduce a general framework which is suitable to capture the essence of compressed pattern matching according to various dictionary-based compressions. It is a formal system to represent a string by a pair of dictionary D and sequence S of phrases in D. The basic operations are concatenation, truncation, and repetition. We also propose a compressed pattern matching algorithm for the framework. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), RE-PAIR, SEQUITUR, and the static dictionary-based method. The proposed algorithm runs in O((||D||+|S|)·height(D)+m^2+r) time with O(||D||+m^2) space, where ||D|| is the size of D, |S| is the number of tokens in S, height(D) is the maximum dependency of tokens in D, m is the pattern length, and r is the number of pattern occurrences. For a subclass of the framework that contains no truncation, the time complexity is O(||D||+|S|+m^2+r).

UR - http://www.scopus.com/inward/record.url?scp=0037418753&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037418753&partnerID=8YFLogxK

U2 - 10.1016/S0304-3975(02)00426-7

DO - 10.1016/S0304-3975(02)00426-7

M3 - Article

AN - SCOPUS:0037418753

VL - 298

SP - 253

EP - 272

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

IS - 1

ER -