Efficient dynamic dictionary matching with DAWGs and AC-automata

Diptarama Hendrian, Shunsuke Inenaga, Ryo Yoshinaka, Ayumi Shinohara

Research output: Contribution to journal, Article

Abstract

Dictionary matching is the task of finding all occurrences of the patterns in a set D (called a dictionary) in a text string T. The Aho–Corasick automaton (AC-automaton) built on D is a fundamental data structure that solves the dictionary matching problem in O(d log σ) preprocessing time and O(n log σ + occ) matching time, where d is the total length of the patterns in D, n is the length of the text, σ is the alphabet size, and occ is the total number of occurrences of all patterns in the text. Dynamic dictionary matching is the variant in which patterns may be inserted into and deleted from D; if only insertions are allowed, the problem is called semi-dynamic dictionary matching. In this paper, we propose two efficient algorithms, each of which solves both problems with some modifications. Our first algorithm, based on the directed acyclic word graph (DAWG) of Blumer et al. (JACM 1987), supports insertion of a pattern of length m in O(m log σ + log d / log log d) time and pattern matching in O(n log σ + occ) time in the semi-dynamic setting. With some modifications, it supports both insertions and deletions in O(σm + log d / log log d) time and pattern matching in O(n (log d / log log d + log σ) + occ (log d / log log d)) time in the dynamic setting. Our second algorithm, based on the AC-automaton, supports insertions in O(m log σ + u_f + u_o) time in the semi-dynamic setting and both insertions and deletions in O(σm + u_f + u_o) time in the dynamic setting, where u_f and u_o denote the numbers of states whose failure function and output function, respectively, need to be updated. It performs pattern matching in O(n log σ + occ) time in both settings. This update time is optimal for AC-automaton-based methods over constant-size alphabets, since any algorithm that explicitly maintains the AC-automaton requires Ω(m + u_f + u_o) update time.
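
To make the abstract's terms concrete, the following is a minimal Python sketch of the classic static AC-automaton (goto, failure, and output functions). It is not the dynamic algorithm proposed in the paper. The names `build_ac`, `find_all`, and `changed_failure_links` are hypothetical, transitions use hash maps rather than the ordered structure behind the log σ factor, and patterns are assumed to be distinct and non-empty. The last helper only illustrates what u_f counts: it rebuilds the automaton from scratch and compares failure links, which a real dynamic method would avoid.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables for a static dictionary."""
    goto = [{}]     # goto[s][c] -> child of state s on character c
    fail = [0]      # fail[s] -> state of the longest proper suffix of s in the trie
    out = [[]]      # out[s] -> patterns that end at state s
    label = [""]    # label[s] -> string spelled by the path to s (illustration only)

    # Phase 1: insert every pattern into a trie.
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto[s][c] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                label.append(label[s] + c)
            s = goto[s][c]
        out[s].append(p)

    # Phase 2: breadth-first pass that sets failure links and merges outputs.
    queue = deque(goto[0].values())   # depth-1 states keep fail = root
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            # Copying outputs keeps matching simple but can use extra space.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out, label


def find_all(text, goto, fail, out):
    """Return (start_index, pattern) for every occurrence in text."""
    s = 0
    hits = []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


def changed_failure_links(patterns, new_pattern):
    """List existing states (by label) whose failure link moves after an insertion.

    Assumption: this rebuild-and-compare approach is for illustration only.
    """
    def links(pats):
        goto, fail, _, label = build_ac(pats)
        return {label[s]: label[fail[s]] for s in range(len(goto))}

    before = links(patterns)
    after = links(list(patterns) + [new_pattern])
    return sorted(x for x in before if before[x] != after[x])


goto, fail, out, _ = build_ac(["he", "she", "his", "hers"])
print(find_all("ushers", goto, fail, out))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
print(changed_failure_links(["she", "his"], "he"))
# ['she']
```

In the second call, inserting "he" into {"she", "his"} redirects the failure link of the state "she" from the root to the new state "he", so one existing state needs a failure-function update.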

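Original language: English
Pages (from-to): 161-172
Number of pages: 12
Journal: Theoretical Computer Science
Volume: 792
DOI: 10.1016/j.tcs.2018.04.016
Publication status: Published - 5 Nov 2019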

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Efficient dynamic dictionary matching with DAWGs and AC-automata. / Hendrian, Diptarama; Inenaga, Shunsuke; Yoshinaka, Ryo; Shinohara, Ayumi.

In: Theoretical Computer Science, Vol. 792, 05.11.2019, p. 161-172.

@article{ca3fbdef18d342b69ed050f0aef33984,
title = "Efficient dynamic dictionary matching with DAWGs and AC-automata",
author = "Diptarama Hendrian and Shunsuke Inenaga and Ryo Yoshinaka and Ayumi Shinohara",
year = "2019",
month = "11",
day = "5",
doi = "10.1016/j.tcs.2018.04.016",
language = "English",
volume = "792",
pages = "161--172",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",

}

TY - JOUR

T1 - Efficient dynamic dictionary matching with DAWGs and AC-automata

AU - Hendrian, Diptarama

AU - Inenaga, Shunsuke

AU - Yoshinaka, Ryo

AU - Shinohara, Ayumi

PY - 2019/11/5

Y1 - 2019/11/5

UR - http://www.scopus.com/inward/record.url?scp=85045537552&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045537552&partnerID=8YFLogxK

U2 - 10.1016/j.tcs.2018.04.016

DO - 10.1016/j.tcs.2018.04.016

M3 - Article

AN - SCOPUS:85045537552

VL - 792

SP - 161

EP - 172

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

ER -