CS 383, Algorithms
String matching: finite automata


String matching

Recall that string matching is the task of finding occurrences of a given "pattern" string within a target "text" string Inputs: a text string (length n) and a pattern string (length m) Outputs: all indices ("shifts") s such that the substring text[s+1...s+m] matches the pattern string exactly The Rabin-Karp algorithm for string matching, based on encoding strings as numbers, was discussed separately

String matching using finite automata

There is another approach to string matching Like Rabin-Karp, this approach attempts to split up the time as follows: time Theta(m^p) for preprocessing time Theta(n) for matching for a total time of Theta(m^p + n) We'll first need to study a new concept

Finite automata

A finite automaton (FA) is an extremely simple abstract computing model A FA is in one of a finite set of states at each moment in time It processes input strings one character at a time, jumping from one state to another based on what character was read A FA consists formally of several ingredients: a set of states (nodes of a graph) a special "start" state a special set of "accepting" states an alphabet of input symbols a "transition function" that determines the future state based on the current state and the current input symbol

Example of a finite automaton

A FA that accepts binary (0/1) strings that represent numbers that contain an odd number of 1's states are Even and Odd start state is Even only accept state is Odd input alphabet is {0, 1} transition function: delta(Even, 0) = Even delta(Even, 1) = Odd delta(Odd, 0) = Odd delta(Odd, 1) = Even

Basic idea of string matching using finite automata

Preprocessing: build a FA that accepts the pattern string Matching: feed the text string to the FA and start processing it; whenever the computation reaches an accepting state of the FA, print the shift, reset FA to the start state, and continue q = start state for (i=1; i<=n; i++) { q = delta(q, text[i]) if q is an accepting state print i - m } The matching time will clearly be Theta(n) for a text string of length n

Building the FA for a given pattern string

We'll need to formalize the FA construction procedure and to analyze its running time The formalization is based on a function called the suffix function, which is generated from the target pattern string

The suffix function for a given pattern string

Given a string pattern[1...m], define its suffix function sigma: sigma(any string s) = the length of the longest prefix of pattern that is a suffix of s For example, if the pattern is TATAC, then sigma(GATAATA) = 2, since the longest prefix of TATAC that is a suffix of GATAATA is TA, which has length 2

Construction of the FA in terms of the suffix function

We will specify the five ingredients of a FA that detects pattern[1..m] alphabet of input symbols is same as base string alphabet set of states is {0, 1, 2, ... m} each state q will be associated with the length q prefix of the pattern string start state is 0 only the length 0 prefix of the pattern has been read initially single accepting state is m accept only if the length m prefix of the pattern has been read the key ingredient is the transition function; here it is: delta(q, a) = sigma(P[1...q]a) this says that if the FA is in state q and reads the symbol a at the input, then the next state will be the length of the longest prefix of P that is a suffix of P[1...q]a translated: if you've previously detected exactly q symbols at the end of the input that match a prefix of the pattern string, and if you read the symbol a, then go to the state associated with the number of symbols that you've matched now (this will be q+1 if a is a match but less than that otherwise)

Example

Consider constructing a FA for the pattern string TAC states are 0, 1, 2, 3 (0 is start, 3 is accept) alphabet is {A, C, G, T} transition function: delta(0, A) = sigma(A) = 0 = delta(0, C) = delta(0, G) because A doesn't match the first symbol of TAC delta(0, T) = 1 since T matches (1 symbol matched so far) delta(1, A) = 2 since being in state 1 means that you've matched the length 1 prefix T already, and A is the correct second symbol of the pattern string delta(1, C) = delta(1, G) = 0 since neither C nor G are correct, and neither is the first symbol of the pattern string either delta(1, T) = 1 since T is not the correct second symbol, but it is the correct first symbol try working out the rest of this example...

Pseudocode for the FA construction procedure

for q=0 to m for each symbol a in the input alphabet { k = min(m, q+1) while (pattern[1...k] not a suffix of pattern[1...q]a) k = k-1 delta(q, a) = k } return delta

Running time of FA-based string matching

time = FA building time + matching time FA building time is O(m^3 * size of alphabet), for a pattern of length m (why? exercise) matching time is Theta(n) for text[1...n] total time is O(m^3 * size of alphabet) + Theta(n)

Other string matching algorithms

The Knuth-Morris-Pratt algorithm improves upon FA string matching by cutting down the preprocessing time to Theta(m) total time is Theta(m) + Theta(n) See the book by Cormen et al