String matching
Recall that string matching is the task of finding occurrences of a given "pattern" string within a target "text" string Inputs: a text string (length n) and a pattern string (length m) Outputs: all indices ("shifts") s such that the substring text[s+1...s+m] matches the pattern string exactly The Rabin-Karp algorithm for string matching, based on encoding strings as numbers, was discussed separatelyString matching using finite automata
There is another approach to string matching Like Rabin-Karp, this approach attempts to split up the time as follows: time Theta(m^p) for preprocessing time Theta(n) for matching for a total time of Theta(m^p + n) We'll first need to study a new conceptFinite automata
A finite automaton (FA) is an extremely simple abstract computing model A FA is in one of a finite set of states at each moment in time It processes input strings one character at a time, jumping from one state to another based on what character was read A FA consists formally of several ingredients: a set of states (nodes of a graph) a special "start" state a special set of "accepting" states an alphabet of input symbols a "transition function" that determines the future state based on the current state and the current input symbolExample of a finite automaton
A FA that accepts binary (0/1) strings that represent numbers that contain an odd number of 1's states are Even and Odd start state is Even only accept state is Odd input alphabet is {0, 1} transition function: delta(Even, 0) = Even delta(Even, 1) = Odd delta(Odd, 0) = Odd delta(Odd, 1) = EvenBasic idea of string matching using finite automata
Preprocessing: build a FA that accepts the pattern string Matching: feed the text string to the FA and start processing it; whenever the computation reaches an accepting state of the FA, print the shift, reset FA to the start state, and continue q = start state for (i=1; i<=n; i++) { q = delta(q, text[i]) if q is an accepting state print i - m } The matching time will clearly be Theta(n) for a text string of length nBuilding the FA for a given pattern string
We'll need to formalize the FA construction procedure and to analyze its running time The formalization is based on a function called the suffix function, which is generated from the target pattern stringThe suffix function for a given pattern string
Given a string pattern[1...m], define its suffix function sigma: sigma(any string s) = the length of the longest prefix of pattern that is a suffix of s For example, if the pattern is TATAC, then sigma(GATAATA) = 2, since the longest prefix of TATAC that is a suffix of GATAATA is TA, which has length 2Construction of the FA in terms of the suffix function
We will specify the five ingredients of a FA that detects pattern[1..m] alphabet of input symbols is same as base string alphabet set of states is {0, 1, 2, ... m} each state q will be associated with the length q prefix of the pattern string start state is 0 only the length 0 prefix of the pattern has been read initially single accepting state is m accept only if the length m prefix of the pattern has been read the key ingredient is the transition function; here it is: delta(q, a) = sigma(P[1...q]a) this says that if the FA is in state q and reads the symbol a at the input, then the next state will be the length of the longest prefix of P that is a suffix of P[1...q]a translated: if you've previously detected exactly q symbols at the end of the input that match a prefix of the pattern string, and if you read the symbol a, then go to the state associated with the number of symbols that you've matched now (this will be q+1 if a is a match but less than that otherwise)Example
Consider constructing a FA for the pattern string TAC states are 0, 1, 2, 3 (0 is start, 3 is accept) alphabet is {A, C, G, T} transition function: delta(0, A) = sigma(A) = 0 = delta(0, C) = delta(0, G) because A doesn't match the first symbol of TAC delta(0, T) = 1 since T matches (1 symbol matched so far) delta(1, A) = 2 since being in state 1 means that you've matched the length 1 prefix T already, and A is the correct second symbol of the pattern string delta(1, C) = delta(1, G) = 0 since neither C nor G are correct, and neither is the first symbol of the pattern string either delta(1, T) = 1 since T is not the correct second symbol, but it is the correct first symbol try working out the rest of this example...Pseudocode for the FA construction procedure
for q=0 to m for each symbol a in the input alphabet { k = min(m, q+1) while (pattern[1...k] not a suffix of pattern[1...q]a) k = k-1 delta(q, a) = k } return deltaRunning time of FA-based string matching
time = FA building time + matching time FA building time is O(m^3 * size of alphabet), for a pattern of length m (why? exercise) matching time is Theta(n) for text[1...n] total time is O(m^3 * size of alphabet) + Theta(n)Other string matching algorithms
The Knuth-Morris-Pratt algorithm improves upon FA string matching by cutting down the preprocessing time to Theta(m) total time is Theta(m) + Theta(n) See the book by Cormen et al