CS 383, Algorithms
String matching: finite automata

String matching

	Recall that string matching is the task of finding occurrences of a 
	given "pattern" string within a target "text" string

	Inputs: a text string (length n) and a pattern string (length m)

	Outputs: all indices ("shifts") s such that the substring 
	text[s+1...s+m] matches the pattern string exactly

	The Rabin-Karp algorithm for string matching, based on encoding strings as numbers, was discussed separately


String matching using finite automata

	There is another approach to string matching

	Like Rabin-Karp, this approach attempts to split up the time as follows:

		time Theta(m^p) for preprocessing

		time Theta(n) for matching

	for a total time of Theta(m^p + n)

	We'll first need to study a new concept


Finite automata

	A finite automaton (FA) is an extremely simple abstract computing model

	A FA is in one of a finite set of states at each moment in time

	It processes input strings one character at a time,
	jumping from one state to another based on what character was read

	A FA consists formally of several ingredients:

		a set of states (nodes of a graph)

		a special "start" state

		a special set of "accepting" states

		an alphabet of input symbols

		a "transition function" that determines the future state
		based on the current state and the current input symbol

		
Example of a finite automaton

	A FA that accepts binary (0/1) strings that represent numbers
	that contain an odd number of 1's

		states are Even and Odd

		start state is Even

		only accept state is Odd

		input alphabet is {0, 1}

		transition function:

			delta(Even, 0) = Even
			delta(Even, 1) = Odd

			delta(Odd, 0) = Odd
			delta(Odd, 1) = Even

Basic idea of string matching using finite automata

	Preprocessing: build a FA that accepts the pattern string

	Matching: feed the text string to the FA and start processing it;
	whenever the computation reaches an accepting state of the FA, 
	print the shift, reset FA to the start state, and continue

	q = start state
	for (i=1; i<=n; i++) {
		q = delta(q, text[i])
		if q is an accepting state
			print i - m
	}

	The matching time will clearly be Theta(n) for a text string of length n


Building the FA for a given pattern string

	We'll need to formalize the FA construction procedure and to analyze its running time

	The formalization is based on a function called the suffix function, which is generated from the target pattern string


The suffix function for a given pattern string

	Given a string pattern[1...m], define its suffix function sigma:

	sigma(any string s) = the length of the longest prefix of pattern
				that is a suffix of s

	For example, if the pattern is TATAC, then

		sigma(GATAATA) = 2, since the longest prefix of TATAC
		that is a suffix of GATAATA is TA, which has length 2


Construction of the FA in terms of the suffix function

	We will specify the five ingredients of a FA that detects pattern[1..m]

	alphabet of input symbols is same as base string alphabet

	set of states is {0, 1, 2, ... m}

		each state q will be associated with the length q prefix of the pattern string

	start state is 0

		only the length 0 prefix of the pattern has been read initially

	single accepting state is m

		accept only if the length m prefix of the pattern has been read

	the key ingredient is the transition function; here it is:

		delta(q, a) = sigma(P[1...q]a)

	this says that if the FA is in state q and reads the symbol a at the input, 
	then the next state will be the length of the longest prefix of P that is a suffix of P[1...q]a

	translated: if you've previously detected exactly q symbols at the end of the input 
	that match a prefix of the pattern string, and if you read the symbol a, 
	then go to the state associated with the number of symbols that you've matched now 
	(this will be q+1 if a is a match but less than that otherwise)


Example

	Consider constructing a FA for the pattern string TAC

	states are 0, 1, 2, 3 (0 is start, 3 is accept)

	alphabet is {A, C, G, T}

	transition function:

		delta(0, A) = sigma(A) = 0 = delta(0, C) = delta(0, G)
		because A doesn't match the first symbol of TAC

		delta(0, T) = 1 since T matches (1 symbol matched so far)

		delta(1, A) = 2 since being in state 1 means that you've matched the length 1 prefix T already,
		and A is the correct second symbol of the pattern string

		delta(1, C) = delta(1, G) = 0 since neither C nor G are correct, and neither is the first symbol of the pattern string either

		delta(1, T) = 1 since T is not the correct second symbol, but it is the correct first symbol

	try working out the rest of this example...


Pseudocode for the FA construction procedure

	for q=0 to m
		for each symbol a in the input alphabet {
			k = min(m, q+1)	
			while (pattern[1...k] not a suffix of pattern[1...q]a)
				k = k-1
			delta(q, a) = k
		}
	return delta


Running time of FA-based string matching

	time = FA building time + matching time

	FA building time is O(m^3 * size of alphabet), 
	for a pattern of length m
	(why? exercise)

	matching time is Theta(n) for text[1...n]

	total time is O(m^3 * size of alphabet) + Theta(n)


Other string matching algorithms

	The Knuth-Morris-Pratt algorithm improves upon FA string matching by cutting down the preprocessing time to Theta(m)

	total time is Theta(m) + Theta(n)

	See the book by Cormen et al
CS 383, Algorithms String matching: finite automata

String matching

String matching using finite automata

Finite automata

Example of a finite automaton

Basic idea of string matching using finite automata

Building the FA for a given pattern string

The suffix function for a given pattern string

Construction of the FA in terms of the suffix function

Example

Pseudocode for the FA construction procedure

Running time of FA-based string matching

Other string matching algorithms

CS 383, Algorithms
String matching: finite automata