CS 383, Algorithms
String matching: the Rabin-Karp algorithm


A new algorithmic challenge: evaluating a polynomial

A polynomial is a linear combination of non-negative powers of x:

    p(x) = c0 + c1*x + c2*x^2 + ... + cn*x^n

The highest power n is called the degree of the polynomial. Polynomials are useful in many places, including:
- string matching
- signal processing
- encryption

Consider the task of evaluating a given polynomial at a given point x.
- Inputs: a polynomial p(x) = c0 + c1*x + c2*x^2 + ... + cn*x^n (given as an array of coefficients, for example), and a specific numerical value x = x*
- Outputs: the value p(x*) of the polynomial p(x) at the value x = x*

Efficiency of algorithms for polynomial evaluation

The polynomial evaluation task is not at all difficult to address. Here's an obvious way:

    total = 0
    for (i=0; i<=n; i++)
        total += ci*(x to the power i)
    return total

What's the running time of this algorithm? That depends on what the basic steps are; let's assume that addition and multiplication are basic steps that can be carried out in time O(1). The big-Theta running time is dominated by the time for the for loop:

    time = sum from i=0 to i=n of (time for i-th pass)

The i-th pass involves:
- one assignment
- one addition
- one multiplication
- one computation of "x to the power i"

(A runnable sketch of this algorithm appears below.)
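As a concrete illustration, here is a minimal C++ sketch of the obvious algorithm; the coefficient array c[0..n] and the function names are ours, not part of the notes, and the power helper is written out so that its cost is visible:

    #include <cstdio>

    // Naive power: computes x^i by repeated multiplication.
    double naive_power(double x, int i) {
        double result = 1.0;
        for (int k = 0; k < i; k++)
            result *= x;
        return result;
    }

    // Obvious evaluation: the i-th pass costs Theta(i), so the total is Theta(n^2).
    double eval_naive(const double c[], int n, double x) {
        double total = 0.0;
        for (int i = 0; i <= n; i++)
            total += c[i] * naive_power(x, i);
        return total;
    }

    int main() {
        double c[] = {1.0, 2.0, 3.0};                 // p(x) = 1 + 2x + 3x^2
        std::printf("%g\n", eval_naive(c, 2, 4.0));   // p(4) = 57
    }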

Evaluating the i-th power of x

Powers must be calculated using multiplication. The obvious technique requires i-1 multiplications:

    x^i = x*x*x*...*x   (i copies of x, i-1 multiplications)

If we use it, the i-th pass through the polynomial evaluation loop requires time Theta(i), and the grand total for the entire polynomial evaluation algorithm is

    sum from i=0 to i=n of Theta(i) = Theta(n^2)

A better way

You can reuse x^i when computing x^(i+1). Like this:

    total = 0
    power = 1
    for (i=0; i<=n; i++) {
        total += ci*power
        power *= x
    }
    return total

The i-th pass requires:
- one assignment
- one addition
- two multiplications

Total time for this algorithm is Theta(n). Simple idea, nice efficiency improvement! We'll have occasion to use similar ideas again later.
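A minimal C++ version of the same loop, assuming the same hypothetical coefficient-array representation as in the earlier sketch:

    // Linear-time evaluation: carry the running power across passes.
    double eval_running_power(const double c[], int n, double x) {
        double total = 0.0;
        double power = 1.0;        // invariant: power == x^i at the top of pass i
        for (int i = 0; i <= n; i++) {
            total += c[i] * power;
            power *= x;            // x^(i+1) from x^i: one multiplication
        }
        return total;
    }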

An even better way

We're not going to do better than Theta(n), but it's actually possible to reduce the constant. The idea is known as Horner's rule, which rests on the nested form p(x) = c0 + x*(c1 + x*(c2 + ... + x*cn)). It evaluates starting with the highest power:

    total = cn
    for (i=n-1; i>=0; i--)
        total = ci + x*total
    return total

The i-th pass requires:
- one assignment
- one addition
- one multiplication (down from two!)

Total time is still Theta(n), but we've improved the constant.
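A C++ sketch of Horner's rule under the same assumptions; note the single multiply-add per pass:

    // Horner's rule: evaluates p(x) = c0 + x*(c1 + x*(c2 + ... + x*cn))
    // with one addition and one multiplication per coefficient.
    double eval_horner(const double c[], int n, double x) {
        double total = c[n];
        for (int i = n - 1; i >= 0; i--)
            total = c[i] + x * total;
        return total;
    }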

String matching

String matching is the task of finding occurrences of a given "pattern" string within a target "text" string.
- Inputs: a text string (length n) and a pattern string (length m)
- Outputs: all indices ("shifts") s such that the substring text[s+1...s+m] matches the pattern string exactly

String matching is useful in many contexts, including:
- web search
- bioinformatics

The obvious ("naive") string matching algorithm

Just compare each possible length-m substring of the text string with the target pattern string:

    for (s=0; s<=n-m; s++)
        if (text[s+1...s+m] == pattern[1...m])
            print s

Simple enough, and works correctly. How efficient is it? We'll seek to express the running time in terms of n and m:
- n-m+1 passes are made through the loop
- the s-th pass does up to m symbol comparisons and possibly one print statement
- total worst-case time is Theta((n-m+1)*m)
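A minimal C++ rendering of the naive matcher (using 0-based shifts, so the window at shift s is text[s..s+m-1] rather than the notes' 1-based text[s+1...s+m]):

    #include <cstdio>
    #include <string>

    // Naive matcher: try every shift s and compare character by character.
    void naive_match(const std::string& text, const std::string& pattern) {
        int n = (int)text.size(), m = (int)pattern.size();
        for (int s = 0; s + m <= n; s++)
            if (text.compare(s, m, pattern) == 0)   // up to m symbol comparisons
                std::printf("match at shift %d\n", s);
    }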

Can we do better?

Shifting the text string over by 1 every time can be wasteful.
- E.g., suppose that the target pattern is TAAATA
- If there is a match starting at position 10 of the text string, then position 11 can't possibly work (the symbol there is A, but the pattern starts with T); in fact, the next feasible starting position would be 14 in this case
- This suggests that there may be ways to speed up string matching
- We'll start with an alternate technique that doesn't beat the running time of the naive algorithm in the worst case, but that does significantly better on average

Preamble: coding strings as numbers

Decimal notation uses strings to represent numbers: the string dn ... d0 represents the number dn*10^n + ... + d1*10 + d0. Given any "alphabet" of possible "digits" (like 0...9 in the case of decimal notation), we can associate a number with a string of symbols from that alphabet:
- Let d be the total number of symbols in the alphabet
- Order the symbols in some way, as a0 ... a(d-1)
- Associate to each symbol a(k) the value k
- Then view each string w[1...m] over this alphabet as corresponding to a number in "base d":

      value(w[1])*d^(m-1) + ... + value(w[m-1])*d + value(w[m])

For example, if d=10 (10 symbols in the alphabet), this is just the ordinary decimal value of the digit string. As another example, for the alphabet {A, T, C, G}, with digit values

    value(A) = 0
    value(T) = 1
    value(C) = 2
    value(G) = 3

the string TAC corresponds to the base-4 numeral 102, i.e., the number 1*4^2 + 0*4 + 2 = 18.
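As an illustration, here is a small C++ helper computing this base-d value for the 4-symbol alphabet above (the function names are ours, purely for illustration):

    #include <string>

    // Digit values for the 4-symbol alphabet {A, T, C, G}.
    long long digit_value(char ch) {
        switch (ch) {
            case 'A': return 0;
            case 'T': return 1;
            case 'C': return 2;
            case 'G': return 3;
        }
        return -1;   // not in the alphabet
    }

    // Base-4 value of a DNA string: shift left one base-4 digit, add the new digit.
    long long string_value(const std::string& w) {
        long long v = 0;
        for (char ch : w)
            v = 4 * v + digit_value(ch);
        return v;    // string_value("TAC") == 1*16 + 0*4 + 2 == 18
    }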

The Rabin-Karp algorithm

What is to be gained by the above complicated numerical encoding? The answer is that it may allow faster string matching.
- Instead of matching the pattern with shifted text substrings symbol by symbol, we match their numerical values
- The numerical value of the target pattern can be precomputed in time Theta(m), where m = length of pattern string
- In the loop, the numerical value of each text shift is computed and compared with the numerical value of the target pattern
- If the values of all the shifts could be computed in time Theta(n-m+1), then the total running time would be

      Theta(m) + Theta(n-m+1) = Theta(n)

  which would be great! (much better than the "naive" algorithm)

The trick for computing all the shifted values

Our hopeful reasoning above hinges on being able to compute the numerical values of all length-m shifts of text in time Theta(n-m+1). How can we do this? Think of decimal strings, e.g.:

    783452936

- Suppose we're interested in substrings of length m=5
- The first shift has value 78345
- The second shift has value 83452
- If we have to compute the second shift from scratch, it will take us a long time (how long?)
- Fortunately, we can use the value of the first shift to get started:

      83452 = 10*(78345 - 70000) + 2

In general, we can compute

    value(text[s+2...s+m+1]) = 10*(value(text[s+1...s+m]) - 10^(m-1)*value(text[s+1])) + value(text[s+m+1])

This can be done in constant time O(1)! (A small code sketch of this sliding-window update follows.)
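A small C++ demonstration of the sliding-window update on the decimal example above; this sketch assumes the window values fit in a 64-bit integer (the difficulty taken up two sections below):

    #include <cstdio>
    #include <string>

    // Prints the values of all length-m windows of a decimal string,
    // using the constant-time update rather than recomputing from scratch.
    int main() {
        std::string text = "783452936";
        int m = 5, n = (int)text.size();

        long long high = 1;                      // high = 10^(m-1)
        for (int i = 0; i < m - 1; i++) high *= 10;

        long long t = 0;                         // value of the first window
        for (int i = 0; i < m; i++) t = 10 * t + (text[i] - '0');

        for (int s = 0; s + m <= n; s++) {
            std::printf("shift %d: %lld\n", s, t);   // 78345, 83452, 34529, ...
            if (s + m < n)                            // drop leading digit, append next
                t = 10 * (t - (text[s] - '0') * high) + (text[s + m] - '0');
        }
    }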

The pre-Rabin-Karp algorithm

    compute p = value(pattern)        // time Theta(m)
    compute t = value(text[1...m])    // time Theta(m)
    for (s=0; s<=n-m; s++) {
        if (t == p)
            print s
        if (s < n-m)
            t = 10*(t - value(text[s+1])*10^(m-1)) + value(text[s+m+1])
    }

According to the above analysis, this algorithm runs in time Theta(n). However, we'll need to address a difficulty.

Running time of the pre-Rabin-Karp algorithm

We argued that the above algorithm should run in time Theta(n):
- time Theta(m) for preprocessing (polynomial evaluation)
- n-m+1 passes through the for loop
- time Theta(1) on each pass
- total time = Theta(m) + Theta(n-m+1) = Theta(n)

However, there is a difficulty: the numerical values can be outside the standard integer or long integer ranges unless the alphabet is very small and the pattern string is short. For instance, even over the 4-symbol alphabet {A, T, C, G}, a pattern of just 32 symbols produces values as large as 4^32 - 1, which already exceeds the range of a signed 64-bit integer.

The modular remainder trick

The actual Rabin-Karp algorithm deals with the range problem by working with remainders modulo a preselected prime number q:
- p is the value of the pattern mod q
- t is the value of the shifted text mod q

For example, if q=11 and if

    value(pattern) = 17
    value(text[s+1...s+m]) = 25

then the algorithm actually uses the remainders 6 and 3 instead. Notice that use of the remainders means that equality of p and t no longer implies equality of the underlying strings. Hence, p==t is only a "tentative match" that must be verified by comparing the actual strings.

Rabin-Karp pseudocode

Assumes an alphabet with d symbols instead of only 10:

    h = power(d, m-1) % q
    p = value(pattern) % q
    t = value(text[1...m]) % q
    for (s=0; s <= n-m; s++) {
        if (p == t)
            if (pattern == text[s+1...s+m])   // verify the tentative match
                print s
        if (s < n-m) {
            t = (d*(t - value(text[s+1])*h) + value(text[s+m+1])) % q
            if (t < 0) t += q    // keep the remainder non-negative (needed if % can return negatives)
        }
    }

(A runnable C++ version appears below.)
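Here is a runnable C++ sketch of the full algorithm, specialized to the hypothetical 4-symbol DNA alphabet used earlier (0-based shifts; the prime q = 101 is an arbitrary illustrative choice, small enough that d*q fits comfortably in 64 bits):

    #include <cstdio>
    #include <string>

    long long digit_value(char ch) {   // A=0, T=1, C=2, G=3
        switch (ch) { case 'A': return 0; case 'T': return 1;
                      case 'C': return 2; case 'G': return 3; }
        return -1;
    }

    void rabin_karp(const std::string& text, const std::string& pattern, long long q) {
        const long long d = 4;
        int n = (int)text.size(), m = (int)pattern.size();
        if (m > n) return;

        long long h = 1;                          // h = d^(m-1) mod q
        for (int i = 0; i < m - 1; i++) h = (h * d) % q;

        long long p = 0, t = 0;                   // pattern value and first window value, mod q
        for (int i = 0; i < m; i++) {
            p = (d * p + digit_value(pattern[i])) % q;
            t = (d * t + digit_value(text[i])) % q;
        }

        for (int s = 0; s + m <= n; s++) {
            if (p == t && text.compare(s, m, pattern) == 0)   // verify tentative match
                std::printf("match at shift %d\n", s);
            if (s + m < n) {
                t = (d * (t - digit_value(text[s]) * h) + digit_value(text[s + m])) % q;
                if (t < 0) t += q;                // C++ % can be negative
            }
        }
    }

    int main() {
        rabin_karp("GTAACTAACG", "TAAC", 101);    // prints shifts 1 and 5
    }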

Running time of the Rabin-Karp algorithm

- Theta(m) for preprocessing (computation of powers mod q, and of p and the initial t)
- n - m + 1 passes through the for loop
- Theta(1) time for the initial mod-q comparison, but Theta(m) time for full verification of each tentative match
- Worst-case time: Theta(m) + Theta((n-m+1)*m) = Theta((n-m+1)*m) (same as the obvious algorithm that checks all text shifts)
- Best-case time: Theta(m) + Theta(n-m+1) = Theta(n) if there are no tentative matches

Average case running time of Rabin-Karp

What happens "on average"?
- "True alarms" (t==p and equal strings) occur a certain number of times
  - call the total number of true alarms "tac" (true alarm count)
  - total time spent on true alarms is tac*Theta(m), which is Theta(m) if tac = O(1)
- "False alarms" (t==p but unequal strings) occur a fraction of the time
  - an educated guess is based on assuming that the mod q remainders are equally likely to be any number between 0 and q-1, so that a false alarm occurs about 1/q of the time
  - total time spent on false alarms is Theta(m*(n-m+1)/q)
- Total running time is Theta(m) + Theta(n-m+1) + Theta(m*(n-m+1)/q)
- If q is chosen larger than m, then the time is Theta(m) + Theta(n-m+1) = Theta(n + m)