CS 383, Algorithms
String matching: the Rabin-Karp algorithm


A new algorithmic challenge: evaluating a polynomial

A polynomial is a linear combination of non-negative powers of x:

    p(x) = c0 + c1*x + c2*x^2 + ... + cn*x^n

The highest power n is called the degree of the polynomial. Polynomials are useful in many places, including:
- string matching
- signal processing
- encryption

Consider the task of evaluating a given polynomial at a given point x.
- Inputs: a polynomial p(x) = c0 + c1*x + c2*x^2 + ... + cn*x^n (given as an array of coefficients, for example), and a specific numerical value x = x*
- Outputs: the value p(x*) of the polynomial p(x) at the value x = x*

Efficiency of algorithms for polynomial evaluation

The polynomial evaluation task is not at all difficult to address. Here's an obvious way:

    total = 0
    for (i=0; i<=n; i++)
        total += ci*(x to the power i)
    return total

What's the running time of this algorithm? That depends on what the basic steps are; let's assume that addition and multiplication are basic steps that can be carried out in time O(1). The big-Theta running time is dominated by the time for the for loop:

    time = sum from i=0 to i=n of (time for i-th pass)

The i-th pass involves:
- one assignment
- one addition
- one multiplication
- one computation of "x to the power i"

(A runnable sketch of this algorithm appears below.)
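As a concrete illustration, here is a minimal C++ sketch of the obvious algorithm; the coefficient array c[0..n] and the function names are ours, not part of the notes, and the power helper is written out so that its cost is visible:

    #include <cstdio>

    // Naive power: computes x^i by repeated multiplication.
    double naive_power(double x, int i) {
        double result = 1.0;
        for (int k = 0; k < i; k++)
            result *= x;
        return result;
    }

    // Obvious evaluation: the i-th pass costs Theta(i), so the total is Theta(n^2).
    double eval_naive(const double c[], int n, double x) {
        double total = 0.0;
        for (int i = 0; i <= n; i++)
            total += c[i] * naive_power(x, i);
        return total;
    }

    int main() {
        double c[] = {1.0, 2.0, 3.0};                 // p(x) = 1 + 2x + 3x^2
        std::printf("%g\n", eval_naive(c, 2, 4.0));   // p(4) = 57
    }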

Evaluating the i-th power of x

Powers must be calculated using multiplication. The obvious technique requires i-1 multiplications:

    x^i = x*x*x*...*x   (i copies of x, i-1 multiplications)

If we use it, the i-th pass through the polynomial evaluation loop requires time Theta(i), and the grand total for the entire polynomial evaluation algorithm is

    sum from i=0 to i=n of Theta(i) = Theta(n^2)

A better way

You can reuse x^i when computing x^(i+1). Like this:

    total = 0
    power = 1
    for (i=0; i<=n; i++) {
        total += ci*power
        power *= x
    }
    return total

The i-th pass requires:
- one assignment
- one addition
- two multiplications

Total time for this algorithm is Theta(n). Simple idea, nice efficiency improvement! We'll have occasion to use similar ideas again later.
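A minimal C++ version of the same loop, assuming the same hypothetical coefficient-array representation as in the earlier sketch:

    // Linear-time evaluation: carry the running power across passes.
    double eval_running_power(const double c[], int n, double x) {
        double total = 0.0;
        double power = 1.0;        // invariant: power == x^i at the top of pass i
        for (int i = 0; i <= n; i++) {
            total += c[i] * power;
            power *= x;            // x^(i+1) from x^i: one multiplication
        }
        return total;
    }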

An even better way

We're not going to do better than Theta(n), but it's actually possible to reduce the constant. The idea is known as Horner's rule, which rests on the nested form p(x) = c0 + x*(c1 + x*(c2 + ... + x*cn)). It evaluates starting with the highest power:

    total = cn
    for (i=n-1; i>=0; i--)
        total = ci + x*total
    return total

The i-th pass requires:
- one assignment
- one addition
- one multiplication (down from two!)

Total time is still Theta(n), but we've improved the constant.
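A C++ sketch of Horner's rule under the same assumptions; note the single multiply-add per pass:

    // Horner's rule: evaluates p(x) = c0 + x*(c1 + x*(c2 + ... + x*cn))
    // with one addition and one multiplication per coefficient.
    double eval_horner(const double c[], int n, double x) {
        double total = c[n];
        for (int i = n - 1; i >= 0; i--)
            total = c[i] + x * total;
        return total;
    }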

String matching

String matching is the task of finding occurrences of a given "pattern" string within a target "text" string.
- Inputs: a text string (length n) and a pattern string (length m)
- Outputs: all indices ("shifts") s such that the substring text[s+1...s+m] matches the pattern string exactly

String matching is useful in many contexts, including:
- web search
- bioinformatics

The obvious ("naive") string matching algorithm

Just compare each possible length-m substring of the text string with the target pattern string:

    for (s=0; s<=n-m; s++)
        if (text[s+1...s+m] == pattern[1...m])
            print s

Simple enough, and works correctly. How efficient is it? We'll seek to express the running time in terms of n and m:
- n-m+1 passes are made through the loop
- the s-th pass does up to m symbol comparisons and possibly one print statement
- total worst-case time is Theta((n-m+1)*m)
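A minimal C++ rendering of the naive matcher (using 0-based shifts, so the window at shift s is text[s..s+m-1] rather than the notes' 1-based text[s+1...s+m]):

    #include <cstdio>
    #include <string>

    // Naive matcher: try every shift s and compare character by character.
    void naive_match(const std::string& text, const std::string& pattern) {
        int n = (int)text.size(), m = (int)pattern.size();
        for (int s = 0; s + m <= n; s++)
            if (text.compare(s, m, pattern) == 0)   // up to m symbol comparisons
                std::printf("match at shift %d\n", s);
    }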

Can we do better?

Shifting the text string over by 1 every time can be wasteful.
- E.g., suppose that the target pattern is TAAATA
- If there is a match starting at position 10 of the text string, then position 11 can't possibly work (the symbol there is A, but the pattern starts with T); in fact, the next feasible starting position would be 14 in this case
- This suggests that there may be ways to speed up string matching
- We'll start with an alternate technique that doesn't beat the running time of the naive algorithm in the worst case, but that does significantly better on average

Preamble: coding strings as numbers

Decimal notation uses strings to represent numbers: the string dn ... d0 represents the number dn*10^n + ... + d1*10 + d0. Given any "alphabet" of possible "digits" (like 0...9 in the case of decimal notation), we can associate a number with a string of symbols from that alphabet:
- Let d be the total number of symbols in the alphabet
- Order the symbols in some way, as a0 ... a(d-1)
- Associate to each symbol a(k) the value k
- Then view each string w[1...m] over this alphabet as corresponding to a number in "base d":

      value(w[1])*d^(m-1) + ... + value(w[m-1])*d + value(w[m])

For example, if d=10 (10 symbols in the alphabet), this is just the ordinary decimal value of the digit string. As another example, for the alphabet {A, T, C, G}, with digit values

    value(A) = 0
    value(T) = 1
    value(C) = 2
    value(G) = 3

the string TAC corresponds to the base-4 numeral 102, i.e., the number 1*4^2 + 0*4 + 2 = 18.
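As an illustration, here is a small C++ helper computing this base-d value for the 4-symbol alphabet above (the function names are ours, purely for illustration):

    #include <string>

    // Digit values for the 4-symbol alphabet {A, T, C, G}.
    long long digit_value(char ch) {
        switch (ch) {
            case 'A': return 0;
            case 'T': return 1;
            case 'C': return 2;
            case 'G': return 3;
        }
        return -1;   // not in the alphabet
    }

    // Base-4 value of a DNA string: shift left one base-4 digit, add the new digit.
    long long string_value(const std::string& w) {
        long long v = 0;
        for (char ch : w)
            v = 4 * v + digit_value(ch);
        return v;    // string_value("TAC") == 1*16 + 0*4 + 2 == 18
    }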

The Rabin-Karp algorithm

What is to be gained by the above complicated numerical encoding? The answer is that it may allow faster string matching.
- Instead of matching the pattern with shifted text substrings symbol by symbol, we match their numerical values
- The numerical value of the target pattern can be precomputed in time Theta(m), where m = length of pattern string
- In the loop, the numerical value of each text shift is computed and compared with the numerical value of the target pattern
- If the values of all the shifts could be computed in time Theta(n-m+1), then the total running time would be

      Theta(m) + Theta(n-m+1) = Theta(n)

  which would be great! (much better than the "naive" algorithm)

The trick for computing all the shifted values

Our hopeful reasoning above hinges on being able to compute the numerical values of all length-m shifts of text in time Theta(n-m+1). How can we do this? Think of decimal strings, e.g.:

    783452936

- Suppose we're interested in substrings of length m=5
- The first shift has value 78345
- The second shift has value 83452
- If we have to compute the second shift from scratch, it will take us a long time (how long?)
- Fortunately, we can use the value of the first shift to get started:

      83452 = 10*(78345 - 70000) + 2

In general, we can compute

    value(text[s+2...s+m+1]) = 10*(value(text[s+1...s+m]) - 10^(m-1)*value(text[s+1])) + value(text[s+m+1])

This can be done in constant time O(1)! (A small code sketch of this sliding-window update follows.)
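A small C++ demonstration of the sliding-window update on the decimal example above; this sketch assumes the window values fit in a 64-bit integer (the difficulty taken up two sections below):

    #include <cstdio>
    #include <string>

    // Prints the values of all length-m windows of a decimal string,
    // using the constant-time update rather than recomputing from scratch.
    int main() {
        std::string text = "783452936";
        int m = 5, n = (int)text.size();

        long long high = 1;                      // high = 10^(m-1)
        for (int i = 0; i < m - 1; i++) high *= 10;

        long long t = 0;                         // value of the first window
        for (int i = 0; i < m; i++) t = 10 * t + (text[i] - '0');

        for (int s = 0; s + m <= n; s++) {
            std::printf("shift %d: %lld\n", s, t);   // 78345, 83452, 34529, ...
            if (s + m < n)                            // drop leading digit, append next
                t = 10 * (t - (text[s] - '0') * high) + (text[s + m] - '0');
        }
    }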

The pre-Rabin-Karp algorithm

    compute p = value(pattern)        // time Theta(m)
    compute t = value(text[1...m])    // time Theta(m)
    for (s=0; s<=n-m; s++) {
        if (t == p)
            print s
        if (s < n-m)
            t = 10*(t - value(text[s+1])*10^(m-1)) + value(text[s+m+1])
    }

According to the above analysis, this algorithm runs in time Theta(n). However, we'll need to address a difficulty.

Running time of the pre-Rabin-Karp algorithm

We argued that the above algorithm should run in time Theta(n):
- time Theta(m) for preprocessing (polynomial evaluation)
- n-m+1 passes through the for loop
- time Theta(1) on each pass
- total time = Theta(m) + Theta(n-m+1) = Theta(n)

However, there is a difficulty: the numerical values can be outside the standard integer or long integer ranges unless the alphabet is very small and the pattern string is short. For instance, even over the 4-symbol alphabet {A, T, C, G}, a pattern of just 32 symbols produces values as large as 4^32 - 1, which already exceeds the range of a signed 64-bit integer.

The modular remainder trick

The actual Rabin-Karp algorithm deals with the range problem by working with remainders modulo a preselected prime number q:
- p is the value of the pattern mod q
- t is the value of the shifted text mod q

For example, if q=11 and if

    value(pattern) = 17
    value(text[s+1...s+m]) = 25

then the algorithm actually uses the remainders 6 and 3 instead. Notice that use of the remainders means that equality of p and t no longer implies equality of the underlying strings. Hence, p==t is only a "tentative match" that must be verified by comparing the actual strings.

Rabin-Karp pseudocode

Assumes an alphabet with d symbols instead of only 10:

    h = power(d, m-1) % q
    p = value(pattern) % q
    t = value(text[1...m]) % q
    for (s=0; s <= n-m; s++) {
        if (p == t)
            if (pattern == text[s+1...s+m])   // verify the tentative match
                print s
        if (s < n-m) {
            t = (d*(t - value(text[s+1])*h) + value(text[s+m+1])) % q
            if (t < 0) t += q    // keep the remainder non-negative (needed if % can return negatives)
        }
    }

(A runnable C++ version appears below.)
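Here is a runnable C++ sketch of the full algorithm, specialized to the hypothetical 4-symbol DNA alphabet used earlier (0-based shifts; the prime q = 101 is an arbitrary illustrative choice, small enough that d*q fits comfortably in 64 bits):

    #include <cstdio>
    #include <string>

    long long digit_value(char ch) {   // A=0, T=1, C=2, G=3
        switch (ch) { case 'A': return 0; case 'T': return 1;
                      case 'C': return 2; case 'G': return 3; }
        return -1;
    }

    void rabin_karp(const std::string& text, const std::string& pattern, long long q) {
        const long long d = 4;
        int n = (int)text.size(), m = (int)pattern.size();
        if (m > n) return;

        long long h = 1;                          // h = d^(m-1) mod q
        for (int i = 0; i < m - 1; i++) h = (h * d) % q;

        long long p = 0, t = 0;                   // pattern value and first window value, mod q
        for (int i = 0; i < m; i++) {
            p = (d * p + digit_value(pattern[i])) % q;
            t = (d * t + digit_value(text[i])) % q;
        }

        for (int s = 0; s + m <= n; s++) {
            if (p == t && text.compare(s, m, pattern) == 0)   // verify tentative match
                std::printf("match at shift %d\n", s);
            if (s + m < n) {
                t = (d * (t - digit_value(text[s]) * h) + digit_value(text[s + m])) % q;
                if (t < 0) t += q;                // C++ % can be negative
            }
        }
    }

    int main() {
        rabin_karp("GTAACTAACG", "TAAC", 101);    // prints shifts 1 and 5
    }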

Running time of the Rabin-Karp algorithm

- Theta(m) for preprocessing (computation of powers mod q, and of p and the initial t)
- n - m + 1 passes through the for loop
- Theta(1) time for the initial mod-q comparison, but Theta(m) time for full verification of each tentative match
- Worst-case time: Theta(m) + Theta((n-m+1)*m) = Theta((n-m+1)*m) (same as the obvious algorithm that checks all text shifts)
- Best-case time: Theta(m) + Theta(n-m+1) = Theta(n) if there are no tentative matches

Average case running time of Rabin-Karp

What happens "on average"?
- "True alarms" (t==p and equal strings) occur a certain number of times
  - call the total number of true alarms "tac" (true alarm count)
  - total time spent on true alarms is tac*Theta(m), which is Theta(m) if tac = O(1)
- "False alarms" (t==p but unequal strings) occur a fraction of the time
  - an educated guess is based on assuming that the mod q remainders are equally likely to be any number between 0 and q-1, so that a false alarm occurs about 1/q of the time
  - total time spent on false alarms is Theta(m*(n-m+1)/q)
- Total running time is Theta(m) + Theta(n-m+1) + Theta(m*(n-m+1)/q)
- If q is chosen larger than m, then the time is Theta(m) + Theta(n-m+1) = Theta(n + m)