# String Matching

Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.

We formalize the * string-matching problem* as follows. We assume that the text is an array

*T*[1 . .

*n*] of length

*n*and that the pattern is an array

*P*[1 . .

*m*] of length

*m*. We further assume that the elements of

*P*and

*T*are characters drawn from a finite alphabet . For example, we may have = {0, 1} or = {a, b, . . . , z}. The character arrays

*P*and

*T*are often called

*of characters.*

**strings**We say that pattern *P* **occurs with shift*** s* in text

*T*(or, equivalently, that pattern

*P*

**occurs beginning at position s****+ 1**in text

*T*) if 0

*s*

*n*–

*m*and

*T*[

*s*+ 1 . .

*s*+

*m*] =

*P*[1 . .

*m*] (that is, if

*T*[

*s*+

*j*] = P[

*j*], for 1

*j*

*m*). If

*P*occurs with shift

*s*in

*T*, then we call

*s*a

*; otherwise, we call*

**valid shift***s*an

*. The string-matching problem is the problem of finding all valid shifts with which a given pattern*

**invalid shift***P*occurs in a given text

*T*. Figure 34.1 illustrates these definitions.

This chapter is organized as follows. In Section 34.1 we review the naive brute-force algorithm for the string-matching problem, which has worst-case running time *O*((*n* – *m* + 1)*m*). Section 34.2 presents an interesting string-matching algorithm, due to Rabin and Karp. This algorithm also has worst-case running time *O*((*n* – *m* + 1)*m*), but it works much better on average and in practice. It also generalizes nicely to other pattern-matching problems. Section 34.3 then describes a string-matching algorithm that begins by constructing a finite automaton specifically designed to search for occurrences of the given pattern *P* in a text. This algorithm runs in time *O*(*n* + *m* ). The similar but much cleverer Knuth-Morris-Pratt (or KMP) algorithm is presented in Section 34.4; the KMP algorithm runs in time *O*(*n* + *m*). Finally, Section 34.5 describes an algorithm due to Boyer and Moore that is often the best practical choice, although its worst-case running time (like that of the Rabin-Karp algorithm) is no better than that of the naive string-matching algorithm.

#### Figure 34.1 The string-matching problem. The goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the matching character in the text, and all matched characters are shown shaded.

Notation and terminology

We shall let * (read “sigma-star”) denote the set of all finite-length strings formed using characters from the alphabet . In this chapter, we consider only finite-length strings. The zero-length * empty string*, denoted , also belongs to *. The length of a string

*x*is denoted |

*x*|. The

**concatenation****of two strings**

*x*and

*y*, denoted

*xy*, has length |

*x*| + |

*y*| and consists of the characters from

*x*followed by the characters from

*y*.

We say that a string *w* is a * prefix* of a string

*x*, denoted for some string

*y**. Note that if , then |

*w*| |

*x*|. Similarly, we say that a string

*w*is a

*of a string*

**suffix***x,*denoted for some

*y**. It follows from that |

*w*| |

*x*|. The empty string is both a suffix and a prefix of every string. For example, we have ab abcca and cca abcca. It is useful to note that for any strings

*x*and

*y*and any character

*a*, we have if and only if . Also note that are transitive relations. The following lemma will be useful later.

Lemma 34.1

Suppose that *x*, *y*, and *z* are strings such that . If |*x*| |*y*|, then . If |*x*| |*y*|, then . If |*x*| = |*y*|, then *x* = *y*.

* Proof *See Figure 34.2 for a graphical proof.

For brevity of notation, we shall denote the *k*-character prefix *P*[1 . . *k*] of the pattern *P*[1 . . *m*] by *P _{k}*. Thus,

*P*

_{0}= and

*P*=

_{m}*P*=

*P*[1 . .

*m*]. Similarly, we denote the

*k*-character prefix of the text

*T*as

*T*. Using this notation, we can state the string-matching problem as that of finding all shifts

_{k}*s*in the range 0

*s*

*n*–

*m*such that .

In our pseudocode, we allow two equal-length strings to be compared for equality as a primitive operation. If the strings are compared from left to right and the comparison stops when a mismatch is discovered, we assume that the time taken by such a test is a linear function of the number of matching characters discovered. To be precise, the test “*x* = *y*” is assumed to take time (*t *+ 1), where *t* is the length of the longest string *z* such that .