Thursday, October 25, 2007

Boyer-Moore algorithm

终于搞懂了,Boyer-Moore这个疯子的算法,据说这人正在Uni of Texas 教书,他的学生肯定也很疯狂,搜了好多中文网站都没有详细解释的,还是英文解释的比较具体形象,图文俱全。

KMP算法,还没找到比较好的解释,还是觉得Lecture notes上的比较具体,包括怎么求next表。希望有朝一日不会忘记!

Boyer-Moore algorithm

German version up

Idea

The algorithm of Boyer and Moore [BM 77] compares the pattern with the text from right to left. If the text symbol that is compared with the rightmost pattern symbol does not occur in the pattern at all, then the pattern can be shifted by m positions behind this text symbol. The following example illustrates this situation.

Example:

0123456789...
abbadabacba
babac










babac

The first comparison d-c at position 4 produces a mismatch. The text symbol d does not occur in the pattern. Therefore, the pattern cannot match at any of the positions 0, ..., 4, since all corresponding windows contain a d. The pattern can be shifted to position 5.

The best case for the Boyer-Moore algorithm is attained if at each attempt the first compared text symbol does not occur in the pattern. Then the algorithm requires only O(n/m) comparisons.

Bad character heuristics

This method is called bad character heuristics. It can also be applied if the bad character, i.e. the text symbol that causes a mismatch, occurs somewhere else in the pattern. Then the pattern can be shifted so that it is aligned to this text symbol. The next example illustrates this situation.

Example:

0123456789...
abbababacba
babac







babac



Comparison b-c causes a mismatch. Text symbol b occurs in the pattern at positions 0 and 2. The pattern can be shifted so that the rightmost b in the pattern is aligned to text symbol b.

Good suffix heuristics

Sometimes the bad character heuristics fails. In the following situation the comparison a-b causes a mismatch. An alignment of the rightmost occurence of the pattern symbol a with the text symbol a would produce a negative shift. Instead, a shift by 1 would be possible. However, in this case it is better to derive the maximum possible shift distance from the structure of the pattern. This method is called good suffix heuristics.

Example:

0123456789...
abaababacba
cabab







cabab



The suffix ab has matched. The pattern can be shifted until the next occurence of ab in the pattern is aligned to the text symbols ab, i.e. to position 2.

In the following situation the suffix ab has matched. There is no other occurence of ab in the pattern.Therefore, the pattern can be shifted behind ab, i.e. to position 5.

Example:

0123456789...
abcababacba
cbaab










cbaab

In the following situation the suffix bab has matched. There is no other occurence of bab in the pattern. But in this case the pattern cannot be shifted to position 5 as before, but only to position 3, since a prefix of the pattern (ab) matches the end of bab. We refer to this situation as case 2 of the good suffix heuristics.

Example:

0123456789...
aabababacba
abbab








abbab


The pattern is shifted by the longest of the two distances that are given by the bad character and the good suffix heuristics.


Reference:


http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bmen.हतं


http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.हतं


No comments: