使用布隆过滤器的断字算法

Hyphenation algorithm using Bloom filter

Bloom 过滤器大放异彩的经典示例是 hyphenation algorithms. It's even the example given in the original paper on Bloom filters

我不明白如何在断字算法中使用布隆过滤器。

断字算法被定义为接受一个输入词并返回该词可以被断字的可能方式的东西。

Bloom 过滤器是否同时包含 hyph-enationhyphena-tion,并且客户端代码会查询 h-yphenationhy-phenationhyp-henation 的过滤器。 ..?

原文是这样说的:

Analysis of Hyphenation Sample Application

[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.

对于这种情况,

  • 字典存储在磁盘上,包含所有连字符正确的单词,
  • 布隆过滤器只包含需要特殊连字符的键,例如也许 hyphenation 本身,
  • 布隆过滤器响应 "probably" 或 "no"。

那么查找单词可能连字的算法是:

word = "hyphenation"; (or some other word)
x = bloomFilter.probablyContains(word);
if (x == "probably") {
    lookupInDictionary(word).getHypenation();
} else {
    // x == "no" case
    useSimpleRuleBasedHypenation(word);
}

如果布隆过滤器以 "probably" 响应,则算法必须在字典中读取磁盘。

如果实际上没有特殊规则,Bloom 过滤器有时会响应 "probably",在这种情况下磁盘 I/O 是不必要的。但只要这种情况不会经常发生就可以了(误报率很低,例如 1/16)。

Bloom 过滤器,因为它没有假阴性,永远不会用 "no" 响应 特殊连字符的情况。