When are uni-grams more suitable than bi-grams (or higher N-grams)?

I'm reading about n-grams and I'm wondering whether, in practice, there are cases where uni-grams are preferable to bi-grams (or higher N-grams). As I understand it, the larger N is, the greater the complexity of computing the probabilities and building the vector space. But apart from that, are there any other reasons (for example, related to the type of data)?
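
To make the growth of the vector space concrete, here is a minimal sketch (assuming scikit-learn is available; the corpus is a toy example of my own) that counts how many distinct features a bag-of-n-grams representation produces for each n:

```python
# Sketch: the n-gram feature space grows quickly with n, even on a tiny corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog around the mat",
]

for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n))  # only n-grams of order n
    vec.fit(corpus)
    print(f"{n}-grams: {len(vec.vocabulary_)} distinct features")
```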

Usually an n-gram with n greater than 1 is better, because it generally carries more information about the context. That said, unigrams are sometimes computed alongside bigrams and trigrams and used as a fallback for them. Unigrams are also useful when you want high recall rather than precision, for example when you are searching for all possible uses of the verb "make".
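
A minimal sketch of this fallback idea, in plain Python with a hypothetical toy corpus: look up the longest n-gram first and back off to shorter suffixes whenever its count is zero.

```python
# Sketch of a simple count-based backoff: try the trigram, then the bigram,
# then the unigram, returning the first order at which the n-gram was seen.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "make a cake and make a plan to make tea".split()
counts = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

def backoff_count(ngram):
    """Return (order used, count), backing off to shorter suffixes."""
    for n in range(len(ngram), 0, -1):
        sub = tuple(ngram[-n:])          # keep only the last n tokens
        if counts[n][sub] > 0:
            return n, counts[n][sub]
    return 0, 0

print(backoff_count(("to", "make", "tea")))      # found as a trigram
print(backoff_count(("could", "make", "tea")))   # backs off to the bigram "make tea"
print(backoff_count(("they", "often", "make")))  # backs off to the unigram "make"
```

With more training data the higher-order counts fill in and the fallback to unigrams is needed less often.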

Take statistical machine translation as an example: intuitively, the best case is when your model has already seen the whole sentence (say, a 6-gram) and knows its translation as a whole. If that is not the case, you try to split it into smaller n-grams, bearing in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" into German and you have seen the bigram, you know it is a person's name and should be left as it is; but if your model has never seen it, you fall back to unigrams and translate "Tom" and "Green" separately, so "Green" would be rendered as the colour "Grün", and so on.

Likewise, in search, knowing more about the surrounding context makes the results more precise.

This comes down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme case, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n + 1, you will of course have no data points at all, because a sequence of that length simply cannot occur in your data set. The sparser your data, the worse you can model it. For this reason, even though a higher-order n-gram model in theory contains more information about a word's context, it does not generalize easily to other data sets (it overfits), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
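
To see the sparsity concretely, here is a small sketch (plain Python, with a toy text of my own) that measures, for each n, how many distinct n-grams occur and what fraction of them are seen only once:

```python
# Sketch: as n grows, more of the observed n-grams are singletons (seen once),
# which is exactly the sparsity that hurts higher-order models.
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog and the quick dog "
        "jumps over the lazy fox while the brown dog sleeps").split()

for n in (1, 2, 3, 4):
    grams = Counter(tuple(text[i:i + n]) for i in range(len(text) - n + 1))
    singletons = sum(1 for c in grams.values() if c == 1)
    print(f"n={n}: {len(grams)} distinct n-grams, "
          f"{singletons / len(grams):.0%} seen only once")
```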

For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid overfitting, you then get better separability of your data with a higher-order model.
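
As a closing sketch (with a toy training/held-out split of my own choosing), this compares how many unigrams versus bigrams from a held-out sentence were already seen in a small training set, which is the generalization gap described above:

```python
# Sketch: with little training data, held-out unigrams are mostly known,
# while many held-out bigrams are unseen, so a bigram model has less to go on.
from collections import Counter

train = ("i like green tea . i like black tea . "
         "she likes green apples . he makes tea .").split()
test = "she makes black tea .".split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in (1, 2):
    seen = Counter(ngrams(train, n))
    held_out = ngrams(test, n)
    covered = sum(1 for g in held_out if seen[g] > 0)
    print(f"n={n}: {covered}/{len(held_out)} held-out n-grams seen in training")
```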