When are uni-grams more suitable than bi-grams (or higher N-grams)?

I'm reading about n-grams and I'm wondering whether, in practice, there are cases where uni-grams are preferable to bi-grams (or higher N-grams). As I understand it, the larger N is, the greater the complexity of computing the probabilities and building the vector space. But apart from that, are there any other reasons (for example, related to the type of data)?
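
To make the growth of the vector space concrete, here is a minimal sketch (assuming scikit-learn is available; the corpus is a toy example of my own) that counts how many distinct features a bag-of-n-grams representation produces for each n:

```python
# Sketch: the n-gram feature space grows quickly with n, even on a tiny corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog around the mat",
]

for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n))  # only n-grams of order n
    vec.fit(corpus)
    print(f"{n}-grams: {len(vec.vocabulary_)} distinct features")
```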

Usually an n-gram with n greater than 1 is better, because it generally carries more information about the context. That said, unigrams are sometimes computed alongside bigrams and trigrams and used as a fallback for them. Unigrams are also useful when you want high recall rather than precision, for example when you are searching for all possible uses of the verb "make".
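
A minimal sketch of this fallback idea, in plain Python with a hypothetical toy corpus: look up the longest n-gram first and back off to shorter suffixes whenever its count is zero.

```python
# Sketch of a simple count-based backoff: try the trigram, then the bigram,
# then the unigram, returning the first order at which the n-gram was seen.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "make a cake and make a plan to make tea".split()
counts = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

def backoff_count(ngram):
    """Return (order used, count), backing off to shorter suffixes."""
    for n in range(len(ngram), 0, -1):
        sub = tuple(ngram[-n:])          # keep only the last n tokens
        if counts[n][sub] > 0:
            return n, counts[n][sub]
    return 0, 0

print(backoff_count(("to", "make", "tea")))      # found as a trigram
print(backoff_count(("could", "make", "tea")))   # backs off to the bigram "make tea"
print(backoff_count(("they", "often", "make")))  # backs off to the unigram "make"
```

With more training data the higher-order counts fill in and the fallback to unigrams is needed less often.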

Take statistical machine translation as an example: intuitively, the best case is when your model has already seen the whole sentence (say, a 6-gram) and knows its translation as a whole. If that is not the case, you try to split it into smaller n-grams, bearing in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" into German and you have seen the bigram, you know it is a person's name and should be left as it is; but if your model has never seen it, you fall back to unigrams and translate "Tom" and "Green" separately, so "Green" would be rendered as the colour "Grün", and so on.

Likewise, in search, knowing more about the surrounding context makes the results more precise.

This comes down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme case, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n + 1, you will of course have no data points at all, because a sequence of that length simply cannot occur in your data set. The sparser your data, the worse you can model it. For this reason, even though a higher-order n-gram model in theory contains more information about a word's context, it does not generalize easily to other data sets (it overfits), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
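
To see the sparsity concretely, here is a small sketch (plain Python, with a toy text of my own) that measures, for each n, how many distinct n-grams occur and what fraction of them are seen only once:

```python
# Sketch: as n grows, more of the observed n-grams are singletons (seen once),
# which is exactly the sparsity that hurts higher-order models.
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog and the quick dog "
        "jumps over the lazy fox while the brown dog sleeps").split()

for n in (1, 2, 3, 4):
    grams = Counter(tuple(text[i:i + n]) for i in range(len(text) - n + 1))
    singletons = sum(1 for c in grams.values() if c == 1)
    print(f"n={n}: {len(grams)} distinct n-grams, "
          f"{singletons / len(grams):.0%} seen only once")
```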

For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid overfitting, you then get better separability of your data with a higher-order model.
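
As a closing sketch (with a toy training/held-out split of my own choosing), this compares how many unigrams versus bigrams from a held-out sentence were already seen in a small training set, which is the generalization gap described above:

```python
# Sketch: with little training data, held-out unigrams are mostly known,
# while many held-out bigrams are unseen, so a bigram model has less to go on.
from collections import Counter

train = ("i like green tea . i like black tea . "
         "she likes green apples . he makes tea .").split()
test = "she makes black tea .".split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in (1, 2):
    seen = Counter(ngrams(train, n))
    held_out = ngrams(test, n)
    covered = sum(1 for g in held_out if seen[g] > 0)
    print(f"n={n}: {covered}/{len(held_out)} held-out n-grams seen in training")
```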