How should <s> and </s> be handled when computing a unigram LM?

nlp
language-model

I am a beginner in NLP and I am confused about how to handle the <s> and </s> symbols when computing counts for a unigram model. Should I count them or ignore them?

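If I understand correctly, <s> and </s> denote special (fake) unigrams placed at the start and end of each text (strictly speaking, before the first real unigram and after the last one). In that case they are not needed for a unigram model: every string contains them, so they provide no additional information.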
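As a small illustration of that point, here is a minimal sketch that counts unigrams with and without the markers. The function name `unigram_counts`, the `with_markers` flag, and the toy corpus are all made up for this example and are not from the question.

```python
from collections import Counter

def unigram_counts(sentences, with_markers=False):
    """Count unigrams over tokenized sentences (hypothetical helper)."""
    counts = Counter()
    for tokens in sentences:
        if with_markers:
            # Assumption: exactly one <s> and one </s> per sentence.
            tokens = ["<s>"] + list(tokens) + ["</s>"]
        counts.update(tokens)
    return counts

# Toy corpus of two tokenized sentences (made up for illustration).
corpus = [["hello", "world"], ["hello"]]
plain = unigram_counts(corpus)
marked = unigram_counts(corpus, with_markers=True)

# The marker counts always equal the number of sentences,
# whatever the sentences contain.
print(marked["<s>"], marked["</s>"], len(corpus))
# The counts of the real words are the same either way.
print(plain["hello"], marked["hello"])
```

Under these assumptions the marker counts only mirror the sentence count, while the counts of the real words do not change.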

These special unigrams are useful for higher-order n-grams. For example, they make it possible to extract 2 bigrams from a 1-word string such as hello: <s> hello and hello </s>, or 3 trigrams: <s0> <s1> hello, <s1> hello </s1>, hello </s1> </s0>.
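The padding described above can be sketched as follows. This is only an illustration: `padded_ngrams` is a hypothetical helper, and for simplicity it repeats the same <s> and </s> tokens rather than using indexed markers like <s0> and <s1>.

```python
def padded_ngrams(tokens, n, start="<s>", end="</s>"):
    """Pad with n-1 boundary markers on each side, then list all n-grams."""
    pad = n - 1  # a unigram model (n = 1) gets no padding at all
    padded = [start] * pad + list(tokens) + [end] * pad
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

bigrams = padded_ngrams(["hello"], 2)
trigrams = padded_ngrams(["hello"], 3)
unigrams = padded_ngrams(["hello"], 1)

print(bigrams)   # 2 bigrams for the 1-word string
print(trigrams)  # 3 trigrams for the 1-word string
print(unigrams)  # only the word itself, no markers
```

With n - 1 markers on each side, the n = 1 case adds no markers, which matches the view that a unigram model does not need them.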
