Why can't I get number of bigrams = number_of_words - 1?

Question

I am writing an R script to find bigrams.

I have a string of 4157 words.

Now, using stylo, I collect the bigrams into a vector as follows.

library(stylo)

allBi <- txt.to.words(myLines)
myBigrams <- make.ngrams(allBi, ngram.size = 2)

That returns only 4120 bigrams. What is the problem?

Answer 1

The problem is that you didn't run any tests to figure out what's going on.

From the tests below, it appears that one (or more) of the 4,157 entries in myLines doesn't actually contain a "word" as the stylo package sees words:

library(stylo)

This file has 235,886 legitimate words on my OS X system:

words <- readLines("/usr/share/dict/words")

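Now, run a test to see whether anything related to vector length affects make.ngrams or (more likely) txt.to.words. Note: I didn't want to wait the couple of minutes make.ngrams would take to work through sequences all the way up to 235,886, so I capped it at 20,000, which is well above your 4,120: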

all(sapply(seq(from=2, to=20000, by=100), function(i) {
  return(i - length(make.ngrams(txt.to.words(words[1:i]), ngram.size=2))==1)
}))
# [1] TRUE

So, it's not a vector size problem. Could it be that some entries in the vector contain no actual words? Let's test:

# inject some badness
words[4] <- sprintf("  , %s - ", words[4])
words[30] <- "//"
words[900] <- "-1--1-"
words[4000]  <- ".."

Try again:

all(sapply(seq(from=2, to=20000, by=100), function(i) {
  return(i - length(make.ngrams(txt.to.words(words[1:i]), ngram.size=2))==1)
}))
# [1] FALSE

Let's see what txt.to.words does with the actual "badness":

txt.to.words(words[c(4, 30, 900, 4000)])
# [1] "aal"

Use this to find the entries in words that contain no letters:

which(grepl("^[^[:alpha:]]+$", words))
# [1]   30  900 4000
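To make the arithmetic concrete, here is a minimal base-R sketch that does not use stylo at all. The `entries` vector is a hypothetical stand-in for myLines, and "contains at least one letter" is only an assumed approximation of what txt.to.words treats as a word. It shows that n surviving tokens give n - 1 bigrams, so every entry that yields no token removes one bigram from the expected count:

```r
# Hypothetical stand-in for myLines: four real words, three letter-free entries
entries <- c("alpha", "//", "beta", "-1--1-", "gamma", "..", "delta")

# Assumption: an entry only counts as a word if it has at least one letter
has_letters <- grepl("[[:alpha:]]", entries)
tokens <- entries[has_letters]

# Pair each token with its successor to form bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

length(entries)      # 7 entries went in
length(tokens)       # 4 tokens survived
length(bigrams)      # 3 bigrams, i.e. length(tokens) - 1
which(!has_letters)  # positions of the entries that produced no token
```

Applied to the question's numbers, 4120 bigrams would correspond to 4121 tokens rather than 4157, which is consistent with some entries being dropped during tokenization.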

Testing FTW (and it's not much work to actually run a few tests when things aren't working as expected).
