R inspect() function, from tm package, only returns 10 outputs when using dictionary terms

Question

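I have 70 PDFs of scientific papers that I'm trying to narrow down by looking for specific terms in them, using the dictionary option together with inspect(), which is part of the tm package. My PDFs are stored in a VCorpus object. Here is an example of what my code looks like, using the crude dataset and common terms that should show up in (probably) every example paper in crude: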

library(tm)
data("crude")  # example corpus shipped with tm
output.matrix <- inspect(DocumentTermMatrix(crude,
                                      list(dictionary = c("i","and",
                                                          "all","of",
                                                          "the","if",
                                                          "i'm","looking",
                                                          "for","but","because","has",
                                                          "it","was"))))
output <- data.frame(output.matrix)

This search only returns 10 papers to output.matrix. The result is:

Docs  all and because but for has i i'm the was
  144   0   9       0   5   5   2 0   0  17   1
  236   0   7       4   2   4   5 0   0  15   7
  237   1  11       1   3   3   2 0   0  30   2
  246   0   9       0   0   6   1 0   0  18   2
  248   1   6       1   1   2   0 0   0  27   4
  273   0   5       2   2   4   1 0   0  21   1
  368   0   1       0   1   0   0 0   0  11   2
  489   0   5       0   0   4   0 0   0   8   0
  502   0   6       0   1   5   0 0   0  13   0
  704   0   5       1   0   3   2 0   0  21   0

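For my actual dataset of 70 papers, I know there should be more than 10 matches: when I add more PDFs to my VCorpus that I know contain at least one of my search terms, I still only get 10 in the output. I want the result to be a list, like the one shown, that gives every paper in the VCorpus that contains a term, not just what I assume are the first 10.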

Using R version 4.0.2, macOS High Sierra 10.13.6

Answer 1

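You are misunderstanding what inspect does. For a document term matrix it shows only a sample of at most 10 rows and 10 columns. inspect should only be used to check whether your corpus or document term matrix looks the way you expect, never to transform the data into a data.frame. If you want the data of the document term matrix in a data.frame, the following code does that, using your example code and removing all rows and columns that have no value for any document or term.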

# do not use inspect as this will give a wrong result!
output.matrix <- DocumentTermMatrix(crude,
                                    list(dictionary = c("i","and",
                                                        "all","of",
                                                        "the","if",
                                                        "i'm","looking",
                                                        "for","but","because","has",
                                                        "it","was")))


# remove rows and columns that are all 0, staying inside a sparse matrix for speed
out <- output.matrix[slam::row_sums(output.matrix) > 0,
                     slam::col_sums(output.matrix) > 0]


# transform to data.frame
out_df <- data.frame(docs = row.names(out), as.matrix(out), row.names = NULL)

out_df
   docs all and because but for. has the was
1   127   0   1       0   0    2   0   5   1
2   144   0   9       0   5    5   2  17   1
3   191   0   0       0   0    2   0   4   0
4   194   1   1       0   0    2   0   4   1
5   211   0   2       0   0    2   0   8   0
6   236   0   7       4   2    4   5  15   7
7   237   1  11       1   3    3   2  30   2
8   242   0   3       0   1    1   1   6   1
9   246   0   9       0   0    6   1  18   2
10  248   1   6       1   1    2   0  27   4
11  273   0   5       2   2    4   1  21   1
12  349   0   2       0   0    0   0   5   0
13  352   0   3       0   0    0   0   7   1
14  353   0   1       0   0    2   1   4   3
15  368   0   1       0   1    0   0  11   2
16  489   0   5       0   0    4   0   8   0
17  502   0   6       0   1    5   0  13   0
18  543   0   0       0   0    3   0   5   1
19  704   0   5       1   0    3   2  21   0
20  708   0   0       0   0    0   0   0   1
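
If what you need is the list of papers rather than the full count table, the same matrix can be queried directly. The following is a minimal sketch under a few assumptions: it uses the crude corpus again, a shortened dictionary chosen only for illustration, and variable names (`terms`, `matching_docs`, `docs_by_term`, `counts_df`) that are not part of the original code.

```r
library(tm)
data("crude")

# Illustrative dictionary (an assumption; use your own search terms here)
terms <- c("and", "because", "for", "the")
dtm <- DocumentTermMatrix(crude, list(dictionary = terms))

# IDs of all documents that contain at least one dictionary term
matching_docs <- Docs(dtm)[slam::row_sums(dtm) > 0]

# For each term, the IDs of the documents in which it occurs
m <- as.matrix(dtm)
docs_by_term <- lapply(colnames(m), function(term) rownames(m)[m[, term] > 0])
names(docs_by_term) <- colnames(m)

# Full count table; check.names = FALSE keeps the column name "for"
# instead of letting data.frame() rename it to "for."
counts_df <- data.frame(docs = rownames(m), m,
                        check.names = FALSE, row.names = NULL)

matching_docs
docs_by_term[["because"]]
```

The `for.` column in the output above comes from data.frame() making names syntactically valid, since `for` is a reserved word in R. Passing `check.names = FALSE` avoids the renaming, at the cost of needing backticks or `[["for"]]` to access that column.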

