Find multi-word strings in more than one document

Question

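To find frequent terms or phrases in a document, one can use tf (term frequency).

If we know that a text contains certain specific expressions, but we don't know their length or have any other information about them, is there a way to find them? Example: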

df <- data.frame(text = c("Introduction Here you see something Related work another info here", "Introduction another text Background work something to now"))

Suppose the expected terms are Introduction, Related work and Background work, but we don't know in advance which phrases they are. How can we find them?

Answer 1

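What you need here is a method for detecting collocations, which quanteda fortunately provides in the form of textstat_collocations(). Once you have detected them, you can compound the matching tokens into a single "token" and then get their frequencies in the standard way.

You don't need to know the length in advance, but you do need to specify a range. Below, I've added some more text and used a size range from 2 to 3. This also picks up "criminal background checks" without confusing the term "background" with its use in the phrase "background work". (By default, detection is case-insensitive.)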

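library("quanteda")
## Package version: 2.1.0
## Note: from quanteda 3.0 onward, the textstat_*() functions live in the
## separate quanteda.textstats package, which must be loaded as well.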

text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)

colls <- textstat_collocations(text, size = 2:3)
colls
##                  collocation count count_nested length    lambda          z
## 1        criminal background     2            2      2  4.553877  2.5856967
## 2          background checks     2            2      2  4.007333  2.3794386
## 3               related work     2            2      2  2.871680  2.3412833
## 4            background work     2            2      2  2.322388  2.0862256
## 5 criminal background checks     2            0      3 -1.142097 -0.3426584

Here we can see that the phrases are being detected and distinguished. Now we can use tokens_compound() to join them:

toks <- tokens(text) %>%
  tokens_compound(colls, concatenator = " ")

dfm(toks) %>%
  dfm_trim(min_termfreq = 2) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency()
##                      feature frequency rank docfreq group
## 1               introduction         2    1       2   all
## 2                  something         2    1       2   all
## 3                    another         2    1       2   all
## 4               related work         2    1       2   all
## 5            background work         2    1       2   all
## 6 criminal background checks         2    1       2   all
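
As a rough cross-check that does not depend on quanteda, the sketch below counts in base R how many documents each 2- and 3-word sequence appears in. It is not how textstat_collocations() works (no lambda or z statistic is computed), and the helper name ngrams() is hypothetical, introduced only for this illustration.

```r
# Naive cross-check in base R: which 2- and 3-word sequences occur in
# more than one document? No association statistic is computed here.
text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)

# Hypothetical helper: all contiguous n-word sequences in one document
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Lowercase (to mimic case-insensitive matching) and split on whitespace
words_by_doc <- strsplit(tolower(text), "\\s+")

# Unique n-grams per document, so each document counts at most once
grams_by_doc <- lapply(words_by_doc, function(w) {
  unique(unlist(lapply(2:3, function(n) ngrams(w, n))))
})

# Document frequency of every n-gram; keep those seen in 2+ documents
doc_freq <- table(unlist(grams_by_doc))
shared <- sort(names(doc_freq[doc_freq >= 2]))
shared
```

On this sample text, the shared sequences are the same five phrases listed in the collocation table above. On real data a plain document-frequency count would also surface many uninteresting sequences of function words, which is why a scored method such as textstat_collocations() is usually preferable.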

r

quanteda