quanteda::dfm_lookup(): capture the matched terms

I want to run quanteda's excellent dfm_lookup() with a dictionary, but I also want to retrieve the terms that matched.

Consider the following example:

library("quanteda")

dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                           opposition = c("Opposition", "reject", "notincorpus"),
                           taxglob = "tax*",
                           taxregex = "tax.+$",
                           country = c("United_States", "Sweden")))
toks_ex <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                    "Does the United_States or Sweden have more progressive taxation?"))
# the remove argument of dfm() is deprecated in quanteda 3, so drop stopwords with dfm_remove()
dfmat_ex <- dfm_remove(dfm(toks_ex), stopwords("english"))

dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)

This gives me:

Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas opposition taxglob taxregex country
  text1         1          1       1        0       0
  text2         0          0       1        0       2

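However, since each dictionary key has several entries, I would also like to know which token produced each match. (My real dictionary is very long, so this example may look trivial, but the real use case is not.)

I would like to end up with something like this: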

Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country         country.match
  text1         1       Christmas          1       Opposition       1           tax        0             NA       0                    NA
  text2         0              NA          0               NA       1      taxation        0             NA       2 United_States, Sweden

Can anyone help me with this? Thanks in advance! :)

This is not really possible, for two reasons.

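First, a matrix(-like) object (a dfm or otherwise) cannot mix element modes, and here you would have a mix of counts and character values. That would be possible in a data.frame, but you would lose the advantages of sparsity, and you would end up with a data.frame of dimensions n x 2*V (where n = number of documents and V = number of features).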
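To illustrate the first point, here is a minimal base R sketch (the vectors `counts` and `matches` are made up for illustration, not taken from the dfm above). Binding a count column and a character column into one matrix silently coerces everything to character:

```r
# Hypothetical values: one count and one matched term per document
counts  <- c(text1 = 1, text2 = 0)
matches <- c(text1 = "Christmas", text2 = NA)

# A matrix holds a single mode, so the numeric counts become strings
m <- cbind(christmas = counts, christmas.match = matches)
typeof(m)
m
```

The counts come back as "1" and "0", so they can no longer be summed or weighted without converting them again.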

Second, "christmas.match" could have several features/tokens matching it, so each character value would need to be a list, which strains the object class even further.
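As a rough sketch of what the second point implies, a data.frame can store a variable number of matches per document only through a list-column. The object `res` and its values below are hypothetical and only mimic the desired output in the question:

```r
# Hypothetical result table: a count column plus a list-column of matched tokens
res <- data.frame(doc = c("text1", "text2"), country = c(0, 2))
res$country.match <- list(character(0), c("United_States", "Sweden"))

# Each cell of the list-column can hold zero, one, or many matches
lengths(res$country.match)
str(res)
```

This works, but it is a dense, ordinary data.frame rather than a dfm, so none of the dfm_*() functions apply to it.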

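A better approach is to use kwic() to match the tokens against the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary itself as the pattern argument, or you can unlist the dictionary to get the matches for each value.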

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))

toks <- tokens(c(d1 = "a b c d e f g and another"))

# where the dictionary keys are the patterns matched
as.data.frame(kwic(toks, dict))
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f     one
## 2      d1    2  2           a       b       c d e f g     one
## 3      d1    5  5     a b c d       e f g and another     two
## 4      d1    6  6   a b c d e       f   g and another     two
## 5      d1    8  8   c d e f g     and         another     one
## 6      d1    9  9 d e f g and another                     one

# where the dictionary values are the patterns matched
as.data.frame(kwic(toks, unlist(dict)))
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f      a*
## 2      d1    2  2           a       b       c d e f g       b
## 3      d1    5  5     a b c d       e f g and another       e
## 4      d1    6  6   a b c d e       f   g and another       f
## 5      d1    8  8   c d e f g     and         another      a*
## 6      d1    9  9 d e f g and another                      a*
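If you want something closer to the table in the question, one option is to collapse the kwic() result so that each document and dictionary key gets a single string of matched tokens. This is only a sketch: the second document `d2` and the names `kw` and `matches` are added here for illustration, and base R's aggregate() is just one of several ways to do the grouping.

```r
library("quanteda")

dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
# d2 is a made-up second document so that there is more than one group per key
toks <- tokens(c(d1 = "a b c d e f g and another", d2 = "b e e"))

# One row per match, with the dictionary key in the pattern column
kw <- as.data.frame(kwic(toks, dict))

# Collapse the distinct matched tokens for each document and key
matches <- aggregate(keyword ~ docname + pattern, data = kw,
                     FUN = function(x) paste(unique(x), collapse = ", "))
matches
```

The result is a long-format data.frame with one row per document and key that had at least one match; keys without matches simply do not appear. It can be kept alongside the output of dfm_lookup() rather than merged into it, which keeps the dfm sparse and numeric.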