quanteda::dfm_lookup(): capture the matched terms
I want to run quanteda's amazing dfm_lookup() on a dictionary, but also retrieve the terms that matched.
Consider the following example:
library("quanteda")

dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                           opposition = c("Opposition", "reject", "notincorpus"),
                           taxglob = "tax*",
                           taxregex = "tax.+$",
                           country = c("United_States", "Sweden")))
toks_ex <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                    "Does the United_States or Sweden have more progressive taxation?"))
# the remove argument of dfm() is deprecated in quanteda 3; dfm_remove() does the same job
dfmat_ex <- dfm_remove(dfm(toks_ex), stopwords("english"))
dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)
This gives me:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas opposition taxglob taxregex country
  text1         1          1       1        0       0
  text2         0          0       1        0       2
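However, since each dictionary key has multiple entries, I would like to know which token produced the match. (My real dictionary is quite long, so this example may look trivial, but for the actual use case it is not.)
I would like to end up with something like this: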
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country country.match
  text1         1       Christmas          1       Opposition       1           tax        0             NA       0 NA
  text2         0              NA          0               NA       1      taxation        0             NA       2 United_States, Sweden
Can anyone help me with this? Thanks in advance! :)
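This is not really possible, for two reasons.
First, a matrix(-like) object (a dfm or otherwise) cannot mix element modes, and here you would have a mix of counts and character values. That would be possible in a data.frame, but you would lose the advantages of sparsity, and you would end up with a data.frame of dimensions n x 2*V (where V = the number of features).
Second, "christmas.match" could have multiple features/tokens matching it, so the character value would need to be a list, which strains the object class even further.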
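To illustrate the first point, here is a minimal base R sketch. The object names `m` and `df` are made up for illustration and have nothing to do with quanteda internals: putting counts and strings into one matrix silently coerces everything to character, while a data.frame keeps the column types but is stored densely.

```r
# Hypothetical toy objects, only meant to show the coercion behaviour
m <- cbind(christmas = c(1, 0), christmas.match = c("Christmas", NA))
typeof(m)  # the counts have been coerced to strings

df <- data.frame(christmas = c(1, 0),
                 christmas.match = c("Christmas", NA),
                 stringsAsFactors = FALSE)
sapply(df, class)  # column types are preserved, but storage is dense
```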
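A better approach is to use kwic() to match the tokens against the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or you can unlist the dictionary to get the matches for each value.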
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))
# where the dictionary keys are the patterns matched
kwic(toks, dict) %>%
  as.data.frame()
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f     one
## 2      d1    2  2           a       b       c d e f g     one
## 3      d1    5  5     a b c d       e f g and another     two
## 4      d1    6  6   a b c d e       f   g and another     two
## 5      d1    8  8   c d e f g     and         another     one
## 6      d1    9  9 d e f g and another                     one
# where the dictionary values are the patterns matched
kwic(toks, unlist(dict)) %>%
  as.data.frame()
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f      a*
## 2      d1    2  2           a       b       c d e f g       b
## 3      d1    5  5     a b c d       e f g and another       e
## 4      d1    6  6   a b c d e       f   g and another       f
## 5      d1    8  8   c d e f g     and         another      a*
## 6      d1    9  9 d e f g and another                      a*
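Applied to the example from the question, a rough sketch could look like the following. It assumes quanteda 3.x or later, where as.data.frame() on a kwic object has the docname, keyword and pattern columns shown above. The names `toks_ex`, `kw_ex` and `matches` are made up for this sketch, and collapsing with base R's aggregate() is just one of several ways to do it.

```r
library("quanteda")

dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                           opposition = c("Opposition", "reject", "notincorpus"),
                           taxglob = "tax*",
                           taxregex = "tax.+$",
                           country = c("United_States", "Sweden")))
toks_ex <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                    "Does the United_States or Sweden have more progressive taxation?"))

# one row per matched token, with the dictionary key in the pattern column
kw_ex <- as.data.frame(kwic(toks_ex, dict_ex))

# collapse the matched tokens per document and dictionary key
matches <- aggregate(keyword ~ docname + pattern, data = kw_ex,
                     FUN = paste, collapse = ", ")
matches
```

This gives a long table with one row per document and key rather than extra columns in the dfm. Keys without any match in a document simply have no row.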