Logical combinations in quanteda dictionaries
I am working with quanteda dictionary lookups, and I would like to formulate entries that look up logical combinations of words. For example:

Teddybear = (fluffy AND adorable AND soft)

Is that possible? So far I have only found a way to test for phrases like (Teddybear = (soft fluffy adorable)), but that requires an exact phrase match in the text. How can I get matches that ignore word order?
This is not directly possible in quanteda (v1.2.0) at present. There are workarounds, however, in which you create dictionary values that are permutations of the desired sequence. Here is one such solution.

First, I'll create some example texts. Note that in some cases the words are separated by "," or by "and", and that the third text contains only two of the target words rather than three. (More on that below.)
txt <- c("The toy was fluffy, adorable and soft, he said.",
         "The soft, adorable, fluffy toy was on the floor.",
         "The fluffy, adorable toy was shaped like a bear.")
Now let's define a pair of functions to generate the permuted sequences and sub-sequences from a vector. These use some functions from the combinat package. The first is an internal function for generating the permutations; the second is the main calling function, which can generate either full-length permutations or any subsample down to subsample_limit. (To make these more robust for general use I would add error checking, but I've skipped that for this example.)
# generate every permutation of vec, collapsed into space-separated phrases
genperms <- function(vec) {
    combs <- combinat::permn(vec)
    sapply(combs, paste, collapse = " ")
}

# vec              any vector
# subsample_limit  integer from 1 to length(vec), subsamples from
#                  which to return permutations; default is no subsamples
permutefn <- function(vec, subsample_limit = length(vec)) {
    ret <- character()
    for (i in length(vec):subsample_limit) {
        ret <- c(ret,
                 unlist(lapply(combinat::combn(vec, i, simplify = FALSE),
                               genperms)))
    }
    ret
}
To demonstrate how these work:
fas <- c("fluffy", "adorable", "soft")
permutefn(fas)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# and with subsampling:
permutefn(fas, 2)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# [7] "fluffy adorable" "adorable fluffy" "fluffy soft"
# [10] "soft fluffy" "adorable soft" "soft adorable"
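One caveat: the size of this dictionary grows factorially with the number of words, since permutefn enumerates every ordering of every qualifying subset. As a rough sketch of the growth, counting the patterns produced for 1 to 6 words with subsample_limit = 1 (the counts follow the sum over k of n!/(n-k)!):

sapply(1:6, function(n) length(permutefn(letters[1:n], 1)))
# [1]    1    4   15   64  325 1956

For a three-word entry this is harmless, but for longer word lists a document-level approach such as the dfm_lookup() solution further below avoids the blow-up entirely.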
Now apply these to the texts using tokens_lookup(). I avoided the punctuation issue by setting remove_punct = TRUE, and to show the original tokens that were not replaced, I also used exclusive = FALSE.
tokens(txt, remove_punct = TRUE) %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "fluffy" "adorable" "and" "soft"
# [8] "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the"
# [8] "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
Here the first case was not caught, because the second and third elements are separated by "and". We can remove that token using tokens_remove(), and then get the match:
tokens(txt, remove_punct = TRUE) %>%
    tokens_remove("and") %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
Finally, to match the third text, in which only two of the three dictionary elements are present, we can pass 2 as the subsample_limit argument:
tokens(txt, remove_punct = TRUE) %>%
    tokens_remove("and") %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas, 2))),
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "TEDDYBEAR" "toy" "was" "shaped" "like" "a"
# [8] "bear"
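If you need this for more than one entry, the whole pipeline can be bundled into a small helper. The following is only a sketch built on the definitions above; lookup_and is a hypothetical name, not a quanteda function:

lookup_and <- function(x, name, words, min_words = length(words)) {
    # one dictionary key mapping to all qualifying permutations of the words
    dict <- dictionary(setNames(list(permutefn(words, min_words)), name))
    tokens(x, remove_punct = TRUE) %>%
        tokens_remove("and") %>%
        tokens_lookup(dictionary = dict, exclusive = FALSE)
}

lookup_and(txt, "teddybear", fas, min_words = 2)

Keep in mind that min_words = 2 will also match documents containing only two of the three words; the limit controls how strict the "AND" is.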
If you want to know which documents contain all of the words, simply do this:
require(quanteda)
txt <- c("The toy was fluffy, adorable and soft, he said.",
         "The soft, adorable, fluffy toy was on the floor.",
         "The fluffy, adorable toy was shaped like a bear.")
dict <- dictionary(list(teddybear = list(c1 = "fluffy", c2 = "adorable", c3 = "soft")))
mt <- dfm_lookup(dfm(txt), dictionary = dict["teddybear"], levels = 2)
cbind(mt, "teddybear" = as.numeric(rowSums(mt > 0) == length(dict[["teddybear"]])))
# Document-feature matrix of: 3 documents, 4 features (16.7% sparse).
# 3 x 4 sparse Matrix of class "dfm"
#        features
# docs    c1 c2 c3 teddybear
#   text1  1  1  1         1
#   text2  1  1  1         1
#   text3  1  1  0         0
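To pull out just the names of the documents that contain all of the words, you can index docnames() by the same condition (a small sketch using the mt and dict objects above):

docnames(mt)[rowSums(mt > 0) == length(dict[["teddybear"]])]
# [1] "text1" "text2"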