Checking if keyword in one table is within a string in another table using R

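I have been trying to solve this with mapply, but I believe I will have to use several nested applies to make it work, and it has become really confusing.

The problem is as follows:

Data frame one contains about 400 keywords. These fall into roughly 15 categories. Data frame two contains a string description field and 15 additional columns, each named after one of the categories in data frame one. It has millions of rows.

If a keyword from data frame 1 occurs in the string field of data frame 2, the category that the keyword belongs to should be flagged in data frame 2.

What I want should look something like this: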

    > #Dataframe1 df1
    >> keyword    category
    >> cat        A
    >> dog        A
    >> pig        A
    >> crow       B
    >> pigeon     B
    >> hawk       B
    >> catfish    C
    >> carp       C
    >> ...
    >>
    > #Dataframe2 df2
    >> description    A    B    C    ....
    >> false cat      1    0    0    ....
    >> smiling pig    1    0    0    ....
    >> shady pigeon   0    1    0    ....
    >> dogged dog     2    0    0    ....
    >> sad catfish    0    0    1    ....
    >> hawkward carp  0    1    1    ....
    >> ....

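I tried to get this working with mapply, but it fails with the error "longer argument not a multiple of length of shorter". It also computes the result only for the first string in df2. I have not got past this stage yet, i.e. I have not started on deriving the category flags.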

    > mapply(grepl, pattern = df1$keyword, x = df2$description)

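Can anyone help? I would be very grateful. I am new to R, so it would also help if someone could mention some 'thumb rules' for converting loops into apply functions. I cannot use loops for this problem because it would take too much time.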

There might be a more elegant way to do this, but this is what I came up with:

    ## Your sample data:
    df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"), 
        category = c("A", "A", "A", "B", "B", "B", "C", "C")), 
        .Names = c("keyword", "category"), 
        class = "data.frame", row.names = c(NA,-8L))
    df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
        .Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")), 
        .Names = "description", row.names = c(NA, -6L), class = "data.frame")

    ## Load packages:
    library(stringr)
    library(dplyr)
    library(tidyr)

    ## For each entry in df2$description count how many times each keyword
    ## is contained in it:
    outList <- lapply(df2$description, function(description){
            outDf <- data.frame(description = description,
                    value = vapply(stringr::str_extract_all(description, df1$keyword), 
                            length, numeric(1)), category = df1$category) 
        })

    ## Combine to one long data frame and aggregate by category:
    outLongDf<- do.call('rbind', outList) %>%
        group_by(description, category) %>%
        dplyr::summarise(value = sum(value))

    ## Reshape from long to wide format:
    outWideDf <- tidyr::spread(data = outLongDf, key = category,
        value = value)

    outWideDf
    # Source: local data frame [6 x 4]
    # Groups: description [6]
    # 
    #     description     A     B     C
    # *        <fctr> <dbl> <dbl> <dbl>
    # 1    dogged dog     2     0     0
    # 2     false cat     1     0     0
    # 3 hawkward carp     0     1     1
    # 4   sad catfish     1     0     1
    # 5  shady pigeon     1     1     0
    # 6   smiling pig     1     0     0

However, this approach also captures the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether that is what you want, though.
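If whole-word matches are what you want, one option is to wrap each keyword in word boundaries (`\\b`) before matching. Here is a minimal base R sketch of that idea, not a drop-in replacement for the code above. The helper name `count_whole_word` is made up for illustration, and the sketch assumes the keywords contain no regex metacharacters. Be aware that it would also stop "hawk" from matching inside "hawkward", which the desired output in the question does count.

```r
keywords     <- c("cat", "pig", "pigeon", "catfish")
descriptions <- c("sad catfish", "shady pigeon", "smiling pig")

# Hypothetical helper: count whole-word occurrences of one keyword in each
# description. Assumes the keyword contains no regex metacharacters.
count_whole_word <- function(keyword, x) {
  pattern <- paste0("\\b", keyword, "\\b")
  # gregexpr() finds all matches; regmatches() extracts them, lengths() counts them
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# One column per keyword, one row per description
counts <- sapply(keywords, count_whole_word, x = descriptions)
rownames(counts) <- descriptions
counts
```

With boundaries in place, "pig" no longer fires on "shady pigeon" and "cat" no longer fires on "sad catfish".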

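Regardless of the implementation, counting the number of matches per category requires k x d comparisons, where k is the number of keywords and d is the number of descriptions.

There are a few tricks to solve this fast and without using too much memory:

  • Use vectorized operations. These can be executed much faster than for loops. Note that lapply, mapply and vapply are just shorthand for for loops. I parallelize over the keywords (see the next point) so that the vectorization can cover the largest dimension, the descriptions.
  • Use parallelization. Making optimal use of your multiple cores speeds up the process at the cost of increased memory (since every core needs its own copy).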
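To make the first point concrete on a toy scale: mapply walks its arguments in parallel (first keyword with first description, second with second, and so on), so it never tests every keyword against every description. The sketch below, in base R with made-up sample vectors, loops over the short vector of keywords and lets grepl do the vectorized work over the long vector of descriptions. `fixed = TRUE` is my choice here so that keywords are treated as literal substrings.

```r
keywords     <- c("cat", "dog", "hawk")
descriptions <- c("false cat", "dogged dog", "hawkward carp", "sad catfish")

# Explicit loop over the few keywords only; each grepl() call is vectorized
# over all descriptions at once.
hits <- sapply(keywords, function(k) grepl(k, descriptions, fixed = TRUE))
rownames(hits) <- descriptions

# Logical matrix: one row per description, one column per keyword
hits
```

As a rule of thumb, keep the explicit loop (or apply call) on the small dimension and the vectorized function on the large one.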

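Example:

    library(stringr)   # str_detect()
    library(parallel)  # mclapply()

    # Simulated data: 400 keywords in 15 categories, 3 million descriptions
    keywords            <- stringi::stri_rand_strings(400, 2)
    categories          <- letters[1:15]
    keyword_categories  <- sample(categories, 400, TRUE)
    descriptions        <- stringi::stri_rand_strings(3e6, 20)

    # Logical vector: does each description contain the keyword?
    keyword_occurance <- function(word, list_of_descriptions) {
      str_detect(list_of_descriptions, word)
    }

    # Number of matching keywords per description for one category
    # (drop = FALSE keeps a matrix even if a category has a single keyword)
    category_occurance <- function(category, mat) {
      rowSums(mat[, keyword_categories == category, drop = FALSE])
    }

    # mclapply() relies on forking, so mc.cores > 1 is not supported on Windows
    list_keywords   <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
    df_keywords     <- do.call(cbind, list_keywords)
    list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
    df_categories   <- do.call(cbind, list_categories)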

On my computer, this takes 140 seconds and 14GB of RAM to match 400 keywords in 15 categories against 3 million descriptions.
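A rough back-of-the-envelope check on the memory: 400 keywords x 3 million descriptions is 1.2e9 comparisons, and R stores a logical in 4 bytes, so the full keyword matrix alone is about 4.8 GB before any copies are made. If memory is the bottleneck, one possible variation (my assumption, not something benchmarked here) is to process the descriptions in chunks and keep only the per-category totals of each chunk. The sketch below uses base R; `count_chunk`, `count_in_chunks` and `chunk_size` are hypothetical names, and like the example above it counts how many keywords of a category are detected, not how often each one occurs.

```r
# Hypothetical helper: per-category keyword hits for one chunk of descriptions.
count_chunk <- function(chunk, keywords, keyword_categories) {
  # One logical column per keyword; fixed = TRUE means literal substring match
  hits <- do.call(cbind, lapply(keywords, function(k) grepl(k, chunk, fixed = TRUE)))
  # rowsum() adds up the keyword rows that share a category; "+ 0L" converts
  # the logical matrix to integer because rowsum() needs numeric input
  t(rowsum(t(hits) + 0L, keyword_categories))
}

# Hypothetical driver: split the descriptions into chunks, count each chunk,
# and stack the small per-category matrices.
count_in_chunks <- function(descriptions, keywords, keyword_categories,
                            chunk_size = 1e5) {
  chunk_id <- ceiling(seq_along(descriptions) / chunk_size)
  pieces <- lapply(split(descriptions, chunk_id), count_chunk,
                   keywords = keywords, keyword_categories = keyword_categories)
  do.call(rbind, unname(pieces))
}

# Toy data, assumed for illustration only
keywords           <- c("cat", "dog", "crow", "carp")
keyword_categories <- c("A", "A", "B", "C")
descriptions       <- c("false cat", "dogged dog", "crow and carp", "nothing here")

result <- count_in_chunks(descriptions, keywords, keyword_categories, chunk_size = 2)
result
```

Only one chunk's keyword matrix exists at a time, so peak memory scales with `chunk_size` rather than with the total number of descriptions. The `lapply` over chunks could be swapped for `mclapply` if the extra per-core memory is acceptable.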

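What you are looking for is a so-called document-term-matrix (dtm for short), which comes from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is very fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.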
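For readers who have not met the term: a dtm has one row per document, one column per vocabulary term, and each cell holds how often that term occurs as a token in that document. The following base R sketch builds a tiny one by hand just to show the shape; it is only an illustration with made-up inputs, not how text2vec works internally, and the name `toy_dtm` is mine.

```r
descriptions <- c("false cat", "dogged dog", "hawkward carp")
vocabulary   <- c("cat", "dog", "hawk", "carp")

# Split each description into tokens on single spaces
tokens <- strsplit(descriptions, " ", fixed = TRUE)

# For each document, count how often each vocabulary term appears as a token.
# Tokens outside the vocabulary get index 0, which tabulate() ignores.
toy_dtm <- t(vapply(tokens, function(tok) {
  tabulate(match(tok, vocabulary, nomatch = 0L), nbins = length(vocabulary))
}, integer(length(vocabulary))))
dimnames(toy_dtm) <- list(descriptions, vocabulary)
toy_dtm
```

Because matching happens on whole tokens, "hawkward" does not count as "hawk" and "dogged" does not count as "dog". That differs from the substring matching in the other answers.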

In your example, the code would look something like this:

    # Create the data
    df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"), 
                          category = c("A", "A", "A", "B", "B", "B", "C", "C")), 
                     .Names = c("keyword", "category"), 
                     class = "data.frame", row.names = c(NA,-8L))
    df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
                                                  .Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")), 
                     .Names = "description", row.names = c(NA, -6L), class = "data.frame")

    # load the libraries
    library(text2vec) # to create the dtm
    library(tokenizers) # to help creating the dtm
    library(reshape2) # to reshape the data from wide to long

    # 1. create the vocabulary from the keywords
    vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))

    # 2. create the dtm
    dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)

    # 3. convert the sparse-matrix to a data.frame
    dtm_df <- as.data.frame(as.matrix(dtm))
    dtm_df$description <- df2$description

    # 4. melt to long format and keep only the keywords that actually occur
    #    (value > 0 also keeps keywords that occur more than once)
    df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
    df_result <- df_result[df_result$value > 0, ]

    # 5. combine the data, i.e., add category
    df_final <- merge(df_result, df1, by = "keyword")
    # keyword   description value category
    # 1    carp hawkward carp     1        C
    # 2     cat     false cat     1        A
    # 3 catfish   sad catfish     1        C
    # 4     dog    dogged dog     1        A
    # 5     pig   smiling pig     1        A
    # 6  pigeon  shady pigeon     1        B
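To get from this long result to the one-column-per-category layout asked for in the question, one option is a cross-tabulation with base R's `xtabs`. This is only a sketch: `df_final` is rebuilt by hand from the printed output above so that the snippet runs on its own, and `wide` and `wide_df` are names I picked.

```r
# df_final as printed above, typed in by hand
df_final <- data.frame(
  keyword     = c("carp", "cat", "catfish", "dog", "pig", "pigeon"),
  description = c("hawkward carp", "false cat", "sad catfish",
                  "dogged dog", "smiling pig", "shady pigeon"),
  value       = c(1, 1, 1, 1, 1, 1),
  category    = c("C", "A", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Sum the keyword counts per description and category
wide <- xtabs(value ~ description + category, data = df_final)

# Turn the contingency table into a plain data frame with columns A, B, C
wide_df <- as.data.frame.matrix(wide)
wide_df
```

Descriptions without any keyword match never reach `df_final`, so they are missing from `wide_df` and would have to be added back with zeros, for example by merging with df2.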