Checking if keyword in one table is within a string in another table using R
I have been trying to solve this using mapply, but I believe I will have to use multiple nested applies to make it work, and it is getting very messy.
The problem is as follows:
Dataframe one contains approximately 400 keywords. These fall roughly into 15 categories.
Dataframe two contains a string description field and 15 additional columns, each named after one of the categories from dataframe one. It has several million rows.
If a keyword from dataframe 1 is present in the string field of dataframe 2, the category that keyword belongs to should be flagged in dataframe 2.
What I want would look something like this:
#Dataframe1 df1
keyword  category
cat      A
dog      A
pig      A
crow     B
pigeon   B
hawk     B
catfish  C
carp     C
...

#Dataframe2 df2
description    A B C ....
false cat      1 0 0 ....
smiling pig    1 0 0 ....
shady pigeon   0 1 0 ....
dogged dog     2 0 0 ....
sad catfish    0 0 1 ....
hawkward carp  0 1 1 ....
....
I tried to get this to work using mapply, but it fails with the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't gotten past this stage yet, i.e. actually getting the category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
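(Aside on the error: mapply walks pattern and x in parallel and recycles the shorter vector (8 keywords against 6 descriptions), which is what triggers the warning. Crossing the two vectors instead gives the full keyword-by-description matrix; a minimal sketch, not part of the original question:)
> ## Hypothetical shape that does work: one column per keyword, one row per description
> hits <- sapply(df1$keyword, grepl, x = df2$description)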
Could anyone help? I would really appreciate it. I am new to R, so it would also help if someone could mention some rules of thumb for converting loops into apply functions. I cannot use loops to solve this, as it would take far too long.
There may be more elegant ways to do this, but here is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
  outDf <- data.frame(description = description,
                      value = vapply(stringr::str_extract_all(description, df1$keyword),
                                     length, numeric(1)),
                      category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf <- do.call('rbind', outList) %>%
  group_by(description, category) %>%
  dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether that is what you want, though.
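If you only want whole-word matches, a possible tweak (my assumption, not part of the original answer) is to wrap each keyword in word-boundary anchors before matching:
## Hypothetical variant: "\\b" anchors stop "pig" matching inside "pigeon"
patterns <- paste0("\\b", df1$keyword, "\\b")
outList <- lapply(df2$description, function(description){
  data.frame(description = description,
             value = vapply(stringr::str_extract_all(description, patterns),
                            length, numeric(1)),
             category = df1$category)
})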
Regardless of the implementation, counting the matches per category takes k x d comparisons, where k is the number of keywords and d is the number of descriptions.
There are a few tricks to make this fast without using too much memory:
- Use vectorized operations. These execute much faster than a for loop. Note that lapply, mapply, and vapply are just shorthand for for loops. I parallelize over the keywords (see the next point) so that the vectorization can run over the descriptions, the larger dimension.
- Use parallelization. Putting your multiple cores to good use speeds up the process at the cost of more memory (since each core needs its own copy of the data).
Example:
library(stringr)   # for str_detect
library(parallel)  # for mclapply

## Simulated data: 400 keywords in 15 categories, 3 million descriptions
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)

## For one keyword, flag every description that contains it (vectorized)
keyword_occurance <- function(word, list_of_descriptions) {
  str_detect(list_of_descriptions, word)
}

## For one category, count the keyword hits per description
category_occurance <- function(category, mat) {
  rowSums(mat[, keyword_categories == category])
}

## Parallelize over the keywords; vectorization covers the descriptions
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
On my computer, this takes 140 seconds and 14 GB of RAM to match 400 keywords across 15 categories against 3 million descriptions.
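A possible final step (an assumption on my part, not shown above) is to label the category columns and attach the counts back to the descriptions:
## Hypothetical wrap-up: name the columns and bind the counts to the text
colnames(df_categories) <- categories
result <- data.frame(description = descriptions, df_categories)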
What you are looking for is a so-called document-term matrix (dtm for short), which comes from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.
In your case, the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
df_final
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
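Since the question asked for one column per category, a last reshape (my addition, not in the original answer) with reshape2::dcast gets back to the wide layout. Note that because the dtm tokenizes on whole words, "pig" inside "pigeon" is not counted here, unlike in the first answer:
## Hypothetical extra step: pivot description ~ category, summing the matches
df_wide <- dcast(df_final, description ~ category,
                 value.var = "value", fun.aggregate = sum)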