Vectorize a for loop that counts occurrences in R

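I have a variable in a dataset that contains phrases I want to search for (female$Var2). For each phrase, I want to find the number of rows in another data frame (female_df$MH2) that contain it. For example, female$Var2 looks like: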

myocardial infarction drug therapy
imipramine poisoning
oximetry
thrombosis drug therapy
angioedema chemically induced

I want to find the number of rows containing each of the phrases above in female_df$MH2, which looks like this:

oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

So my output should look like this:

myocardial infarction drug therapy          1
imipramine poisoning                        1
oximetry                                    2
thrombosis drug therapy                     2
angioedema chemically induced               1

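Note that this is not the total number of occurrences (see angioedema...). It is the number of rows that contain the phrase. I am currently running a for loop, which takes too long because it searches 428,000+ rows for 5,000+ terms. When I try to vectorize my function with occurrences_female(female$Var2), I get the warning In grepl(word, female_df$MH2, ignore.case = TRUE) : argument 'pattern' has length > 1 and only the first element will be used, and only the result for the first element of female$Var2 is returned.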

This is the for loop I am running:

for (i in 1:nrow(female))
{
  word <- female$Var2[i]
  df_female <- data.frame(word, occurrences_female(word))
  df_female2 <- rbind(df_female2, df_female)
}

It is based on this function:

occurrences_female <- function(word)
{
  # inserts \b (regex word boundary) at the beginning;
  # the backslash must be doubled inside an R string
  word <- paste0("\\b", word)

  # inserts \b at the end
  n <- nchar(word)
  word <- paste(substr(word, 1, n), "\\b", sep = "")

  occurrences <- sum(grepl(word, female_df$MH2, ignore.case = TRUE))

  return (occurrences)
}

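The function works when I run it manually, but I need to apply it to 5,000+ terms and the for loop is too slow (it has already been running for over 2 hours). I don't know how to search one variable of a data frame using a variable from a different data frame.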

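In your solution above, repeatedly using rbind to append one row at a time to the data frame is very expensive in terms of processing time, because the data frame is copied on every iteration.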
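To illustrate the point about rbind, here is a minimal base R sketch that computes every count first and builds the data frame once. The vectors terms and rows are small stand-ins I made up from the question's sample data, and the column names word and occurrences are placeholders, not something from the original code.

```r
# Stand-in data (assumption): a few terms and rows modelled on the question
terms <- c("imipramine poisoning", "oximetry", "thrombosis drug therapy")
rows <- c("oximetry, hydrogen peroxide adverse effects",
          "abo blood group system, imipramine poisoning, adverse effects",
          "thrombosis drug therapy, oximetry")

# vapply fills one integer vector, one element per term, with no rbind
counts <- vapply(terms,
                 function(term) sum(grepl(term, rows, ignore.case = TRUE)),
                 integer(1),
                 USE.NAMES = FALSE)

# One data.frame() call at the end replaces thousands of rbind() calls
result <- data.frame(word = terms, occurrences = counts,
                     stringsAsFactors = FALSE)
print(result)
```

The search itself still runs once per term, so this removes the rbind overhead but not the cost of the pattern matching.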

Here is a solution using the stringr package.

#Data set up
var2<-c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
             "thrombosis drug therapy", "angioedema chemically induced")
female<-data.frame(var2, stringsAsFactors = FALSE)

MH2<-c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
                "abo blood group system, imipramine poisoning, adverse effects",
                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                "thrombosis drug therapy")
female_df<-data.frame(MH2, stringsAsFactors = FALSE)

library(stringr)
# create a matrix whose columns are the terms
# and whose rows are the lines checked
termmatrix<-sapply(female$var2, function(x){str_count(female_df$MH2, x)})
# sum each column to get the number of times each term is used
ans<-colSums(termmatrix)

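The final ans is a named vector of the terms and their total counts. Note that str_count counts every match, so a phrase repeated within one row (such as "angioedema chemically induced" in the second row) is counted more than once. This gives total occurrences rather than the number of rows the question asks for.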
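The difference between total occurrences and rows containing a term can be checked with a small base R sketch. The two-row rows vector is an assumption taken from the question's sample, and gregexpr is used here only as a base R way of counting every match. It is not what the stringr code above calls internally.

```r
# Stand-in data (assumption): one row repeats the term twice
rows <- c("angioedema chemically induced, angioedema chemically induced, oximetry",
          "oximetry")
term <- "angioedema chemically induced"

# gregexpr returns every match position per row (-1 when there is none)
matches <- gregexpr(term, rows, fixed = TRUE)
total_occurrences <- sum(vapply(matches, function(m) sum(m > 0), integer(1)))

# grepl returns one TRUE/FALSE per row, so the sum is a row count
rows_containing <- sum(grepl(term, rows, fixed = TRUE))

print(c(total = total_occurrences, rows = rows_containing))
```

The row count is the quantity the question asks for.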

Addendum
To avoid creating a huge term matrix, and to count rows rather than total matches, try:

ans<-sapply(female$var2, function(x){length(grep(x, female_df$MH2))})

This is a slight modification of Luis's answer.
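If the terms are plain text rather than regular expressions, one option that may be worth timing is fixed = TRUE, which makes grep and grepl treat the pattern literally. The sketch below uses made-up data, and the helper name count_rows_fixed is hypothetical. Literal matching cannot be combined with ignore.case = TRUE or with \b word boundaries, so it only applies if neither is needed.

```r
# Stand-in data (assumption); the last term contains regex metacharacters
terms <- c("oximetry", "thrombosis drug therapy", "adverse effects (acute)")
rows <- c("oximetry, thrombosis drug therapy",
          "adverse effects (acute), oximetry",
          "thrombosis drug therapy")

# Hypothetical helper: number of rows containing the literal string
count_rows_fixed <- function(term, text) {
  sum(grepl(term, text, fixed = TRUE))
}

counts <- vapply(terms, count_rows_fixed, integer(1), text = rows)
print(counts)

# As a regular expression, the parentheses form a group instead of
# matching literal "(" and ")", so the same term finds no rows
regex_count <- sum(grepl("adverse effects (acute)", rows))
print(regex_count)
```

Besides usually being cheaper than regex matching, the literal mode avoids surprises when a term happens to contain characters such as ( or +.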

A base-R-only solution (assuming the strings in female$Var2 are unique):

counts <- sapply(female$Var2, function(x) {
  # number of rows of female_df$MH2 that match the term
  sum(grepl(pattern = x, x = female_df$MH2, ignore.case = TRUE))
})
word_counts <- cbind(female$Var2, counts)
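The function in the question also wraps each term in \b word boundaries, which the answers here leave out. A minimal sketch of adding them back to the sapply-style approach follows. The data are invented for illustration, the helper name count_rows_bounded is hypothetical, and perl = TRUE is my choice so that \b is handled by the PCRE engine. It assumes the terms contain no regex metacharacters.

```r
# Stand-in data (assumption) chosen so that boundaries change the result
terms <- c("oximetry", "therapy")
rows <- c("oximetry, drug therapy",
          "Oximetry",
          "chemotherapy",
          "radiotherapy, therapy")

# Hypothetical helper: rows containing the term as a whole word or phrase
count_rows_bounded <- function(term, text) {
  pattern <- paste0("\\b", term, "\\b")
  sum(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
}

bounded <- vapply(terms, count_rows_bounded, integer(1), text = rows)

# Same search without boundaries, for comparison
unbounded <- vapply(terms,
                    function(term) sum(grepl(term, rows, ignore.case = TRUE)),
                    integer(1))

print(rbind(bounded, unbounded))
```

Without the boundaries, "therapy" also matches inside "chemotherapy", so the two counts differ for that term.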

Summary

We can use the code below to accomplish the task. The benchmark further down suggests that it performs well.

library(purrr)
library(stringr)

female$Count <- map_int(female$Var2, 
                    function(x){sum(str_detect(female_df$MH2, pattern = x))})

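Introduction

There are several ways to count how many rows contain each word or phrase. Based on the answers and discussion in this thread so far, the general strategy is:

  1. Use a function that applies the operation to every term, such as lapply or sapply from base R, or the map functions from the purrr package.
  2. Use a function that counts or detects whether a specific pattern (word or phrase) is in a string, such as grep or grepl from base R, or str_detect or str_which from the stringr package.

Since the OP has a large amount of data to process, I ran an analysis to compare which combinations of functions from base R, purrr, and stringr complete the same task in the least time.

I examined eight combinations in total: the iterating function is either sapply or map_int, and the matching function is one of grep, grepl, str_which, or str_detect.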

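Data preparation

Here I create two data frames, female and female_df, based on the OP's example. Note that I set stringsAsFactors = FALSE to make sure every column is stored as character.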

# Create the example data frame: female
female <- data.frame(Var2 = c("myocardial infarction drug therapy", 
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)

# Create the example data frame: female_df
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)

I also load the required packages. microbenchmark is a package for evaluating the performance of code.

# Load packages
library(purrr)
library(stringr)
library(microbenchmark)

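Function combinations

Here is the list of function combinations that can accomplish the OP's task.

Combination 1

This is from Luís Telles's answer. It uses sapply and grepl.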

sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})

myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 2

This is from Dave2e's answer. It uses sapply and grep.

sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})

myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 3

This uses map_int and str_detect.

map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1

Combination 4

This uses map_int and str_which.

map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1

Combination 5

This uses map_int and grepl.

map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
[1] 1 1 2 2 1

Combination 6

This uses map_int and grep.

map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
[1] 1 1 2 2 1

Combination 7

This uses sapply and str_detect.

sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 8

This uses sapply and str_which.

sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

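All of these combinations are valid answers. For example, we can store the result of any of them in a new column by assigning it with female$Count <-.

Microbenchmarking

Here I benchmark the eight combinations with 30,000 samples each.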

m <- microbenchmark(
  C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  times = 30000L
)

print(m)

Unit: microseconds
 expr     min      lq     mean   median       uq       max neval
   C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
   C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
   C3 189.238 231.575 1502.319  790.305 2386.309 146455.85 30000
   C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
   C5 150.107 185.388 1452.586 1970.630 2376.687  32124.08 30000
   C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
   C7 205.916 251.461 1516.979  851.246 2408.119 146305.10 30000
   C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000

All of these combinations have similar mean times, but Combination 3, which uses map_int and str_detect, has the lowest median.
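The timings above were measured on the five-row example, so they may mostly reflect function-call overhead rather than matching cost. Below is a rough sketch for repeating a comparison on a larger synthetic sample, using only base R and system.time. The vocabulary, the row count n_rows, and the way rows are generated are arbitrary assumptions, and the timings will vary by machine, so no results are shown.

```r
set.seed(1)

# Assumed vocabulary, borrowed from the example phrases
vocab <- c("oximetry", "imipramine poisoning", "thrombosis drug therapy",
           "adverse effects", "isoenzymes", "abo blood group system")

# Each synthetic row joins three randomly chosen vocabulary entries
n_rows <- 10000
rows <- vapply(seq_len(n_rows),
               function(i) paste(sample(vocab, 3), collapse = ", "),
               character(1))
terms <- vocab[1:3]

# Time the grepl-based and grep-based row counts on the same data
t_grepl <- system.time(
  c_grepl <- vapply(terms, function(x) sum(grepl(x, rows)), integer(1))
)
t_grep <- system.time(
  c_grep <- vapply(terms, function(x) length(grep(x, rows)), integer(1))
)

# Both approaches must agree before the timings mean anything
stopifnot(identical(c_grepl, c_grep))
print(c(grepl = t_grepl[["elapsed"]], grep = t_grep[["elapsed"]]))
```

The same synthetic rows and terms could be passed to microbenchmark with the eight expressions above to see whether the ranking holds at a size closer to the OP's data.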