R: How to improve performance of grepl in apply function within dataframe
I have a data frame with the following columns:
country <- c("CA","IN","US")
text <- c("paint red green", "painting red", "painting blue")
word <- c("green, red, blue", "red", "red, blue")
df <- data.frame(country, text, word)
For each row, I want to find which words from the word column occur in the text column, and put them in a new column, separated by commas.
So the new column should be:
df$new_col <- c("green,red","red","blue")
I am using the following lines of code, but they take a very long time to run and sometimes even crash.
setDT(df)[, new_col:= paste(df$word[unlist(lapply(df$word,function(x) grepl(x, df$text,
ignore.case = T)))], collapse = ","), by = 1:nrow(df)]
Is there a way to change the code to make it more efficient?
Thanks a lot!
Try this:
# Split both columns into words, then keep the words common to each row
mapply(function(x,y){paste(intersect(x,y),collapse=", ")},
       strsplit(as.character(df$text),", | "),
       strsplit(as.character(df$word),", | "))
[1] "red, green" "red" "blue"
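Note that intersect returns the matches in the order they appear in text ("red, green"), while the expected output in the question follows the order of the word column ("green,red"). If that order matters, a minimal sketch along the following lines might work. The helper name find_words is hypothetical, and the sketch assumes whole-word, case-insensitive matching is what is wanted:

```r
country <- c("CA", "IN", "US")
text <- c("paint red green", "painting red", "painting blue")
word <- c("green, red, blue", "red", "red, blue")
df <- data.frame(country, text, word, stringsAsFactors = FALSE)

# Hypothetical helper: keep the candidate words (in word-column order)
# that appear as whole tokens in the text, ignoring case.
find_words <- function(words, txt) {
  candidates <- strsplit(words, ",\\s*")[[1]]
  tokens <- strsplit(tolower(txt), "\\s+")[[1]]
  paste(candidates[tolower(candidates) %in% tokens], collapse = ",")
}

# Each row only looks at its own text and word values
df$new_col <- unname(mapply(find_words, df$word, df$text))
print(df$new_col)
```

Because each row is compared only against its own text, the work grows linearly with the number of rows, unlike the original code, which calls grepl against the whole text column inside every group.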
library(tidyverse)
df %>%
mutate(newcol = stringr::str_extract_all(text,gsub(", +","|",word)))
  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue
In this case, newcol is a list column. To turn it into a character column, we can do:
df %>%
  mutate(newcol = text %>%
           str_extract_all(gsub(", +", "|", word)) %>%
           map_chr(toString))  # collapse each row's matches into one string
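With data.table, you can do:
# df must be a data.table (setDT converts it in place); str_extract_all is from stringr
setDT(df)[, id := .I][, newcol := do.call(toString, str_extract_all(text, gsub(', +', "|", word))),
          by = id][, id := NULL][]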
   country            text             word     newcol
1:      CA paint red green green, red, blue red, green
2:      IN    painting red              red        red
3:      US   painting blue        red, blue       blue
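One caveat with building a regex alternation from the word column: the pattern matches substrings, so a word such as "paint" would also match inside "painting". The sketch below, in base R, compares the plain alternation with a variant wrapped in \b word boundaries. The second row's word value ("paint") is made up for illustration and is not from the question, and extract is a hypothetical helper:

```r
text <- c("paint red green", "painting red", "painting blue")
# Hypothetical data: "paint" in row 2 is an assumption, not from the question
word <- c("green, red, blue", "paint", "red, blue")

# Plain alternation, e.g. "green|red|blue"
pattern_plain <- gsub(", +", "|", word)
# Same alternation, restricted to whole words
pattern_bounded <- paste0("\\b(?:", pattern_plain, ")\\b")

# Hypothetical helper: extract all matches of one pattern from one string
extract <- function(pat, txt) {
  toString(regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]])
}

plain <- unname(mapply(extract, pattern_plain, text))
bounded <- unname(mapply(extract, pattern_bounded, text))
print(plain)    # row 2 picks up "paint" from inside "painting"
print(bounded)  # row 2 is empty because "paint" is not a whole word there
```

Whether substring matches are acceptable depends on the data. The word column is also interpreted as a regex here, so words containing regex metacharacters would need escaping first.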
Another base R solution, using mapply + grep + regmatches:
# Backslashes must be doubled inside R string literals
df <- within(df, newcol <- mapply(function(x,y) toString(grep(x,y,value = TRUE)),
                                  gsub("\\W+","|",word),
                                  regmatches(text,gregexpr("\\w+",text))))
which gives:
> df
  country            text             word     newcol
1      CA paint red green green, red, blue red, green
2      IN    painting red              red        red
3      US   painting blue        red, blue       blue
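Since the question is about run time, it may be worth timing a candidate on data closer to the real size before adopting it. The following is a rough sketch using only base R. The row count n and the replicated data frame big are assumptions for illustration, and the timings will vary by machine:

```r
text <- c("paint red green", "painting red", "painting blue")
word <- c("green, red, blue", "red", "red, blue")

# Assumed size for the experiment; adjust to match the real data
n <- 10000
big <- data.frame(text = rep(text, length.out = n),
                  word = rep(word, length.out = n),
                  stringsAsFactors = FALSE)

# Time the row-wise mapply + grep + regmatches approach
timing <- system.time(
  big$newcol <- mapply(function(x, y) toString(grep(x, y, value = TRUE)),
                       gsub("\\W+", "|", big$word),
                       regmatches(big$text, gregexpr("\\w+", big$text)),
                       USE.NAMES = FALSE)
)
print(timing)
head(big, 3)
```

The same system.time wrapper can be put around any of the other approaches to compare them on identical input.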