Detect multiple strings in one df using strings from another df and, if detected, return the detected strings
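I'm learning to use R, so please bear with me.
I have a dataset of Google Play Store apps (master_tib). Each row is one Play Store app. There is a column titled Description that contains text describing what the app does.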
master_tib
App Description
App1 Reduce your depression and anxiety
App2 Help your depression
App3 This app helps with Anxiety
App4 Dog walker app 3000
I also have a df of tags (master_tags) containing important words that I have predefined. It has a single column titled Tag, with one tag per row.
master_tag
Tag
Depression
Anxiety
Stress
Mood
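My goal is to tag each app in the master_tib df with the tags from the master_tags df, based on whether the tag appears in the description. The matching tags should then be written to a new column.
The end result would be a master_tib df that looks like this: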
App Description Tag
App1 Reduce your depression and anxiety depression, anxiety
App2 Help your depression depression
App3 This app helps with anxiety anxiety
App4 Dog walker app 3000 FALSE
Here is what I have done so far, using a combination of str_detect and mapply:
# define function to use in mapply
detect_tag <- function(description, tag){
  if(str_detect(description, tag, FALSE)) {
    return (tag)
  } else {
    return (FALSE)
  }
}
index <- mapply(FUN = detect_tag, description = master_tib$description, master_tags$tag)
master_tib[index,]
Unfortunately, only the first tag gets passed through.
App Description Tag
App1 Reduce your depression and anxiety depression
Instead of the desired:
App Description Tag
App1 Reduce your depression and anxiety depression, anxiety
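I also have not yet gotten as far as writing the result to a new column. I would love to hear anyone's insights or ideas, and I apologize in advance for my poor R skills.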
You can use str_c to combine the words in master_tag into a single pattern, then use str_extract_all to get all the words that match that pattern.
library(stringr)
master_tib$Tag <- sapply(
  str_extract_all(tolower(master_tib$Description),
                  # "\\b" is a regex word boundary; a single "\b" would be a backspace character
                  str_c('\\b', tolower(master_tag$Tag), '\\b', collapse = "|")),
  function(x) toString(unique(x)))
master_tib$Tag
#[1] "depression, anxiety" "depression" "anxiety" ""
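To make the pattern-building step easier to inspect, here is a minimal sketch that rebuilds the same alternation with base R's paste0() and gregexpr()/regmatches() in place of the stringr calls. The `desc` vector is hypothetical and not part of the question's data; it was picked only to show what the word boundaries exclude.

```r
# Tags from the question, lower-cased and wrapped in word boundaries
tags <- c("Depression", "Anxiety", "Stress", "Mood")
pattern <- paste0("\\b", tolower(tags), "\\b", collapse = "|")
cat(pattern, "\n")
# \bdepression\b|\banxiety\b|\bstress\b|\bmood\b

# Hypothetical descriptions (assumption, for illustration only)
desc <- c("Reduce your depression and anxiety",
          "Track your moods",
          "Stressful day planner")

# Extract every match per description, then collapse unique hits into one string
matches <- regmatches(tolower(desc), gregexpr(pattern, tolower(desc), perl = TRUE))
result <- sapply(matches, function(x) toString(unique(x)))
result
# [1] "depression, anxiety" ""                    ""
```

Because of the boundaries, "moods" and "Stressful" do not count as hits for "mood" and "stress". If partial-word matches are wanted, dropping the `\\b` parts should allow them.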
Data
master_tag <- structure(list(Tag = c("Depression", "Anxiety", "Stress", "Mood"
)), class = "data.frame", row.names = c(NA, -4L))
master_tib <- structure(list(App = c("App1 ", "App2 ", "App3 ", "App4 "
), Description = c("Reduce your depression and anxiety", "Help your depression",
"This app helps with Anxiety", "Dog walker app 3000")), row.names = c(NA,
-4L), class = "data.frame")
Similar to @RonakShah's answer, but in base R:
apply(
sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE),
1, function(a) paste(master_tag$Tag[a], collapse = ","))
# [1] "Depression,Anxiety" "Depression" "Anxiety"
# [4] ""
(This version does not lower-case the tags or use the "comma-space" separator; both are easy to add if needed.)
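As a sketch of how those two details could be added, the variant below lower-cases the matched tags and joins them with ", ". The data frames are rebuilt with data.frame() so the block runs on its own, and the intermediate name `hits` is introduced here for readability only.

```r
master_tag <- data.frame(Tag = c("Depression", "Anxiety", "Stress", "Mood"))
master_tib <- data.frame(
  App = c("App1", "App2", "App3", "App4"),
  Description = c("Reduce your depression and anxiety", "Help your depression",
                  "This app helps with Anxiety", "Dog walker app 3000")
)

# Logical matrix: one row per description, one column per tag
hits <- sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE)

# For each row, keep the tags that matched, lower-cased and joined by ", "
master_tib$Tag <- apply(hits, 1, function(a)
  paste(tolower(master_tag$Tag[a]), collapse = ", "))
master_tib$Tag
# [1] "depression, anxiety" "depression"          "anxiety"
# [4] ""
```

Note that this relies on sapply() returning a matrix, which it does here because there are several descriptions. With a single description it would simplify to a vector and apply() would fail, so that case would need separate handling.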
Using a few packages from the tidyverse (dplyr, stringr, tidyr) and the data shown in @Ronak Shah's answer.
First, turn the tags into a pattern:
# master_tag is the data frame defined in the Data section above
pattern <- master_tag$Tag %>%
  tolower() %>%
  str_c(collapse="|")
Then find all the matches and create the desired output:
master_tib %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  group_by(App, Description) %>%
  summarise(Tag = str_c(Tag, collapse=", "))
This produces
# A tibble: 4 x 3
# Groups: App [4]
App Description Tag
<chr> <chr> <chr>
1 App1 Reduce your depression and anxiety depression, anxiety
2 App2 Help your depression depression
3 App3 This app helps with Anxiety anxiety
4 App4 Dog walker app 3000 NA
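The NA in the last row marks an app with no matching tag, whereas the other answers return an empty string and the question asked for FALSE. A small base R sketch of how the result could be normalized either way follows. The `tag` vector is copied from the printed Tag column above, and `tag_empty` and `has_tag` are names made up for this example.

```r
# Tag column as printed in the tibble above
tag <- c("depression, anxiety", "depression", "anxiety", NA)

# Option 1: empty string for "no tag found", matching the other answers
tag_empty <- ifelse(is.na(tag), "", tag)
tag_empty
# [1] "depression, anxiety" "depression"          "anxiety"
# [4] ""

# Option 2: a separate logical flag, which keeps the Tag column purely character
has_tag <- !is.na(tag)
has_tag
# [1]  TRUE  TRUE  TRUE FALSE
```

Inside the pipeline itself, tidyr's replace_na() within a mutate() step should give the same result as option 1.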