How do I include rows from a dataframe that contain certain keywords
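I am analysing reddit threads for an assignment, and I only want to include threads that contain certain keywords.
I have a list of keywords: keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')
The data frame has 3 columns. I only want to include rows where the column named title contains one of the keywords.
For example: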
| | title | created_utc |
|---|---|---|
| 1 | Anyone have a RH wallet yet? Asking for a friend | 164128421 |
| 2 | Ravi Menon, managing director of the Monetary Auth... | 164131283 |
| 3 | Different Augmented Reality(AR) NFT apps and marke... | 164134123 |
keywordstest2<-paste0(keywords, collapse = "|")
dfsub%>% filter(grepl(keywordstest2,title))
I tried this, and obvs it didn't work.
Does anyone know how to do this? Thanks :D
This should work.
library(tidyverse)
dfsub %>%
filter(grepl('addict|addicted|addiction|addictive|afraid|anxiety|anxious|cry|crying|delusion|delusional', title))
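You can try this. I extended the example so that it contains 2 of the keywords.
Edit: as Merijn mentioned in the comments, I added word boundaries (\b) to exclude false positives, because grepl does partial matching.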
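library(dplyr)
keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")
# In an R string the word boundary must be written "\\b"; a single "\b" is a backspace.
# Inside filter() the column is referenced as title, not df$title.
df %>% filter(grepl(paste0("\\b", paste(keywords,
collapse = "\\b|\\b"), "\\b"), title))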
id title
1 1 Anyone have a RH wallet yet? Asking delusion for a friend
2 3 Different Augmented Reality(AR) NFT apps and anxiety and marke...
created_utc
1 164128421
2 164134123
Data
df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary Auth... crypto", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))
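To see why the word boundaries in the answer above matter, here is a minimal base R sketch. The `titles` vector is made up for illustration, and `ignore.case = TRUE` is an addition of mine so that capitalised words are caught as well.

```r
# Hypothetical titles, for illustration only
titles <- c("crypto is up again", "I cry every time", "Crying over losses")
keywords <- c("cry", "crying")

# Plain alternation: partial matches count, so "crypto" matches "cry"
loose <- paste(keywords, collapse = "|")
loose_hits <- grepl(loose, titles, ignore.case = TRUE)

# Word-bounded alternation: only whole words match
strict <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, titles, ignore.case = TRUE)

print(loose_hits)
print(strict_hits)
```

Wrapping the alternation in one group with a single `\\b` on each side is equivalent to putting `\\b` around every keyword, and keeps the pattern a bit shorter.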
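Here is another tidyverse option. I collapse your keywords into a single searchable pattern (i.e., addict OR addicted OR ...). Then I use str_detect on title to look for any of those keywords, and keep the rows where one is found (using filter).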
library(tidyverse)
df %>%
filter(str_detect(title, paste(keywords, collapse = "|")))
Or in base R, you can filter in one line:
df[grep(paste(keywords, collapse = "|"),df$title),]
Output
id title created_utc
1 1 Anyone have a RH wallet yet? Asking delusion for a friend 164128421
2 3 Different Augmented Reality(AR) NFT apps and anxiety and marke... 164134123
Data
df <-
structure(
list(
id = 1:3,
title = c(
"Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary Auth...",
"Different Augmented Reality(AR) NFT apps and anxiety and marke..."
),
created_utc = c(164128421L, 164131283L, 164134123L)
),
class = "data.frame",
row.names = c(NA,-3L)
)
keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')
To filter on 2 columns, you could do this:
df %>%
filter(Reduce(`|`, across(
c(title, selftext), .fns = ~ str_detect(., paste(keywords, collapse = "|"))
)))
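The same idea can be sketched in base R. The `selftext` column is not part of the example data in this thread, so the sketch below makes up a small data frame (`posts`) that has one; the names and contents are assumptions for illustration only.

```r
# Hypothetical data with a second text column
posts <- data.frame(
  id = 1:3,
  title = c("Asking delusion for a friend", "Monetary Auth...", "NFT apps"),
  selftext = c("nothing here", "still nothing", "so much anxiety"),
  stringsAsFactors = FALSE
)
keywords <- c("anxiety", "delusion")
pattern <- paste(keywords, collapse = "|")

# One logical vector per column, then combine them with a row-wise OR
hits <- lapply(posts[c("title", "selftext")], function(col) grepl(pattern, col))
keep <- Reduce(`|`, hits)

result <- posts[keep, ]
print(result)
```

A row is kept if either column matches, which is what the `Reduce` over `|` expresses in the dplyr version as well.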
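For this relatively small list of keywords, concatenating them with | is an option, but you will run into problems when the string to match becomes too large. The answers given so far also match "crypto" because of the keyword "cry". I adjusted df slightly so that it contains the word "crypto".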
df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary crypto Auth...", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))
keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")
library(dplyr)
library(stringi)  # the stri_* functions come from stringi, not stringr
df %>%
group_by(id) %>%
filter(any(stri_trans_tolower(stri_extract_all_words(title)[[1]]) %in% keywords))
# # A tibble: 2 x 3
# # Groups: id [2]
# id title created_utc
# <int> <chr> <int>
# 1 1 Anyone have a RH wallet yet? Asking delusion for a friend 164128421
# 2 3 Different Augmented Reality(AR) NFT apps and anxiety and marke... 164134123
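If stringi is not available, a rough base R equivalent of this whole-word approach might look like the sketch below. It splits on anything that is not an ASCII letter, which is a cruder tokenizer than `stri_extract_all_words`, so treat it as an approximation; the helper name `has_keyword` is made up for this example.

```r
df <- data.frame(
  id = 1:3,
  title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
            "Ravi Menon managing director of the Monetary crypto Auth...",
            "Different Augmented Reality(AR) NFT apps and anxiety and marke..."),
  stringsAsFactors = FALSE
)
keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
              "anxious", "cry", "crying", "delusion", "delusional")

# Hypothetical helper: TRUE if any whole word of `text` is in `words`
has_keyword <- function(text, words) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  any(tokens %in% words)
}

# Apply the helper to every title; no regex alternation is built at all
keep <- vapply(df$title, has_keyword, logical(1), words = keywords,
               USE.NAMES = FALSE)
result <- df[keep, ]
print(result$id)
```

Because the comparison is done with `%in%` on whole tokens, "crypto" is not picked up by "cry", and the keyword list can grow without producing an ever longer regular expression.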