How do I include rows from a dataframe that contain certain keywords

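I'm analyzing reddit threads for an assignment, and I only want to include threads that contain certain keywords.

I have a list of keywords: keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

The dataframe has 3 columns. I only want to include rows where the column named title contains one of the keywords.

For example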

title created_utc
1 Anyone have a RH wallet yet? Asking for a friend 164128421
2 Ravi Menon, managing director of the Monetary Auth... 164131283
3 Different Augmented Reality(AR) NFT apps and marke... 164134123

keywordstest2 <- paste0(keywords, collapse = "|")
dfsub %>% filter(grepl(keywordstest2, title))

I tried this, obvs it didn't work.

Does anyone know how to do this? Thanks :D

This should work.

library(tidyverse)

dfsub %>% 
filter(grepl('addict|addicted|addiction|addictive|afraid|anxiety|anxious|cry|crying|delusion|delusional', title))
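
The hard-coded pattern above does not have to be typed by hand. A minimal sketch, assuming keywords is the character vector from the question and using two made-up titles purely for illustration, that builds the same alternation string with paste() and applies it with grepl():

```r
keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
              "anxious", "cry", "crying", "delusion", "delusional")

# Join the keywords with the regex alternation operator "|"
pattern <- paste(keywords, collapse = "|")

# Made-up titles, only for illustration (not from the question's data)
titles <- c("Feeling anxious about my wallet", "NFT apps and marketplaces")

# TRUE where a title contains at least one keyword (partial matches included)
hits <- grepl(pattern, titles)
```

Here pattern is the same string as the literal passed to grepl() above, so the two approaches should filter identically.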

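You can try this. I extended the example to include 2 keywords.

Edit: as Merijn mentioned in the comments, add the word boundary \b to exclude false positives, since grepl does partial matching.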

library(dplyr)

keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")

df %>% filter(grepl(paste0("\\b", paste(keywords, 
  collapse = "\\b|\\b"), "\\b"), title))
  id                                                             title
1  1         Anyone have a RH wallet yet? Asking delusion for a friend
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...
  created_utc
1   164128421
2   164134123

Data

df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary Auth... crypto", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))
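
To see what the word boundary changes, here is a small sketch in base R. The three titles and the shortened keyword vector are made up for illustration. Note that in an R string literal the boundary has to be written as "\\b", because a single "\b" is the backspace character. perl = TRUE is an optional choice here, not something the answer above requires.

```r
# Shortened, hypothetical inputs for illustration
keywords <- c("cry", "anxiety")
titles <- c("director of the Monetary Auth... crypto",
            "NFT apps and anxiety",
            "Do not cry over it")

# Plain alternation: matches anywhere, including inside longer words
plain <- paste(keywords, collapse = "|")

# Same alternation wrapped in word boundaries: whole words only
bounded <- paste0("\\b(", plain, ")\\b")

plain_hits   <- grepl(plain, titles)
bounded_hits <- grepl(bounded, titles, perl = TRUE)
```

With these inputs, plain_hits flags the "crypto" title because it contains "cry", while bounded_hits does not.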

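Here is another tidyverse option. I collapse your keywords into a single searchable pattern (i.e., addict or addicted or ...). Then I use str_detect on title to check for any of those keywords and keep the rows where one is found (using filter).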

library(tidyverse)

df %>% 
  filter(str_detect(title, paste(keywords, collapse = "|")))

Or in base R, you can filter in one line:

df[grep(paste(keywords, collapse = "|"),df$title),]

Output

  id                                                             title created_utc
1  1         Anyone have a RH wallet yet? Asking delusion for a friend   164128421
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123

Data

df <-
 structure(
   list(
     id = 1:3,
     title = c(
       "Anyone have a RH wallet yet? Asking delusion for a friend",
       "Ravi Menon managing director of the Monetary Auth...",
       "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
     ),
     created_utc = c(164128421L, 164131283L, 164134123L)
   ),
   class = "data.frame",
   row.names = c(NA,-3L)
 )

keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

To filter on 2 columns, you can do this:

df %>%
  filter(Reduce(`|`, across(
    c(title, selftext), .fns = ~ str_detect(., paste(keywords, collapse = "|"))
  )))
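
The same row-wise OR across two columns can be sketched in base R. This is only an illustration: df2, its selftext values, and the shortened keyword vector are made up, since the example data in this answer has no selftext column.

```r
# Hypothetical data with a second text column
df2 <- data.frame(
  title = c("Asking for a friend", "NFT apps", "Monetary Auth"),
  selftext = c("I am afraid of losing it", "nothing here", "pure delusion"),
  stringsAsFactors = FALSE
)

keywords <- c("afraid", "anxiety", "delusion")
pattern <- paste(keywords, collapse = "|")

# One logical vector per column, then OR them together row by row
per_column <- lapply(df2[c("title", "selftext")], function(col) grepl(pattern, col))
keep <- Reduce(`|`, per_column)

# Keep rows where either column contains a keyword
result <- df2[keep, ]
```

Reduce() with `|` plays the same role here as in the dplyr version: a row is kept if any of the selected columns matches.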

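For this relatively small keyword list, concatenating with | is an option, but you run into problems when the string to match gets too large. The answers given so far also match "crypto" based on the keyword "cry". I adjusted df slightly to include the word "crypto".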

df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary crypto Auth...", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))

keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")

library(dplyr)
library(stringi)  # the stri_* functions below come from stringi, not stringr

df %>% 
  group_by(id) %>%
  filter(any(stri_trans_tolower(stri_extract_all_words(title)[[1]]) %in% keywords))

# # A tibble: 2 x 3
# # Groups:   id [2]
#      id title                                                             created_utc
#   <int> <chr>                                                                   <int>
# 1     1 Anyone have a RH wallet yet? Asking delusion for a friend           164128421
# 2     3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123
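
For comparison, a rough base R sketch of the same whole-word idea, without stringi or group_by. It assumes the titles are plain ASCII English: splitting on "[^a-z]+" is a simplification and will not handle accented letters or other scripts the way stri_extract_all_words does.

```r
keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
              "anxious", "cry", "crying", "delusion", "delusional")

titles <- c("Anyone have a RH wallet yet? Asking delusion for a friend",
            "Ravi Menon managing director of the Monetary crypto Auth...",
            "Different Augmented Reality(AR) NFT apps and anxiety and marke...")

# Lower-case each title and split on runs of non-letters to get whole words
words <- strsplit(tolower(titles), "[^a-z]+")

# A title is kept if any of its words is exactly one of the keywords
keep <- vapply(words, function(w) any(w %in% keywords), logical(1))
```

Because the comparison is on whole words with %in%, "crypto" is not matched by "cry", and no regular expression has to grow with the keyword list. With a data frame, df[keep, ] would then select the matching rows.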