How do I include rows from a dataframe that contain certain keywords

Question

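I am analysing reddit threads for an assignment, and I only want to include threads that contain certain keywords.

I have a list of keywords: keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

The dataframe has 3 columns. I only want to include rows that contain one of the keywords in the column called title.

For example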

	title	created_utc
1	Anyone have a RH wallet yet? Asking for a friend	164128421
2	Ravi Menon, managing director of the Monetary Auth...	164131283
3	Different Augmented Reality(AR) NFT apps and marke...	164134123

keywordstest2 <- paste0(keywords, collapse = "|")
dfsub %>% filter(grepl(keywordstest2, title))

I tried this, and obvs it didn't work.

Does anyone know how to do this? Thanks :D

Answer 1

This should work.

library(tidyverse)

dfsub %>% 
filter(grepl('addict|addicted|addiction|addictive|afraid|anxiety|anxious|cry|crying|delusion|delusional', title))

Answer 2

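You can try this. I extended the example so that it contains 2 of the keywords.

Edit: as Merijn mentioned in the comments, I added word boundaries (\b) to exclude false positives, since grepl does partial matching.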

library(dplyr)

keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")

# "\\b" is a regex word boundary; the backslash must be doubled inside an R string
df %>% filter(grepl(paste0("\\b", paste(keywords, 
  collapse = "\\b|\\b"), "\\b"), title))
  id                                                             title
1  1         Anyone have a RH wallet yet? Asking delusion for a friend
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...
  created_utc
1   164128421
2   164134123

Data

df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary Auth... crypto", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))
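
To make the effect of the word boundaries concrete, here is a minimal base R sketch. The titles and the shortened keyword list are made up for illustration and are not taken from the question's data. It compares plain alternation with a pattern wrapped in \\b.

```r
# Hypothetical titles chosen to show partial vs whole-word matching
titles <- c("I cry a lot", "crypto is up again", "feeling anxious today")
keywords <- c("cry", "anxious")

# Plain alternation: "cry" also matches inside "crypto"
loose <- paste(keywords, collapse = "|")
loose_hits <- grepl(loose, titles)

# Word boundaries around the whole group; "\\b" in R source is the regex \b,
# whereas a single "\b" would be a backspace character
strict <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, titles)

loose_hits   # TRUE TRUE TRUE
strict_hits  # TRUE FALSE TRUE
```

Wrapping the whole alternation in one group gives the same matches as putting \\b around each keyword, and keeps the pattern a little shorter.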

Answer 3

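Here is another tidyverse option. I collapse your keywords into a single search pattern (i.e., addict OR addicted OR ...). Then I use str_detect on title to look for any of those keywords, and keep the rows where one is found (using filter).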

library(tidyverse)

df %>% 
  filter(str_detect(title, paste(keywords, collapse = "|")))

Or in base R, you can filter in one line:

df[grep(paste(keywords, collapse = "|"),df$title),]

Output

  id                                                             title created_utc
1  1         Anyone have a RH wallet yet? Asking delusion for a friend   164128421
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123

Data

df <-
 structure(
   list(
     id = 1:3,
     title = c(
       "Anyone have a RH wallet yet? Asking delusion for a friend",
       "Ravi Menon managing director of the Monetary Auth...",
       "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
     ),
     created_utc = c(164128421L, 164131283L, 164134123L)
   ),
   class = "data.frame",
   row.names = c(NA,-3L)
 )

keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

To filter on 2 columns, you could do this:

df %>%
  filter(Reduce(`|`, across(
    c(title, selftext), .fns = ~ str_detect(., paste(keywords, collapse = "|"))
  )))
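
As a possible alternative, recent dplyr versions provide if_any() for this kind of "match in at least one column" filter. The sketch below is only an illustration: the posts data frame, its selftext contents, and the shortened keyword list are assumptions, not the asker's data.

```r
library(dplyr)

# Hypothetical data: selftext is assumed to hold the body of each post
posts <- data.frame(
  id = 1:3,
  title = c("Feeling anxious", "RH wallet question", "Market news"),
  selftext = c("no text", "I am afraid of losing money", "nothing relevant")
)
keywords <- c("afraid", "anxious")
pattern <- paste(keywords, collapse = "|")

# Keep rows where the pattern is found in at least one of the two columns
result <- posts %>%
  filter(if_any(c(title, selftext), ~ grepl(pattern, .x)))

result$id  # 1 2
```

Swapping if_any() for if_all() would instead keep only the rows where every listed column contains a keyword.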

Answer 4

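For this relatively small list of keywords, concatenating them with | is an option, but you will run into problems once the string to match becomes too large. The answers given so far also match "crypto" because of the keyword "cry". I tweaked the df slightly to include the word "crypto".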

df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary crypto Auth...", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))

keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")

library(dplyr)
library(stringi)  # stri_extract_all_words() and stri_trans_tolower() come from stringi

df %>% 
  group_by(id) %>%
  filter(any(stri_trans_tolower(stri_extract_all_words(title)[[1]]) %in% keywords))

# # A tibble: 2 x 3
# # Groups:   id [2]
#      id title                                                             created_utc
#   <int> <chr>                                                                   <int>
# 1     1 Anyone have a RH wallet yet? Asking delusion for a friend           164128421
# 2     3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123
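
The same whole-word idea can be sketched in base R without any packages. This is an assumption-laden illustration rather than a drop-in replacement: the titles and keywords are made up, and splitting on non-letter characters is a cruder notion of a "word" than stringi's word-boundary rules.

```r
# Hypothetical titles and a shortened keyword list
titles <- c("Asking delusion for a friend",
            "Monetary crypto Auth...",
            "NFT apps and Anxiety")
keywords <- c("cry", "anxiety", "delusion")

# Lower-case each title and split it on runs of non-letter characters
words <- strsplit(tolower(titles), "[^a-z]+")

# Keep a title if any of its words is exactly one of the keywords
keep <- vapply(words, function(w) any(w %in% keywords), logical(1))

keep  # TRUE FALSE TRUE
```

Because %in% compares whole words against a plain character vector, no regular expression has to be built, which sidesteps the problem of a very long pattern.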
