Using dplyr to filter rows which contain partial string of column

Question

Suppose I have a data frame like this:

term        cnt
apple        10
apples        5
a apple on    3
blue pears    3
pears         1

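How can I filter out every string in this column that contains another entry of the column as a partial match, so that the result is, for example: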

term     cnt
apple     10
pears      1

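I don't want to specify which terms to filter for (apple|pears). The filter should be self-referencing: it checks each term against the whole column and removes the terms that are a partial match of another one. The number of tokens is not limited, and neither is the form of the strings (i.e. "mapples" would get matched by "apple"). The result would be an inverse, generalized, dplyr-based version of: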

d[grep("^apple$|^pears$", d$term), ]

In addition, it would be interesting to use this grouping to get a cumulative sum per root term, e.g.:

term     cnt
apple     18
pears      4

I couldn't get this to work with contains() or grep().

Thanks

Answer 1

You could try something like this with the tidyverse:

1. Define the list of words:

     library(tidyverse)  # provides dplyr (select, filter, ...) and tidyr (separate)

     k <- dft %>% 
          select(term) %>% 
          unlist() %>% 
          unique()

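2. Operate on the data:

    dft %>%
      # split each term on non-alphanumeric characters and keep the first two words
      separate(term, c('t1', 't2')) %>%
      rowwise() %>%
      # g is 1 when the first word is itself one of the terms in k
      mutate( g = sum(t1 %in% k)) %>%
      filter( g > 0) %>%
      select(t1, cnt)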

This gives:

      t1   cnt
   <chr> <int>
1  apple    10
2 apples     5
3  pears     1

This still doesn't handle apple vs. apples. I'll keep working on it.
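One way to cover the apple/apples case is to test for substrings instead of splitting on whitespace. The following is only a sketch under a few assumptions: the data frame is named dft with character columns as in the question, matching is literal (fixed = TRUE, no regex), and contains_other() is a hypothetical helper introduced here for illustration.

```r
library(dplyr)

# Example data from the question (assumed to be stored as character, not factor)
dft <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10L, 5L, 3L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when `x` contains some *different* term
# from `terms` as a literal substring.
contains_other <- function(x, terms) {
  others <- setdiff(terms, x)
  any(vapply(others, function(o) grepl(o, x, fixed = TRUE), logical(1)))
}

# Keep only the rows whose term does not contain any other term of the column
roots <- dft %>%
  filter(!vapply(term, contains_other, logical(1),
                 terms = unique(term), USE.NAMES = FALSE))

roots
```

On the example data this keeps the apple and pears rows. Every term is compared with every other term, so the cost grows quadratically with the number of distinct terms.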

Answer 2

Hopefully a complete answer. It's not very idiomatic (as a Pythonista would put it), but maybe someone can suggest improvements:

> ssss <- data.frame(c('apple','red apple','apples','pears','blue pears'),c(15,3,10,4,3))
> 
> names(ssss) <- c('Fruit','Count')
> 
> ssss
       Fruit Count
1      apple    15
2  red apple     3
3     apples    10
4      pears     4
5 blue pears     3
> 
> root_list <- as.vector(ssss$Fruit[unlist(lapply(ssss$Fruit,function(x){length(grep(x,ssss$Fruit))>1}))])
> 
> 
> ssss %>% filter(ssss$Fruit %in% root_list)
  Fruit Count
1 apple    15
2 pears     4
> 
> data <- data.frame(lapply(root_list, function(x){y <- stringr::str_extract(ssss$Fruit,x); ifelse(is.na(y),'',y)}))
> 
> cols <- colnames(data)
> 
> #data$x <- do.call(paste0, c(data[cols]))
> #for (co in cols) data[co] <- NULL
> 
> ssss$Fruit <- do.call(paste0, c(data[cols]))
> 
> ssss %>% group_by(Fruit) %>% summarise(val = sum(Count))
# A tibble: 2 x 2
  Fruit   val
  <chr> <dbl>
1 apple    28
2 pears     7
>
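The root_list step above treats every term as a regular expression, which can misfire when a term contains characters such as "." or "(". A minimal sketch of the same idea with literal matching follows. It assumes the terms are a plain character vector (called fruit here, a name introduced for illustration) and uses the same criterion as the transcript: a term is a root when it occurs in more than one entry.

```r
# Terms from the transcript above, as a plain character vector
fruit <- c("apple", "red apple", "apples", "pears", "blue pears")

# A term counts as a root when it occurs literally inside more than one
# entry (itself plus at least one other entry).
is_root <- vapply(
  fruit,
  function(x) sum(grepl(x, fruit, fixed = TRUE)) > 1,
  logical(1),
  USE.NAMES = FALSE
)

root_list <- fruit[is_root]
root_list
```

With this criterion a term that appears only once and is not contained in any other entry is not treated as a root, so whether that is desirable depends on the data.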

Answer 3

Try this:

library(dplyr)

df=data.frame(term=c('apple','apples','a apple on','blue pears','pears'),cnt=c(10,5,3,3,1))

# matches[i, j] is TRUE when term i contains term j
matches = sapply(df$term,function(t,terms){grepl(pattern = t,x = terms)},df$term)

# for each term t, relabel as t every row that contains t and also contains another term
sapply(1:ncol(matches),function(t,mat){
  tempmat = mat[,t]&mat[,-t]
  indices=unlist(apply(tempmat,MARGIN = 2,which))
  df$term[indices]<<-df$term[t]
 },matches)

df%>%group_by(term)%>%summarize(cnt=sum(cnt))

 # A tibble: 2 x 2
 #  term   cnt
 #  <chr> <dbl>
 #1 apple    18
 #2 pears     4
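The same totals can also be computed without modifying df through <<-. The sketch below maps each term to a root and then sums per root. It rests on assumptions that are not in the answer above: the root of a term is taken to be the shortest term of the column that occurs literally inside it, terms are non-empty character strings, and find_root() is a hypothetical helper name.

```r
library(dplyr)

df <- data.frame(
  term = c("apple", "apples", "a apple on", "blue pears", "pears"),
  cnt  = c(10, 5, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the shortest term in `terms` that occurs
# literally inside `x` (this is `x` itself when nothing shorter matches).
find_root <- function(x, terms) {
  hits <- terms[vapply(terms, function(cand) grepl(cand, x, fixed = TRUE),
                       logical(1))]
  hits[which.min(nchar(hits))]
}

totals <- df %>%
  mutate(root = vapply(term, find_root, character(1),
                       terms = unique(term), USE.NAMES = FALSE)) %>%
  group_by(root) %>%
  summarise(cnt = sum(cnt))

totals
```

Picking the shortest contained term collapses chains such as apple, apples, pineapples into a single apple group. If two unrelated short terms both occur in one entry, the first shortest one wins, which may or may not be what is wanted.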
