Using dplyr and stringr to replace all values that start with a given string

My df:

> df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100))
>   df
          food      sold
1 fruit banana  99.47171
2  fruit apple  99.40878
3  fruit grape  99.28727
4        bread  99.15934
5         meat 100.53438

Now I want to replace all values in food that start with "fruit", then group by food and summarise sold as the sum of sold.

> df %>%
+     mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>% 
+     group_by(food) %>% 
+     summarise(sold = sum(sold))
Source: local data frame [3 x 2]

    food      sold
  (fctr)     (dbl)
1  bread  99.15934
2   meat 100.53438
3     NA 298.16776

Why doesn't this command work? Why does it give me NA instead of fruit?

replace() does not work as expected because the column food is a factor and "fruit" is not one of its levels.
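A minimal sketch of the failure, using a small made-up factor f rather than the question's data: assigning a value that is not among the levels yields NA (with an "invalid factor level" warning), while the same replacement on a character vector works.

```r
# Hypothetical example vector, not taken from the question's data
f <- factor(c("fruit banana", "bread"))

# "fruit" is not an existing level, so the replaced element becomes NA;
# suppressWarnings() hides the "invalid factor level, NA generated" warning
bad <- suppressWarnings(replace(f, 1, "fruit"))

# The same call on a character vector behaves as intended
good <- replace(as.character(f), 1, "fruit")
```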

One possible solution is to define the data frame column food with the correct factor levels, including "fruit":
df <- data.frame(food = 
  factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
    levels =c("fruit banana", "fruit apple", "fruit grape", "bread", "meat", "fruit") ), 
    sold = rnorm(5, 100))

Of course, it is easier to set stringsAsFactors = FALSE:

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
             sold = rnorm(5, 100), 
             stringsAsFactors = FALSE)
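If the data frame already exists with food stored as a factor, converting the column to character before replacing should have the same effect. The sketch below makes two assumptions: it sets stringsAsFactors = TRUE only to reproduce a factor column regardless of R version (since R 4.0.0 the default is FALSE), and it uses base R's startsWith() and aggregate() in place of the dplyr pipeline.

```r
set.seed(24)
# stringsAsFactors = TRUE is set on purpose to recreate the factor column
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
                 sold = rnorm(5, 100),
                 stringsAsFactors = TRUE)

# Convert the factor to character, then overwrite every value starting with "fruit"
df$food <- as.character(df$food)
df$food[startsWith(df$food, "fruit")] <- "fruit"

# Sum sold per food
res <- aggregate(sold ~ food, df, sum)
```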

It works for me; I think your food column is a factor.

Use stringsAsFactors = FALSE when creating the data, or run options(stringsAsFactors = FALSE) in your R session to avoid the problem:

library(dplyr)
library(stringr)

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100), stringsAsFactors = FALSE)

df %>%
  mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>% 
  group_by(food) %>% 
  summarise(sold = sum(sold))

Output:

 # A tibble: 3 × 2
       food      sold
      <chr>     <dbl>
    1 bread  99.67661
    2 fruit 300.28520
    3  meat  99.88566

We can do this in base R without converting to character class, by renaming every level that contains 'fruit' to 'fruit' and then using aggregate to get the sum:

levels(df$food)[grepl("fruit", levels(df$food))] <- "fruit"
aggregate(sold~food, df, sum)
#   food      sold
#1 bread  99.41637
#2 fruit 300.41033
#3  meat 100.84746
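The level assignment works because giving several levels the same name merges them into a single level. A small sketch with a made-up factor f (hypothetical values); it anchors the pattern with ^ so that only levels starting with "fruit" are renamed, which is a slight variation on the pattern used above:

```r
# Hypothetical factor with two distinct "fruit ..." levels
f <- factor(c("fruit banana", "fruit apple", "bread", "fruit apple"))

# Rename every level that starts with "fruit"; duplicate names are merged
levels(f)[grepl("^fruit", levels(f))] <- "fruit"

merged_levels <- levels(f)
values <- as.character(f)
```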

Data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", 
                 "bread", "meat"), sold = rnorm(5, 100))

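Although the question is tagged with dplyr and stringr, I would like to suggest an alternative solution using data.table, because data.table handles factors in a convenient and straightforward way: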

library(data.table)
setDT(df)[food %like% "^fruit", food := "fruit"][, .(sold = sum(sold)), by = food]
#    food      sold
#1: fruit 300.41033
#2: bread  99.41637
#3:  meat 100.84746

Data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
                 sold = rnorm(5, 100))

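Here are two alternative solutions that use forcats, stringr, and regular expressions to manipulate the factor levels directly.

If I understand correctly, the problem arises because food is a factor, which replace() does not handle well.

1. fct_collapse()

The fct_collapse() function is used to collapse all factor levels that start with "fruit " (note the trailing blank) into the single factor level "fruit":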

library(dplyr)
library(stringr)
library(forcats)
df %>%
  group_by(food = fct_collapse(food, fruit = levels(food) %>% str_subset("^fruit "))) %>% 
  summarise(sold = sum(sold))
  food         sold
  <fct>       <dbl>
1 bread        99.4
2 egg fruits  100. 
3 fruit       300. 
4 fruity wine 100. 
5 meat        101.

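Note that an enhanced sample data set was used, which includes edge cases to better test the regular expression. Also, the grouping variable is computed directly in group_by(), which saves a prior call to mutate().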
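To see why the trailing blank in the pattern matters, this sketch applies the pattern with and without it to the level names of the enhanced sample, typed out here as a plain character vector lv (a name introduced only for illustration):

```r
library(stringr)

# Level names of the enhanced sample data set, written out by hand
lv <- c("bread", "egg fruits", "fruit apple", "fruit banana",
        "fruit grape", "fruity wine", "meat")

# With the trailing blank only the real "fruit ..." levels are selected
with_blank <- str_subset(lv, "^fruit ")

# Without it, "fruity wine" is selected as well
without_blank <- str_subset(lv, "^fruit")
```

Neither pattern selects "egg fruits", because ^ anchors the match to the start of the string.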

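2. str_replace() with look-behind

There is a shorter solution that uses str_replace() instead of replace(), together with a more sophisticated regular expression. The regular expression uses a look-behind to remove all characters after a leading "fruit" (including the blank after "fruit"):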

df %>%
  group_by(food = str_replace(food, "(?<=^fruit)( .*)", "")) %>% 
  summarise(sold = sum(sold))

The result is the same as above, except that food is now a character column rather than a factor.
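As a quick check of the look-behind pattern in isolation, this sketch runs it on a short hand-written vector x (hypothetical values chosen to cover the edge cases):

```r
library(stringr)

# Hypothetical inputs: one real match and three edge cases
x <- c("fruit banana", "fruity wine", "egg fruits", "bread")

# Remove a blank and everything after it, but only directly after a leading "fruit"
out <- str_replace(x, "(?<=^fruit)( .*)", "")
```

Only "fruit banana" is shortened to "fruit". "fruity wine" is left alone because the character after the leading "fruit" is not a blank, and "egg fruits" because its "fruit" is not at the start of the string.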

Enhanced sample data set

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", 
                          "meat", "egg fruits", "fruity wine"), 
                 sold = rnorm(7, 100))
df
          food      sold
1 fruit banana  99.45412
2  fruit apple 100.53659
3  fruit grape 100.41962
4        bread  99.41637
5         meat 100.84746
6   egg fruits 100.26602
7  fruity wine 100.44459