Using dplyr and stringr to replace all values that start with a string
My df:
> df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100))
> df
food sold
1 fruit banana 99.47171
2 fruit apple 99.40878
3 fruit grape 99.28727
4 bread 99.15934
5 meat 100.53438
Now I want to replace all values in food that start with "fruit", then group by food and summarise sold as the sum of sold.
> df %>%
+ mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>%
+ group_by(food) %>%
+ summarise(sold = sum(sold))
Source: local data frame [3 x 2]
food sold
(fctr) (dbl)
1 bread 99.15934
2 meat 100.53438
3 NA 298.16776
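Why doesn't this command work? It gives me NA instead of fruit.
replace does not work as expected because the column food is a factor and "fruit" is not one of its levels.
One possible solution is to define the data frame column food with the correct factor levels: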
df <- data.frame(food =
factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
levels =c("fruit banana", "fruit apple", "fruit grape", "bread", "meat", "fruit") ),
sold = rnorm(5, 100))
Of course, it is easier to set stringsAsFactors = FALSE:
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
sold = rnorm(5, 100),
stringsAsFactors = FALSE)
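To see the failure in isolation, here is a minimal base R sketch. The vectors food, bad, and good are made up for illustration and are not part of the question's data; grepl() stands in for str_detect() so that no package is needed.

```r
# A small factor with the same kind of values as the question's column
food <- factor(c("fruit banana", "bread", "meat"))

# replace() assigns with `[<-`; "fruit" is not an existing level, so R
# warns ("invalid factor level, NA generated") and stores NA instead
bad <- suppressWarnings(replace(food, grepl("^fruit", food), "fruit"))
is.na(bad)

# On a character vector the same call works as intended
good <- replace(as.character(food), grepl("^fruit", food), "fruit")
good
```

The first element of bad is NA, which is where the NA group in the question's summary comes from.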
It works for me; I think your food column is a factor.
Use stringsAsFactors = FALSE when creating the data, or run options(stringsAsFactors = FALSE) in your R session, to avoid this situation:
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100), stringsAsFactors = FALSE)
df %>%
mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>%
group_by(food) %>%
summarise(sold = sum(sold))
Output:
# A tibble: 3 × 2
food sold
<chr> <dbl>
1 bread 99.67661
2 fruit 300.28520
3 meat 99.88566
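If the data frame already exists with a factor column, the conversion can also happen inside the pipeline. The sketch below is one way to do that, not the only one: it forces stringsAsFactors = TRUE to reproduce the question's situation (since R 4.0.0 the default in data.frame() is FALSE), and the name res and the anchored pattern "^fruit" are my own choices.

```r
library(dplyr)
library(stringr)

set.seed(24)
# Force a factor column, as older R versions did by default
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
                 sold = rnorm(5, 100),
                 stringsAsFactors = TRUE)

res <- df %>%
  # Convert to character first, then relabel everything starting with "fruit"
  mutate(food = as.character(food),
         food = if_else(str_detect(food, "^fruit"), "fruit", food)) %>%
  group_by(food) %>%
  summarise(sold = sum(sold))

res
```

Anchoring the pattern with ^ matches only values that begin with "fruit", which is what the question asks for.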
We can do this in base R without converting to character class, by assigning 'fruit' to the levels that contain 'fruit' and using aggregate to get the sum:
levels(df$food)[grepl("fruit", levels(df$food))] <- "fruit"
aggregate(sold~food, df, sum)
# food sold
#1 bread 99.41637
#2 fruit 300.41033
#3 meat 100.84746
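This works because assigning the same name to several levels merges them into one. A small sketch on a throwaway factor f (a hypothetical name, not from the answer) shows the effect step by step:

```r
f <- factor(c("fruit banana", "fruit apple", "bread"))
levels(f)
# "bread" "fruit apple" "fruit banana"

# Rename every level that starts with "fruit"; the duplicates are merged
levels(f)[grepl("^fruit", levels(f))] <- "fruit"
levels(f)
# "bread" "fruit"

# The values follow their levels, so no NA is produced
as.character(f)
# "fruit" "fruit" "bread"
```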
Data
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape",
"bread", "meat"), sold = rnorm(5, 100))
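Although the question is tagged dplyr and stringr, I would like to propose an alternative solution using data.table, because data.table handles factors in a convenient and straightforward way: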
library(data.table)
setDT(df)[food %like% "^fruit", food := "fruit"][, .(sold = sum(sold)), by = food]
# food sold
#1: fruit 300.41033
#2: bread 99.41637
#3: meat 100.84746
Data
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
sold = rnorm(5, 100))
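Here are two alternative solutions that use forcats, stringr, and regular expressions to manipulate the factor levels directly.
If I understand correctly, the problem is caused by food being a factor, which replace() does not handle properly.
1. fct_collapse()
The fct_collapse() function is used to collapse all factor levels that start with "fruit " (note the trailing blank) into the factor level "fruit":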
library(dplyr)
library(stringr)
library(forcats)
df %>%
group_by(food = fct_collapse(food, fruit = levels(food) %>% str_subset("^fruit "))) %>%
summarise(sold = sum(sold))
food sold
<fct> <dbl>
1 bread 99.4
2 egg fruits 100.
3 fruit 300.
4 fruity wine 100.
5 meat 101.
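Note that an enhanced sample data set was used, which includes edge cases to better test the regular expression. In addition, the grouping variable is computed directly in group_by(), which saves a prior call to mutate().
2. str_replace() with look-behind
There is a much shorter solution that uses str_replace() instead of replace() together with a more sophisticated regular expression. The regular expression uses a look-behind to remove all characters after the leading "fruit" (including the blank after "fruit"):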
df %>%
group_by(food = str_replace(food, "(?<=^fruit)( .*)", "")) %>%
summarise(sold = sum(sold))
The result is the same as above.
Enhanced sample data set
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread",
"meat", "egg fruits", "fruity wine"),
sold = rnorm(7, 100))
df
food sold
1 fruit banana 99.45412
2 fruit apple 100.53659
3 fruit grape 100.41962
4 bread 99.41637
5 meat 100.84746
6 egg fruits 100.26602
7 fruity wine 100.44459
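The edge cases are easier to judge when the patterns are applied to a plain character vector. The following base R sketch is only an illustration: x, loose, lookbehind, and captured are hypothetical names, and sub() with perl = TRUE stands in for str_replace().

```r
x <- c("fruit banana", "bread", "egg fruits", "fruity wine")

# An unanchored pattern also hits "egg fruits" and "fruity wine"
loose <- grepl("fruit", x)

# Look-behind: delete a blank and everything after it, but only
# directly after a leading "fruit"
lookbehind <- sub("(?<=^fruit) .*", "", x, perl = TRUE)

# Equivalent without look-behind: keep the captured prefix
captured <- sub("^(fruit) .*", "\\1", x)

lookbehind
# "fruit" "bread" "egg fruits" "fruity wine"
```

Both replacements leave "egg fruits" and "fruity wine" untouched, because neither has "fruit" followed by a blank at the start of the string.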