Using dplyr to summarize and keep the same variable name
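I found that data.table and dplyr give different results when trying to do the same thing. I would like to use dplyr syntax but have it compute the way data.table does. The use case is adding subtotals to a table. To do that, I need to aggregate each variable and then keep the same variable name (now holding the transformed version). data.table lets me aggregate a variable and keep the same name; a second aggregation that uses that variable still sees the untransformed version. dplyr, however, uses the transformed version.
The summarize documentation says: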
# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))
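This is basically the problem I am running into, and I would like to know whether there is a good workaround. One thing I found is to give the transformed variable a different name and rename it at the end, but that does not look very nice to me. If there is a good way to do subtotals, that would also be good to know. I have looked around this site and have not seen this exact situation discussed. Any help would be greatly appreciated!
Here is a simple example, once with data.table and once with dplyr. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).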
library(data.table)
library(dplyr)

# data.table version: gives the result I want
dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
# Count_Dist inside the second expression is still the per-row column
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]
dt <- rbind(dt[, .(Group, Count_Dist, Total)],
            dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")
dt

# dplyr version: Count_Dist is overwritten before Weighted_Total is computed
df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>%
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>%
  rename(Weighted_Avg_Total = Total)
Thanks again for your help!
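One possible solution is to skip the intermediate mutate step: use transmute in place of the first mutate/select step, and compute the values needed in summarize directly from the original columns, so that no intermediate variable has to be created for the second step: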
df %>%
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
which gives:
  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748
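As a side note, the subtotal is a weighted mean of Total with Count as the weights, so base R's weighted.mean() can compute it straight from the raw columns, which sidesteps the name clash. This is only a sketch on the same df as in the question; the names subtotal and result are made up for illustration and are not part of the original answer.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

# One-row subtotal: weighted.mean(x, w) is sum(x * w) / sum(w),
# so no intermediate Count_Dist column is needed here.
subtotal <- df %>%
  summarize(Group = "All",
            Count_Dist = sum(Count / sum(Count)),
            Weighted_Avg_Total = weighted.mean(Total, Count))

# Per-group rows followed by the subtotal row.
result <- df %>%
  transmute(Group, Count_Dist = Count / sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(subtotal)

result
```

The last row should match the "All" row shown above; whether this reads better than spelling out the sum is a matter of taste.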
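Another possible solution is to change the order in which the new variables are computed in dplyr, and then use select to restore the column order you originally wanted: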
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>%
              select(Group, Count_Dist, Weighted_Avg_Total))
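To see why the order matters: summarize evaluates its expressions one after another, so a later expression sees whatever an earlier one assigned to the same name. A minimal sketch on the same data (the names with_dist, wrong and right are only for illustration):

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

with_dist <- df %>% mutate(Count_Dist = Count / sum(Count))

# Count_Dist is overwritten first, so the second expression multiplies
# Total by the already-summarized value (about 1).
wrong <- with_dist %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist * Total))

# Weighted_Total is computed first, while Count_Dist is still the
# original per-row column.
right <- with_dist %>%
  summarize(Weighted_Total = sum(Count_Dist * Total),
            Count_Dist = sum(Count_Dist))

wrong$Weighted_Total  # about 2350, i.e. the plain sum(Total)
right$Weighted_Total  # about 656.07, the weighted average
```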
If you also want to include the Count column, you can do this (as per my comment below):
df %>%
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))
An alternative is to use mutate twice to compute Weighted_Total, and then take the sum of that column in summarize:
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  mutate(Weighted_Total = Count_Dist*Total) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Weighted_Total))
Result:
  Count_Dist Weighted_Total
1          1     656.074766
And:
df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>%
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          mutate(Weighted_Total = Count_Dist*Total) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Weighted_Total))) %>%
  rename(Weighted_Avg_Total = Total)
Result:
  Group   Count_Dist Weighted_Avg_Total
1     A 0.0934579439          50.000000
2     B 0.1401869159         300.000000
3     C 0.1121495327         600.000000
4     D 0.1869158879         400.000000
5     E 0.4672897196        1000.000000
6   All 1.0000000000         656.074766