How do I create new columns based on the values of a different column and calculate the percentage of another numeric column in R?

Question

Sample data frame:

no <- rep(1:5, each=2)
type <- rep(LETTERS[1:2], times=5)
set.seed(4)
value <- round(runif(10, 10, 30))

df <- data.frame(no, type, value)

df

   no type value
1   1    A    22
2   1    B    10
3   2    A    16
4   2    B    16
5   3    A    26
6   3    B    15
7   4    A    24
8   4    B    28
9   5    A    29
10  5    B    11

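What I want is to calculate, for each no, the percentage of value that each type (A or B) contributes, and to put those percentages in separate columns. The desired output looks like this: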

  no    pct_A    pct_B total_value
1  1 68.75000 31.25000          32
2  2 50.00000 50.00000          32
3  3 63.41463 36.58537          41
4  4 46.15385 53.84615          52
5  5 72.50000 27.50000          40

What I have tried so far (it gives the correct output, but the process seems very suboptimal):

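library(dplyr)

# Step 1: add the per-no total to every row
df %>%
  group_by(no) %>%
  mutate(total_value = sum(value)) -> df

# Step 2: compute each type's share, collapse to one row per no,
# then merge the totals back in and drop the leftover columns
df %>%
  mutate(pct_A = ifelse(type == 'A', (value / total_value) * 100, 0),
         pct_B = ifelse(type == 'B', (value / total_value) * 100, 0)) %>%
  group_by(no) %>%
  summarise(pct_A = sum(pct_A),
            pct_B = sum(pct_B)) %>%
  ungroup() %>%
  merge(df) %>%
  distinct(no, .keep_all = TRUE) %>%
  select(-type, -value)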

Is there a better way to do this, preferably with dplyr?

I also looked at other answers, but none of them helped. This one came closest:

Answer 1

For each no, we can compute the sum and each row's share of it, then reshape the data to wide format.

library(dplyr)
library(tidyr)

df %>%
  group_by(no) %>%
  mutate(total_value = sum(value),
         value = prop.table(value) * 100) %>%
  ungroup() %>%
  pivot_wider(names_from = type, values_from = value, names_prefix = 'pct_')

#     no total_value pct_A pct_B
#  <int>       <dbl> <dbl> <dbl>
#1     1          32  68.8  31.2
#2     2          32  50    50  
#3     3          41  63.4  36.6
#4     4          52  46.2  53.8
#5     5          40  72.5  27.5
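
When the set of types is known in advance, the pivot step can also be skipped. The following is only an illustrative sketch, not part of the approach above: it hard-codes the two types A and B, and it rebuilds df from the values printed in the question so that it runs on its own. The name res is made up for the example.

```r
library(dplyr)

# Rebuild the example data from the values printed in the question.
df <- data.frame(
  no    = rep(1:5, each = 2),
  type  = rep(LETTERS[1:2], times = 5),
  value = c(22, 10, 16, 16, 26, 15, 24, 28, 29, 11)
)

# One summarise() per group: pick out each type's value by logical
# subsetting and divide by the group total.
res <- df %>%
  group_by(no) %>%
  summarise(pct_A       = sum(value[type == "A"]) / sum(value) * 100,
            pct_B       = sum(value[type == "B"]) / sum(value) * 100,
            total_value = sum(value))

res
```

This yields the columns in the order shown in the question (no, pct_A, pct_B, total_value), at the cost of one hand-written line per type, so it does not scale to many types the way pivot_wider does.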

Answer 2

Here are two more ways to do this.

We can use purrr::map_dfc, although setting the correct column names is a bit cumbersome:

library(dplyr)
library(purrr)

df %>% 
  group_by(no) %>% 
  summarise(total_value = sum(value),
            map_dfc(unique(type) %>% set_names(., paste0("pct_",.)), 
                    ~ sum((type == .x) * value) / total_value * 100)
  )

#> # A tibble: 5 x 4
#>      no total_value pct_A pct_B
#>   <int>       <dbl> <dbl> <dbl>
#> 1     1          32  68.8  31.2
#> 2     2          32  50    50  
#> 3     3          41  63.4  36.6
#> 4     4          52  46.2  53.8
#> 5     5          40  72.5  27.5

Alternatively, we can use dplyover::over (disclaimer: I am the maintainer), which lets us create the names on the fly in an across-like way:

library(dplyover) # https://github.com/TimTeaFan/dplyover

df %>% 
  group_by(no) %>% 
  summarise(total_value = sum(value),
            over(dist_values(type), # alternatively `unique(type)`
                 ~ sum((type == .x) * value) / total_value * 100,
                 .names = "pct_{x}")
            )

#> # A tibble: 5 x 4
#>      no total_value pct_A pct_B
#>   <int>       <dbl> <dbl> <dbl>
#> 1     1          32  68.8  31.2
#> 2     2          32  50    50  
#> 3     3          41  63.4  36.6
#> 4     4          52  46.2  53.8
#> 5     5          40  72.5  27.5

Created on 2021-09-17 by the reprex package (v2.0.1)

Performance-wise, both approaches should be faster than data-rectangling functions such as pivot_wider, although I have not benchmarked this specific case.

Answer 3

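You can do this in base R with aggregate. Note that the call relies on the rows within each no being ordered A, then B, and that the proportions are multiplied by 100 to give percentages as in the desired output. The \(x) lambda and the |> pipe require R 4.1 or later.

do.call(data.frame, aggregate(value ~ no, df, \(x) c(proportions(x) * 100, sum(x)))) |>
  setNames(c('no', 'pct_A', 'pct_B', 'total_value'))
#   no    pct_A    pct_B total_value
# 1  1 68.75000 31.25000          32
# 2  2 50.00000 50.00000          32
# 3  3 63.41463 36.58537          41
# 4  4 46.15385 53.84615          52
# 5  5 72.50000 27.50000          40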
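
If the rows cannot be assumed to arrive in A, B order within each no, a cross-tabulation may be a safer base R route. This is a minimal sketch under that assumption rather than a replacement for the aggregate call: it uses xtabs and prop.table, rebuilds df from the values printed in the question, and the names tab, pct and res are made up for the example.

```r
# Rebuild the example data from the values printed in the question.
df <- data.frame(
  no    = rep(1:5, each = 2),
  type  = rep(LETTERS[1:2], times = 5),
  value = c(22, 10, 16, 16, 26, 15, 24, 28, 29, 11)
)

# Sum value for every no/type combination. Columns are matched by type
# name, so the row order of df does not matter.
tab <- xtabs(value ~ no + type, data = df)

# Row-wise proportions, scaled to percentages.
pct <- prop.table(tab, margin = 1) * 100

# Assemble a plain data frame in the layout requested in the question.
res <- data.frame(
  no          = as.integer(rownames(tab)),
  pct_A       = as.vector(pct[, "A"]),
  pct_B       = as.vector(pct[, "B"]),
  total_value = as.vector(rowSums(tab))
)

res
```

Because the table is indexed by column name, shuffling the rows of df (for example with df[sample(nrow(df)), ]) should leave res unchanged.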
