How to calculate the percentage in different rows of one column?

Question

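I'm trying to calculate each value as a percentage of the total for its occupation and year. For example, using the df below, the percentage for the first row would be: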

665 / (665 + 709) * 100 = 48.4

I was able to use aggregate to calculate the mean, but I'm stuck on how to calculate the percentage: aggregate(x = df$value, by = list(df$occupation, df$year), FUN = mean)

df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = c(rep(0, 4), rep(1, 4)),
  occupation = rep(c(1:4), 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346, 808, 959, 651, 468, 756, 832, 1140, 431)
)

Answer 1

I think the answer you're looking for is:

aggregate(
  x = df$value,
  by = list(df$occupation, df$year),
  FUN = function(x) {
    round(x / sum(x) * 100, 1)
  }
)

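Essentially, the crux of the answer is the FUN argument; to calculate percentages, you need a function that tells R what to do with each group when aggregating. Because R has a built-in mean function, you can simply pass mean to FUN when computing means. The functional programming chapter of Hadley Wickham's Advanced R has a lot of detail on building named and anonymous functions.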
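To illustrate the point about named versus anonymous functions, here is a minimal sketch that passes a named function to FUN instead. The name pct_of_group is hypothetical and chosen only for this example; the data is the df from the question, with the recycled sex column written as in the original.

```r
# Sample data from the question
df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = c(rep(0, 4), rep(1, 4)),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)

# pct_of_group is a hypothetical helper name, not part of base R.
# It returns each element of x as a percentage of the group total.
pct_of_group <- function(x) {
  round(x / sum(x) * 100, 1)
}

res <- aggregate(
  x = df$value,
  by = list(occupation = df$occupation, year = df$year),
  FUN = pct_of_group
)

# FUN returns one value per row of each group, so res$x is a matrix
# with one column per group member (here: the sex 0 row and the sex 1 row).
res$x[1, ]  # percentages for occupation 1 in 2003
```

Note that because the function returns more than one value per group, the x column of the result is a matrix rather than a plain vector, which is part of why the output is usable but awkward to work with.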

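That said, for data manipulation tasks like this one, packages such as dplyr really excel at making the task less complicated and easier to read. You can use the aggregate answer above, but unless you have a reason to stick with base R (for example, you are building a package and want to avoid dependencies), an additional package can make your code more readable and maintainable.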

library(dplyr)
output <- 
  df %>%
  group_by(year, occupation) %>%
  mutate(percent = round(value / sum(value) * 100, 1))

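Another benefit of this approach is that it adds the result to the original data structure in a much cleaner way than aggregate, which by default produces usable but not pretty results.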
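As a rough sketch of what "adds to the original data structure" means in practice, the snippet below repeats the dplyr pipeline on the question's data and checks the shape of the result. The trailing ungroup() call is my addition, not part of the original answer; it simply drops the grouping so later steps operate on the whole table.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = c(rep(0, 4), rep(1, 4)),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)

output <- df %>%
  group_by(year, occupation) %>%
  mutate(percent = round(value / sum(value) * 100, 1)) %>%
  ungroup()  # assumption: added here so downstream code is not grouped

# mutate() keeps every original row and column and appends percent,
# so the result has the same number of rows as df.
nrow(output) == nrow(df)
names(output)
```

By contrast, the aggregate call collapses the data to one row per occupation and year, so the percentages would have to be matched back to the original rows by hand.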

This vignette has a bunch of great examples of these types of data manipulation tasks. The dplyr/tidyr cheatsheet is also helpful for this kind of task.

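My answer relies on dplyr because it is my go-to tool; there are certainly others (plyr, data.table) that may be better suited to a given task. For this problem I still prefer dplyr, but I mention the other options because it is always worth considering the best tool for the job.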
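For comparison, here is a minimal sketch of how the same calculation might look in data.table, one of the alternatives mentioned above. This is not from the original answer; it assumes the data.table package is installed and uses its := operator to add the column within each group.

```r
library(data.table)

# Sample data from the question, converted to a data.table
df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = c(rep(0, 4), rep(1, 4)),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)
dt <- as.data.table(df)

# := adds the percent column by reference; by = defines the groups,
# so sum(value) is the total for each year and occupation.
dt[, percent := round(value / sum(value) * 100, 1), by = .(year, occupation)]

head(dt)
```

As with the dplyr version, every original row is kept and the percentage is appended as a new column.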
