dplyr mean group on Long Format Data
I can't figure out how to compute a simple mean on long format data using dplyr.
My data looks like this:
hldid idno sex diary age
1 1294 1294_1 2 1 39
2 1294 1294_1 2 2 39
3 1294 1294_2 1 1 43
4 1294 1294_2 1 2 43
...
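There are 5 variables: hldid, idno, sex, diary and age.
idno is a personal identifier, but it is not a unique key.
Each person appears twice, once for each diary they filled in.
What I want is simply to compute the mean of age by sex.
Can you help me?
I tried something like this: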
dta %>%
group_by(sex) %>%
mutate( ng = n_distinct(idno)) %>%
group_by(age, add=TRUE) %>%
summarise(mean = n()/ng[1] )
But it does not work.
Data:
dta = structure(list(hldid = c(1294, 1294, 1294, 1294, 1352, 1352,
1352, 1352, 3741, 3741, 3741, 3741, 3809, 3809, 3809, 3809, 4037,
4037, 4037, 4037), idno = c("1294_1", "1294_1", "1294_2", "1294_2",
"1352_1", "1352_1", "1352_2", "1352_2", "3741_1", "3741_1", "3741_2",
"3741_2", "3809_1", "3809_1", "3809_2", "3809_2", "4037_1", "4037_1",
"4037_2", "4037_2"), sex = c(2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), diary = c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L), age = c(39L, 39L, 43L, 43L, 31L, 31L, 37L, 37L,
33L, 33L, 37L, 37L, 34L, 34L, 37L, 37L, 41L, 41L, 32L, 32L)), .Names = c("hldid",
"idno", "sex", "diary", "age"), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L), vars = list(hldid), drop = TRUE, indices = list(
0:3, 4:7, 8:11, 12:15, 16:19), group_sizes = c(4L, 4L, 4L,
4L, 4L), biggest_group_size = 4L, labels = structure(list(hldid = c(1294,
1352, 3741, 3809, 4037)), class = "data.frame", row.names = c(NA,
-5L), .Names = "hldid", vars = list(hldid)))
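Quick update
Maybe this does not apply to this example, but the kind of problem I have in mind is the following.
Suppose we have data like this: 3 women and 2 men, and a dummy variable act.
If we compute the mean without taking the long format into account, we run into a problem: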
aggregate(act ~ sex, FUN = mean, data = dtaTime)
What we should do instead is:
aggregate(act ~ sex, FUN = sum, data = dtaTime)
6 / 2 # men
10 / 3 # women
Data:
dtaTime = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L),
sex = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), act = c(1L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L)), .Names = c("id", "sex",
"act"), class = "data.frame", row.names = c(NA, -25L))
You are overcomplicating things.
dta %>%
group_by(sex) %>%
summarise(meanage = mean(age))
This should give you the mean age by sex.
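If the worry is that each person is counted once per diary, one option is to reduce the data to one row per idno before averaging. The following is only a sketch: it rebuilds the values of dta from the question as a plain data frame (the grouping attributes of the original dput output are dropped), and the name mean_age_by_sex is made up for illustration.

```r
library(dplyr)

# Same values as the question's `dta`, rebuilt as an ungrouped data frame
dta <- data.frame(
  hldid = rep(c(1294, 1352, 3741, 3809, 4037), each = 4),
  idno  = rep(c("1294_1", "1294_2", "1352_1", "1352_2", "3741_1",
                "3741_2", "3809_1", "3809_2", "4037_1", "4037_2"), each = 2),
  sex   = rep(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L), each = 2),
  diary = rep(1:2, times = 10),
  age   = rep(c(39L, 43L, 31L, 37L, 33L, 37L, 34L, 37L, 41L, 32L), each = 2),
  stringsAsFactors = FALSE
)

# Keep the first row for each person, then average age within each sex
mean_age_by_sex <- dta %>%
  distinct(idno, .keep_all = TRUE) %>%
  group_by(sex) %>%
  summarise(meanage = mean(age))

mean_age_by_sex
```

With exactly two rows per person, as here, this returns the same means as the plain group_by(sex) version (39 for sex 1 and 33.8 for sex 2). The two approaches only differ when people contribute unequal numbers of rows.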
A base R alternative:
aggregate(age ~ sex, dta, mean)
A data.table alternative:
library(data.table)
setDT(dta)[, .(meanage = mean(age)), by = sex]
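For the dtaTime case in the question's update, the "sum divided by number of people" calculation can probably be done in a single dplyr step with n_distinct(). This is a sketch under the assumption that id identifies a person. The column names total_act, persons and act_per_person, and the object name res, are invented here and do not come from the question.

```r
library(dplyr)

# Same values as the question's `dtaTime`: 5 people, 5 rows each
dtaTime <- data.frame(
  id  = rep(1:5, each = 5),
  sex = rep(c(1L, 2L, 1L, 2L, 2L), each = 5),
  act = c(1L, 1L, 0L, 1L, 0L,
          1L, 1L, 1L, 0L, 0L,
          0L, 0L, 1L, 1L, 1L,
          0L, 1L, 1L, 1L, 1L,
          1L, 1L, 0L, 0L, 1L)
)

# Total act per sex divided by the number of distinct people of that sex
res <- dtaTime %>%
  group_by(sex) %>%
  summarise(total_act      = sum(act),
            persons        = n_distinct(id),
            act_per_person = total_act / persons)

res
```

On this data it reproduces the 6 / 2 and 10 / 3 from the question. Because every id has exactly five rows, the result is just mean(act) per sex multiplied by 5, so the per-row mean from aggregate() is on a different scale rather than wrong. The n_distinct() version matters more when people have different numbers of rows.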