Sum of numbers works in a data frame, but not in a tibble
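As an exercise while learning R, I wanted to sum all the numbers in one column of a tibble, using the example dataset forcats::gss_cat. I want to look at marital status by age: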
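library(dplyr)    # filter(), group_by(), count(), mutate(), %>%
library(forcats)  # provides the gss_cat dataset
library(ggplot2)  # ggplot(), geom_line()
by_ag <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count()
by_age <- by_ag %>%
  mutate(prop = n/sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)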
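This is what I get:
Obviously the computed proportions are not real proportions, because sum(n) is actually equal to n. To help pin down the problem, I created a small data frame: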
df <- data.frame(type = c("new", "old", "don't know"), number = c(20, 12, 34))
and computed a proportion on it as well:
df %>%
  mutate(prop = number/sum(number))
This works as expected:
# A tibble: 3 x 3
type number prop
<chr> <dbl> <dbl>
1 new 20.0 0.303
2 old 12.0 0.182
3 don't know 34.0 0.515
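So I converted my initial tibble to a data frame and reran the code:
by_age <- as.data.frame(by_ag) %>%
  mutate(prop = n/sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)
and got a perfect plot:
My preliminary conclusion was therefore that the problem came from having a tibble in the first place. To test this hypothesis, I also created a new tibble: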
df <- tibble(type = c("new", "old", "don't know"), number = c(20, 12, 34))
df %>%
  mutate(prop = number/sum(number))
and then I got completely confused, because here the proportions are computed without any problem:
# A tibble: 3 x 3
type number prop
<chr> <dbl> <dbl>
1 new 20.0 0.303
2 old 12.0 0.182
3 don't know 34.0 0.515
So why does sum(n) not work in my initial example?
I should add that this comes from an exercise in R for Data Science (working with factors), where they do not ungroup:
So what could be the reason?
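Here 'by_ag' is a grouped object, so the sum of 'n' is computed within each group. Each (age, marital) group holds a single row, which is why sum(n) equals n. One option is to extract the whole column, i.e. .$n
by_ag %>%
  mutate(prop = n/sum(.$n))
or to ungroup the object and then take the sum
by_ag %>%
  ungroup() %>%
  mutate(prop = n/sum(n))
To illustrate the difference, using the OP's 'df'
df %>%
  group_by(type) %>%
  mutate(Sum = sum(number))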
# A tibble: 3 x 3
# Groups: type [3]
# type number Sum
# <fctr> <dbl> <dbl>
#1 new 20.0 20.0
#2 old 12.0 12.0
#3 don't know 34.0 34.0
df %>%
  group_by(type) %>%
  mutate(Sum = sum(.$number))
# A tibble: 3 x 3
# Groups: type [3]
# type number Sum
# <fctr> <dbl> <dbl>
#1 new 20.0 66.0
#2 old 12.0 66.0
#3 don't know 34.0 66.0
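Before calling mutate(), it can help to check whether a tibble still carries groups. The sketch below rebuilds the OP's 'df' and uses dplyr's is_grouped_df() and group_vars() for that check; the names grouped, grouped_prop and overall_prop are made up for illustration.

```r
library(dplyr)

# Same values as the OP's 'df'
df <- tibble(type = c("new", "old", "don't know"), number = c(20, 12, 34))

grouped <- df %>% group_by(type)

is_grouped_df(df)       # plain tibble: no groups
is_grouped_df(grouped)  # grouped tibble
group_vars(grouped)     # names of the grouping columns

# Every group has exactly one row, so the within-group sum equals the value itself
grouped_prop <- grouped %>%
  mutate(prop = number / sum(number))

# After ungroup(), sum() runs over the whole column
overall_prop <- grouped %>%
  ungroup() %>%
  mutate(prop = number / sum(number))
```

With the groups in place every prop is 1; after ungroup() the values sum to 1 across the column instead.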
Based on the OP's comment, the exercise here uses a single grouping variable, and that grouping is stripped after summarise
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
str(relig_summary)
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 15 obs. of 4 variables:
# $ relig : Factor w/ 16 levels "No answer","Don't know",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ age : num 49.5 35.9 40 38.9 40.1 ...
# $ tvhours: num 2.72 4.62 2.87 3.46 2.79 ...
# $ n : int 93 15 109 23 689 95 104 32 71 147 ...
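As far as I know, summarise() by default peels off only the last grouping variable, which is why a single grouping variable leaves an ungrouped result. A minimal sketch on made-up data (toy, g1, g2 and value are hypothetical names, not from the exercise); recent dplyr versions may print an informational message about the remaining grouping:

```r
library(dplyr)

# Hypothetical toy data with two candidate grouping columns
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# One grouping variable: summarise() returns an ungrouped tibble
one_var <- toy %>%
  group_by(g1) %>%
  summarise(total = sum(value))

# Two grouping variables: the result is still grouped by the first one
two_vars <- toy %>%
  group_by(g1, g2) %>%
  summarise(total = sum(value))

group_vars(one_var)   # no grouping columns left
group_vars(two_vars)  # still grouped by g1
```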
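In the OP's code there are two grouping variables instead of one, and count() keeps them, so the result is still a grouped_df:
by_ag <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count()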
str(by_ag) #check the grouped_df class
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 351 obs. of 3 variables:
# $ age : int 18 18 19 19 19 19 20 20 20 20 ...
# $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 6 2 4 5 6 2 3 4 6 ...
# $ n : int 89 2 234 3 1 11 227 1 2 21 ...
# - attr(*, "vars")= chr "age" "marital"
# - attr(*, "drop")= logi TRUE
# - attr(*, "indices")=List of 351
When we convert to a data.frame, the grouping attributes are lost
as.data.frame(by_ag) %>%
  str()
#'data.frame': 351 obs. of 3 variables:
#$ age : int 18 18 19 19 19 19 20 20 20 20 ...
#$ marital: Factor w/ 6 levels "No answer","Never married",..: 2 6 2 4 5 6 2 3 4 6 ...
#$ n : int 89 2 234 3 1 11 227 1 2 21 ...
which is similar to what ungroup does
by_ag %>%
  ungroup() %>%
  str()
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 351 obs. of 3 variables:
# $ age : int 18 18 19 19 19 19 20 20 20 20 ...
# $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 6 2 4 5 6 2 3 4 6 ...
# $ n : int 89 2 234 3 1 11 227 1 2 21 ...
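If the goal is the share of each marital status within each age (an assumption about what the OP wants to plot), one possible approach is to count without grouping first and then group by age only. The sketch below uses a small made-up tibble rather than gss_cat; toy, counted_grouped, counted_plain and by_age_toy are illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for gss_cat with just the two relevant columns
toy <- tibble(
  age = c(18, 18, 18, 19, 19),
  marital = c("Never married", "Never married", "Married", "Married", "Divorced")
)

# group_by() + count() keeps both grouping variables on the result
counted_grouped <- toy %>%
  group_by(age, marital) %>%
  count()

# count(age, marital) on an ungrouped tibble returns an ungrouped result
counted_plain <- toy %>%
  count(age, marital)

group_vars(counted_grouped)  # both grouping columns are still attached
group_vars(counted_plain)    # no grouping columns

# Grouping by age only makes sum(n) the total per age
by_age_toy <- counted_plain %>%
  group_by(age) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

Here the proportions sum to 1 within each age rather than over the whole table, which differs from the sum(.$n) and ungroup() variants above.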