Mean of variables by factors with missing values
I am working with a large dataset and need to calculate the mean of a variable by factor. A simplified example dataset is shown below.
+------+-----+------+--------+------+------+------+------+
| year | mon | site | region | rf   | avg1 | avg2 | avg3 |
+------+-----+------+--------+------+------+------+------+
| 2000 | jan | A    | high   | 28.2 |      |      |      |
| 2000 | feb | A    | high   | 26.6 |      |      |      |
| 2000 | mar | A    | high   | 30.3 |      |      |      |
| 2000 | apr | A    | high   | 33.2 |      |      |      |
| 2000 | may | A    | high   |      |      |      |      |
| 2000 | jun | A    | high   | 28.3 |      |      |      |
| 2000 | jul | A    | high   | 28.6 |      |      |      |
| 2000 | aug | A    | high   | 28.9 |      |      |      |
| 2000 | sep | A    | high   | 28.1 |      |      |      |
| 2000 | oct | A    | high   | 28.8 |      |      |      |
| 2000 | nov | A    | high   | 31.6 |      |      |      |
| 2000 | dec | A    | high   | 26.9 |      |      |      |
| 2001 | jan | A    | high   | 28.6 |      |      |      |
| 2001 | feb | A    | high   | 29.6 |      |      |      |
| 2002 | jan | B    | mid    | 21.4 |      |      |      |
| 2002 | feb | B    | mid    | 24.5 |      |      |      |
| 2002 | mar | B    | mid    | 24.2 |      |      |      |
+------+-----+------+--------+------+------+------+------+
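However, the main variable (rf) has some missing values, and I want to calculate the means (avg1, avg2, avg3) with the missing values removed. My dataset can be recreated with the following code.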
df <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2002L,
2002L, 2002L), mon = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L, 5L, 4L, 5L, 4L, 8L), .Label = c("apr",
"aug", "dec", "feb", "jan", "jul", "jun", "mar", "may", "nov",
"oct", "sep"), class = "factor"), site = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("high",
"mid"), class = "factor"), rf = c(28.2, 26.6, 30.3, 33.2, NA,
28.3, 28.6, 28.9, 28.1, 28.8, 31.6, 26.9, 28.6, 29.6, 21.4, 24.5,
24.2), avg_rf_site_allyears = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), avg_mon_rf_all_site = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), avg_rf_year_ele = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), .Names = c("year", "mon", "site",
"region", "rf", "avg_rf_site_allyears", "avg_mon_rf_all_site",
"avg_rf_year_ele"), class = "data.frame", row.names = c(NA, -17L
))
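avg1 is the mean rainfall for each site across all years (I have 15 years of monthly data).
avg2 is the mean rainfall for each month across all sites and all years.
avg3 is the mean annual rainfall for each region.
I am using the following code, but it does not work for sites that have missing values.
avg1
df$avg.1 <- with(df, ave(rf, site))  # mean rf by site across all years; gives NA for any site with even one missing value
avg2
df$avg2 <- with(df, ave(rf, mon))  # works in this example, but gives all NAs on my full dataset
It would be great if someone could also tell me the likely cause of this problem.
avg3: I need to calculate the mean by region and by year, but I could not find a way to do it.
Any help with the above would be much appreciated.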
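We can specify the FUN argument in ave. By default, i.e. when FUN is not specified, ave uses mean, and mean defaults to na.rm=FALSE, so any group that contains even one NA gets NA as its mean. That would also explain the all-NA result for avg2 on the full dataset if every month there has at least one missing value. FUN accepts any other function as well, such as min or max. Note that FUN has to be passed by name, because the unnamed arguments after the first are treated as grouping variables.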
df$avg.1 <- with(df, ave(rf, site,
FUN= function(x) mean(x, na.rm=TRUE)))
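To make the difference visible, here is a minimal sketch on a made-up vector (the values and the grouping labels are illustrative, not taken from the question's data). It compares the default call with the na.rm version and counts the missing values per group, which is one way to check whether every group in a larger dataset is affected.

```r
# Toy data (made up): group "a" has one NA, "b" has none, "c" has only NA.
rf  <- c(1, NA, 3, 10, 20, NA)
grp <- c("a", "a", "a", "b", "b", "c")

# Default FUN = mean: every group that contains an NA comes back as NA.
default_avg <- ave(rf, grp)

# na.rm = TRUE drops the NAs inside each group before averaging.
narm_avg <- ave(rf, grp, FUN = function(x) mean(x, na.rm = TRUE))

print(default_avg)  # NA for groups "a" and "c", 15 for group "b"
print(narm_avg)     # 2 for "a", 15 for "b", NaN for "c"

# Number of missing values per group, to see which groups are affected.
print(tapply(is.na(rf), grp, sum))
```

A group made up entirely of NA still has nothing to average, so it returns NaN rather than a number even with na.rm=TRUE.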
The same applies to 'avg.2'.
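Spelled out, the 'avg.2' call could look like the sketch below. The small data frame is made up so that the snippet runs on its own; only the column names mon and rf follow the question.

```r
# Toy data frame (made up); column names follow the question.
df <- data.frame(
  mon = c("jan", "feb", "jan", "feb", "mar"),
  rf  = c(28.2, 26.6, NA, 29.6, 24.2)
)

# Mean rf per month across all rows, ignoring missing values.
df$avg.2 <- with(df, ave(rf, mon,
                         FUN = function(x) mean(x, na.rm = TRUE)))
print(df)
```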
For the third case, group by both region and year:
df$avg.3 <- with(df, ave(rf, region, year,
FUN= function(x) mean(x, na.rm=TRUE)))
Or, if we use dplyr:
library(dplyr)
df %>%
group_by(site) %>%
mutate(avg.1 = mean(rf, na.rm=TRUE)) %>%
group_by(mon) %>%
mutate(avg.2 = mean(rf, na.rm=TRUE)) %>%
group_by(region, year) %>%
mutate(avg.3= mean(rf, na.rm=TRUE))
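The pipeline above only prints its result, and the result stays grouped by region and year. If the new columns should be kept, one option is to assign the result back and drop the grouping, as in this sketch. It assumes dplyr is installed and uses a made-up data frame with the question's column names.

```r
library(dplyr)

# Toy data frame (made up); column names follow the question.
df <- data.frame(
  year   = c(2000L, 2000L, 2001L, 2002L, 2002L),
  mon    = c("jan", "feb", "jan", "jan", "feb"),
  site   = c("A", "A", "A", "B", "B"),
  region = c("high", "high", "high", "mid", "mid"),
  rf     = c(28.2, NA, 28.6, 21.4, 24.5)
)

df <- df %>%
  group_by(site) %>%
  mutate(avg.1 = mean(rf, na.rm = TRUE)) %>%
  group_by(mon) %>%
  mutate(avg.2 = mean(rf, na.rm = TRUE)) %>%
  group_by(region, year) %>%
  mutate(avg.3 = mean(rf, na.rm = TRUE)) %>%
  ungroup() %>%        # drop the grouping left by the last group_by()
  as.data.frame()      # convert the tibble back to a plain data frame

print(df)
```

Each group_by() call replaces the previous grouping, so the three means are computed independently of one another.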