Weighted mean on multiple columns for all rows
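I want to compute a weighted mean over a huge data set.
What I need is the following, for every row. The data contain NAs, so I need to incorporate na.rm = TRUE somehow.
I want to compute the following (for distance1 through distance10):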
(distance1 * X1CityNumber + ... + distance10 * X10CityNumber) /
(X1CityNumber + ... + X10CityNumber)
I wrote the code below, but it produces wrong numbers.
for (i in 1:378742) {
rcffull$distance[i] <- weighted.mean(cbind(rcffull$distance1[i],
rcffull$distance2[i],
rcffull$distance3[i],
rcffull$distance4[i],
rcffull$distance5[i],
rcffull$distance6[i],
rcffull$distance7[i],
rcffull$distance8[i],
rcffull$distance9[i],
rcffull$distance10[i]),
cbind(rcffull$X1CityNumber[i],
rcffull$X2CityNumber[i],
rcffull$X3CityNumber[i],
rcffull$X4CityNumber[i],
rcffull$X5CityNumber[i],
rcffull$X6CityNumber[i],
rcffull$X7CityNumber[i],
rcffull$X8CityNumber[i],
rcffull$X9CityNumber[i],
rcffull$X10CityNumber[i]),
na.rm = TRUE)
}
Any suggestions?
Sample data with fewer columns:
distance1 Weights1 distance2 Weights2
1 5 3 8 2
2 NA 2 3 3
3 5 NA 4 4
#desired output:
Mean distance
1 6.2 #= (5 * 3 + 8 * 2) / (3 + 2)
2 3.0 #= (3 * 3) / 3
3 4.0 #= (4 * 4) / 4
NA occurs in both the weights and the distances. When computing (d1 * w1 + d2 * w2) / (w1 + w2), any pair containing an NA must be dropped from both the numerator and the denominator, so the normalization of the weights has to account for this.
dat <- structure(list(distance1 = c(5L, NA, 5L), Weights1 = c(3L, 2L, NA),
distance2 = c(8L, 3L, 4L), Weights2 = c(2L, 3L, 4L)), .Names = c("distance1",
"Weights1", "distance2", "Weights2"), class = "data.frame", row.names = c("1",
"2", "3"))
A <- as.matrix(dat[c(1, 3)]) ## distance columns
B <- as.matrix(dat[c(2, 4)]) ## weight columns
B[is.na(A)] <- 0
rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
# 1 2 3
#6.2 3.0 4.0
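The same idea could be wrapped in a small helper and pointed at the column pairs from the question. This is only a sketch: the name row_weighted_mean and the toy data frame are hypothetical, and the column names are assumed to follow the distance1..distance10 / X1CityNumber..X10CityNumber pattern shown in the question.

```r
# Hypothetical helper: row-wise weighted mean that ignores any
# (value, weight) pair in which either member is NA.
row_weighted_mean <- function(values, weights) {
  A <- as.matrix(values)
  B <- as.matrix(weights)
  stopifnot(identical(dim(A), dim(B)))
  B[is.na(A)] <- 0  # a missing value must not contribute its weight
  rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
}

# Toy stand-in for the question's data, with only 2 of the 10 column pairs.
toy <- data.frame(distance1 = c(5, NA, 5), X1CityNumber = c(3, 2, NA),
                  distance2 = c(8, 3, 4),  X2CityNumber = c(2, 3, 4))

k <- 2  # would be 10 for the full data set
dist_cols   <- paste0("distance", seq_len(k))
weight_cols <- paste0("X", seq_len(k), "CityNumber")

toy$distance <- row_weighted_mean(toy[dist_cols], toy[weight_cols])
```

Selecting columns by name keeps each distance column paired with its weight column even if the data frame's column order changes.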
Remark 1:
If neither the data nor the weights contain NA, simply use
rowSums(A * B) / rowSums(B)
Remark 2:
Another way to handle NA: set every entry that is NA in either the data or the weights to 0 in both matrices, then use rowSums without na.rm:
ind <- is.na(A) | is.na(B)
A[ind] <- 0
B[ind] <- 0
rowSums(A * B) / rowSums(B)
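As a quick sanity check, the two NA-handling variants can be compared on the sample data. The sketch below rebuilds dat with data.frame() for brevity; the names B1, A2, B2, res1 and res2 are introduced here only for illustration.

```r
dat <- data.frame(distance1 = c(5, NA, 5), Weights1 = c(3, 2, NA),
                  distance2 = c(8, 3, 4),  Weights2 = c(2, 3, 4))
A <- as.matrix(dat[c(1, 3)])  # distance columns
B <- as.matrix(dat[c(2, 4)])  # weight columns

# Variant 1: zero the weights of missing values, then drop NA in the sums.
B1 <- B
B1[is.na(A)] <- 0
res1 <- rowSums(A * B1, na.rm = TRUE) / rowSums(B1, na.rm = TRUE)

# Variant 2: zero every incomplete pair in both matrices; no na.rm needed.
ind <- is.na(A) | is.na(B)
A2 <- A
B2 <- B
A2[ind] <- 0
B2[ind] <- 0
res2 <- rowSums(A2 * B2) / rowSums(B2)

same <- isTRUE(all.equal(res1, res2))
```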
Remark 3:
NaN can arise from 0 / 0 when a row has no pair of non-NA data and non-NA weight.
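A small illustration of that case, using made-up matrices in which the second row has no complete pair. Converting the resulting NaN to NA at the end is an optional choice, not something the approach requires.

```r
# Row 2 has a missing value in one pair and a missing weight in the other.
A <- matrix(c(5, NA, 8, 3), nrow = 2)  # values, filled column by column
B <- matrix(c(3, 2, 2, NA), nrow = 2)  # weights

B[is.na(A)] <- 0
res <- rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)

has_nan <- is.nan(res[2])  # both sums are 0 for row 2, so 0 / 0 gives NaN

# Optionally report such rows as NA instead of NaN.
res[is.nan(res)] <- NA
```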
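Remark 4:
weighted.mean can only remove NA from the data, not from the weights. It is also a poor fit here because you want the computation for every row: weighted.mean offers no "vectorized" form, so you would have to run a slow R-level loop.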
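To make this concrete, the sketch below shows weighted.mean returning NA when a weight is missing, followed by the kind of row-by-row loop it would take to work around that. The variable names are illustrative only, and the sample data is rebuilt with data.frame() for brevity.

```r
x <- c(5, 4)
w <- c(NA, 4)
na_result <- weighted.mean(x, w, na.rm = TRUE)  # NA: only missing x is dropped

# Dropping incomplete pairs by hand gives the intended value.
ok <- !is.na(x) & !is.na(w)
clean_result <- weighted.mean(x[ok], w[ok])

# The same fix applied row by row: an R-level loop over all rows.
dat <- data.frame(distance1 = c(5, NA, 5), Weights1 = c(3, 2, NA),
                  distance2 = c(8, 3, 4),  Weights2 = c(2, 3, 4))
A <- as.matrix(dat[c(1, 3)])
B <- as.matrix(dat[c(2, 4)])
loop_res <- vapply(seq_len(nrow(A)), function(i) {
  keep <- !is.na(A[i, ]) & !is.na(B[i, ])
  weighted.mean(A[i, keep], B[i, keep])
}, numeric(1))
```

The loop gives the same numbers as the rowSums approach, but it calls weighted.mean once per row, which is what makes it slow on hundreds of thousands of rows.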