R中的大数据线性插值
Bigdata linear interpolation in R
我有一个包含一些 NA 值的大数据集。示例数据如下。
Data <- data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, NA, 0.6, 0.9, NA, NA,0.4, 0.8,NA,0.4))
我想用线性插值填充那些 NA
。
例如,为 col_1 = ‘A’ 和 col_2 = ‘C’
填写 NA
value = 0.9 + (0.6-0.9)*(15-10)/(20-10) = 0.75
第二个 NA
col_1 = ‘A’ 和 col_2 = ‘D’
value = 0.9 + (0.4-0.9)*(25-10)/(30-10) = 0.53
我的数据很大,有什么有效的方法吗?谢谢你。
预期的结果是。
Data_Updated <- data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, 0.75, 0.6, 0.9, 0.65, 0.53,0.4, 0.8,0.6,0.4))
试试这是否足够快:
library(data.table)
library(zoo)
setDT(Data)
Data[, value1 := na.approx(value, x = col_3), by = .(col_1, col_2)]
# col_1 col_2 col_3 value value1
# 1: A C 10 0.9 0.900
# 2: A C 15 NA 0.750
# 3: A C 20 0.6 0.600
# 4: A D 10 0.9 0.900
# 5: A D 20 NA 0.650
# 6: A D 25 NA 0.525
# 7: A D 30 0.4 0.400
# 8: B E 5 0.8 0.800
# 9: B E 10 NA 0.600
#10: B E 15 0.4 0.400
选项dplyr
library(dplyr)
library(zoo)
Data %>%
group_by(col_1, col_2) %>%
mutate(value1 = na.approx(value, x = col_3))
我有一个包含一些 NA 值的大数据集。示例数据如下。
Data <- data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, NA, 0.6, 0.9, NA, NA,0.4, 0.8,NA,0.4))
我想用线性插值填充那些 NA
。
例如,为 col_1 = ‘A’ 和 col_2 = ‘C’
NA
value = 0.9 + (0.6-0.9)*(15-10)/(20-10) = 0.75
第二个 NA
col_1 = ‘A’ 和 col_2 = ‘D’
value = 0.9 + (0.4-0.9)*(25-10)/(30-10) = 0.53
我的数据很大,有什么有效的方法吗?谢谢你。 预期的结果是。
Data_Updated <- data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, 0.75, 0.6, 0.9, 0.65, 0.53,0.4, 0.8,0.6,0.4))
试试这是否足够快:
library(data.table)
library(zoo)
setDT(Data)
Data[, value1 := na.approx(value, x = col_3), by = .(col_1, col_2)]
# col_1 col_2 col_3 value value1
# 1: A C 10 0.9 0.900
# 2: A C 15 NA 0.750
# 3: A C 20 0.6 0.600
# 4: A D 10 0.9 0.900
# 5: A D 20 NA 0.650
# 6: A D 25 NA 0.525
# 7: A D 30 0.4 0.400
# 8: B E 5 0.8 0.800
# 9: B E 10 NA 0.600
#10: B E 15 0.4 0.400
选项dplyr
library(dplyr)
library(zoo)
Data %>%
group_by(col_1, col_2) %>%
mutate(value1 = na.approx(value, x = col_3))