Interpolate row-wise values of 0 between two columns with values >0 in R
I am trying to interpolate 0 values that lie between two non-zero values across the columns 2018 to 2021 of a data.table in R. This is what the sample data df1 looks like:
ID string1 2018 2019 2020 2021 string2
1: a1 x2 3 3 0 4 si
2: a2 g3 5 5 4 0 q2
3: a3 n2 11 0 0 3 oq
4: a4 m3 3 0 9 8 mx
5: a5 2w 9 1 6 5 ix
6: a6 ps2 2 4 7 4 p2
7: a7 kg2 6 0 9 6 2q
For easy reproduction:
library(data.table)

df1 = data.table(
  ID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
  "string1" = c("x2", "g3", "n2", "m3", "2w", "ps2", "kg2"),
  "2018" = c(3,5,11,3,9,2,6),
  "2019" = c(3,5,0,0,1,4,0),
  "2020" = c(0,4,0,9,6,7,9),
  "2021" = c(4,0,3,8,5,4,6),
  "string2" = c("si", "q2", "oq", "mx", "ix", "p2", "2q"))
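In df1, there are cases where a zero sits between two numbers > 0 (e.g. row 1/column 2020, row 4/column 2019, or row 7/column 2019). I am trying to identify these cases and interpolate them from the adjacent columns (e.g. row 1/column 2020: (3 + 4) / 2 = 3.5).
Is there a way to do this? So far I have only found ways to replace all zero values, without the condition that the zero lies between two numbers > 0.
I am trying to get output like this: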
ID string1 2018 2019 2020 2021 string2
1: a1 x2 3 3.0 3.5 4 si
2: a2 g3 5 5.0 4.0 0 q2
3: a3 n2 11 0.0 0.0 3 oq
4: a4 m3 3 6.0 9.0 8 mx
5: a5 2w 9 1.0 6.0 5 ix
6: a6 ps2 2 4.0 7.0 4 p2
7: a7 kg2 6 7.5 9.0 6 2q
Thanks a lot!
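A function that interpolates zeros lying between two positive elements:
f <- function(vec){
  # Neighbours on each side; fill = 0 means the first and last values never qualify
  prev_val <- shift(vec, 1, fill = 0)
  next_val <- shift(vec, -1, fill = 0)
  fifelse(prev_val > 0 & next_val > 0 & vec == 0, (prev_val + next_val) / 2, vec)
}
Apply the function to every row of the year columns:
year_cols <- names(df1)[grep("^[0-9]+$", names(df1))]
df1[, (year_cols) := transpose(lapply(transpose(.SD), f)), .SDcols = year_cols]
transpose is used because you want to change values row-wise. The second transpose call returns the result to column format.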
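To make the transpose round trip concrete, here is a minimal sketch on a two-row toy table. The table `dt` and the helper name `fill_gap` are made up for illustration; the rule is the same as in f above, written with `type = "lag"` / `type = "lead"` instead of a negative shift.

```r
library(data.table)

# Toy table (hypothetical data): rows are (3, 0, 4) and (6, 0, 0)
dt <- data.table(`2018` = c(3, 6), `2019` = c(0, 0), `2020` = c(4, 0))

# A zero with a positive value on both sides becomes the mean of its neighbours
fill_gap <- function(vec) {
  prev_val <- shift(vec, 1, type = "lag", fill = 0)
  next_val <- shift(vec, 1, type = "lead", fill = 0)
  fifelse(vec == 0 & prev_val > 0 & next_val > 0, (prev_val + next_val) / 2, vec)
}

# First transpose: each original row becomes a column (V1, V2)
rows_as_cols <- transpose(dt)

# lapply() now visits one original row at a time
filled <- lapply(rows_as_cols, fill_gap)

# Second transpose: back to one list element per year column
back <- transpose(filled)
```

Only the zero in the first row is replaced (by 3.5); the zeros in the second row stay, because neither has a positive value on both sides.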
Using data.table functions (on the original df1), this (somewhat cumbersome) code should work:
for (i in c(2019,2020)){
x = which(colnames(df1) == i)
df1[,x] <- ifelse(c(df1[,.SD,.SDcols = x] == 0 & df1[,.SD,.SDcols = c(x-1)] > 0 & df1[,.SD,.SDcols = c(x+1)] > 0),
rowMeans(df1[,.SD,.SDcols = c(x-1,x+1)]), unlist(df1[,.SD,.SDcols = x]))
}
> df1
ID string1 2018 2019 2020 2021 string2
1: a1 x2 3 3.0 3.5 4 si
2: a2 g3 5 5.0 4.0 0 q2
3: a3 n2 11 0.0 0.0 3 oq
4: a4 m3 3 6.0 9.0 8 mx
5: a5 2w 9 1.0 6.0 5 ix
6: a6 ps2 2 4.0 7.0 4 p2
7: a7 kg2 6 7.5 9.0 6 2q
Here is a base R solution (using data.frame instead of data.table to create the data):
for (i in c("X2019","X2020")){
x = which(colnames(df1) == i)
df1[,x] <- ifelse(df1[,x] == 0 & df1[,x-1] > 0 & df1[,x+1] > 0, rowMeans(df1[,c(x-1,x+1)]), df1[,x])
}
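The X2019 and X2020 names in the loop above appear because data.frame() makes column names syntactically valid by default. As a hedged variant, the sketch below keeps the original names with check.names = FALSE and loops over every interior year column instead of hard-coding two of them. It assumes the year columns sit next to each other; the three-row data frame `df` is an illustrative subset of the question's data.

```r
# Illustrative subset of the question's data, kept as a plain data.frame
df <- data.frame(ID = c("a1", "a3", "a7"),
                 "2018" = c(3, 11, 6),
                 "2019" = c(3, 0, 0),
                 "2020" = c(0, 0, 9),
                 "2021" = c(4, 3, 6),
                 check.names = FALSE)  # keep "2018" instead of "X2018"

# Positions of the year columns (assumed to be adjacent)
year_idx <- grep("^[0-9]+$", names(df))

# Only interior year columns have a neighbour on both sides
for (x in year_idx[-c(1, length(year_idx))]) {
  gap <- df[[x]] == 0 & df[[x - 1]] > 0 & df[[x + 1]] > 0
  df[[x]][gap] <- (df[[x - 1]][gap] + df[[x + 1]][gap]) / 2
}
```

Row a3 is left untouched here, since each of its zeros has another zero as a neighbour.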
Maybe this is a bit overkill, but here is a solution that reshapes twice:
melt(df1, measure.vars = patterns("^[0-9]+$")
)[,value := fifelse(value == 0 &
shift(value, type = "lag", fill = 0) > 0 &
shift(value, type = "lead", fill = 0) > 0,
(shift(value, type = "lag") + shift(value, type = "lead")) / 2,
value), by = ID
][, dcast(.SD, ...~variable) ]
# ID string1 string2 2018 2019 2020 2021
# 1: a1 x2 si 3 3.0 3.5 4
# 2: a2 g3 q2 5 5.0 4.0 0
# 3: a3 n2 oq 11 0.0 0.0 3
# 4: a4 m3 mx 3 6.0 9.0 8
# 5: a5 2w ix 9 1.0 6.0 5
# 6: a6 ps2 p2 2 4.0 7.0 4
# 7: a7 kg2 2q 6 7.5 9.0 6
Edit: to fill in all the NAs, we can use zoo::na.approx or zoo::na.spline:
cols <- grep("^[0-9]+$", names(df1), value = TRUE)
df1[, (cols) := transpose(lapply(transpose(.SD), function(i) zoo::na.approx(
ifelse(i == 0, NA, i), na.rm = FALSE))),
.SDcols = cols ]
# Using na.approx, notice 2nd row for 2021 is NA.
# ID string1 2018 2019 2020 2021 string2
# 1: a1 x2 3 3.000000 3.500000 4 si
# 2: a2 g3 5 5.000000 4.000000 NA q2
# 3: a3 n2 11 8.333333 5.666667 3 oq
# 4: a4 m3 3 6.000000 9.000000 8 mx
# 5: a5 2w 9 1.000000 6.000000 5 ix
# 6: a6 ps2 2 4.000000 7.000000 4 p2
# 7: a7 kg2 6 7.500000 9.000000 6 2q
# Using na.spline
# ID string1 2018 2019 2020 2021 string2
# 1: a1 x2 3 3.000000 3.333333 4 si
# 2: a2 g3 5 5.000000 4.000000 2 q2
# 3: a3 n2 11 8.333333 5.666667 3 oq
# 4: a4 m3 3 7.333333 9.000000 8 mx
# 5: a5 2w 9 1.000000 6.000000 5 ix
# 6: a6 ps2 2 4.000000 7.000000 4 p2
# 7: a7 kg2 6 9.000000 9.000000 6 2q
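To see where the na.approx numbers above come from, the linear interpolation can be reproduced for single rows with base R's approx(). This is only an illustrative sketch, not how zoo must be called; the helper name `interp_gaps` is made up here.

```r
# Linearly interpolate NAs from the known points, using positions 1..n as x
interp_gaps <- function(vals) {
  known <- which(!is.na(vals))
  # approx() returns NA outside the range of known points (default rule = 1)
  approx(x = known, y = vals[known], xout = seq_along(vals))$y
}

# Row a3 with zeros treated as missing: 11, NA, NA, 3
row_a3 <- interp_gaps(c(11, NA, NA, 3))

# Row a2: the missing value is at the end, so there is nothing to interpolate towards
row_a2 <- interp_gaps(c(5, 5, 4, NA))
```

For row a3 the two gaps are filled in equal steps of (3 - 11) / 3 from 11, giving 8.333333 and 5.666667. For row a2 the trailing value stays NA, matching the na.approx output with na.rm = FALSE.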