R - how to fill NAs between two corresponding IDs in a data frame
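I am trying to take the following dataset and convert it into the second dataset. Basically, I want to fill the NAs between the two occurrences of each ID with that ID.
Each ID corresponds to two timestamps, which I have joined onto a larger date_time column. For background, the alternatives I tried were a SQL BETWEEN join (the date_time column is very large) and taking the original dataset, generating the timestamps between each ID's pair, and then joining that (I have too many IDs for this). Both methods work, but with the amount of data I have they take too much time. I am hoping to manipulate the data in this form instead. It seems like it should be simple, but it really has me stumped. Any help would be greatly appreciated.
Current dataset: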
date_time id
<dttm> <dbl>
1 2017-01-30 08:00:00 NA
2 2017-01-30 08:00:01 NA
3 2017-01-30 08:00:02 1
4 2017-01-30 08:00:03 NA
5 2017-01-30 08:00:04 NA
6 2017-01-30 08:00:05 NA
7 2017-01-30 08:00:06 NA
8 2017-01-30 08:00:07 1
9 2017-01-30 08:00:08 NA
10 2017-01-30 08:00:09 NA
11 2017-01-30 08:00:10 2
12 2017-01-30 08:00:11 NA
13 2017-01-30 08:00:12 NA
14 2017-01-30 08:00:13 NA
15 2017-01-30 08:00:14 2
16 2017-01-30 08:00:15 NA
17 2017-01-30 08:00:16 3
18 2017-01-30 08:00:17 NA
19 2017-01-30 08:00:18 3
20 2017-01-30 08:00:19 NA
Desired dataset:
date_time id
<dttm> <dbl>
1 2017-01-30 08:00:00 NA
2 2017-01-30 08:00:01 NA
3 2017-01-30 08:00:02 1
4 2017-01-30 08:00:03 1
5 2017-01-30 08:00:04 1
6 2017-01-30 08:00:05 1
7 2017-01-30 08:00:06 1
8 2017-01-30 08:00:07 1
9 2017-01-30 08:00:08 NA
10 2017-01-30 08:00:09 NA
11 2017-01-30 08:00:10 2
12 2017-01-30 08:00:11 2
13 2017-01-30 08:00:12 2
14 2017-01-30 08:00:13 2
15 2017-01-30 08:00:14 2
16 2017-01-30 08:00:15 NA
17 2017-01-30 08:00:16 3
18 2017-01-30 08:00:17 3
19 2017-01-30 08:00:18 3
20 2017-01-30 08:00:19 NA
dput() data:
structure(list(date_time = structure(c(1485781200, 1485781201,
1485781202, 1485781203, 1485781204, 1485781205, 1485781206, 1485781207,
1485781208, 1485781209, 1485781210, 1485781211, 1485781212, 1485781213,
1485781214, 1485781215, 1485781216, 1485781217, 1485781218, 1485781219
), class = c("POSIXct", "POSIXt"), tzone = ""), trx_id = c(NA_real_,
NA_real_, 1, NA_real_, NA_real_, NA_real_, NA_real_, 1,
NA_real_, NA_real_, 2, NA_real_, NA_real_, NA_real_, 2,
NA_real_, 3, NA_real_, 3, NA_real_)), .Names = c("date_time",
"trx_id"), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
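One solution is to use the fill function from tidyr. The approach is simple. First create two helper columns, prev_val and next_val, that hold the previous and next values. Use fill to fill the missing values in both columns, downward for prev_val and upward for next_val.
Then, for rows where prev_val and next_val hold the same value, update the value to prev_val (those missing values lie between two occurrences of the same number).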
library(dplyr)
library(tidyr)

df <- read.table(text = "sl date_time value
1 '2017-01-30 08:00:00' NA
2 '2017-01-30 08:00:01' NA
3 '2017-01-30 08:00:02' 1
4 '2017-01-30 08:00:03' NA
5 '2017-01-30 08:00:04' NA
6 '2017-01-30 08:00:05' NA
7 '2017-01-30 08:00:06' NA
8 '2017-01-30 08:00:07' 1
9 '2017-01-30 08:00:08' NA
10 '2017-01-30 08:00:09' NA
11 '2017-01-30 08:00:10' 2
12 '2017-01-30 08:00:11' NA
13 '2017-01-30 08:00:12' NA
14 '2017-01-30 08:00:13' NA
15 '2017-01-30 08:00:14' 2
16 '2017-01-30 08:00:15' NA
17 '2017-01-30 08:00:16' 3
18 '2017-01-30 08:00:17' NA
19 '2017-01-30 08:00:18' 3
20 '2017-01-30 08:00:19' NA", header = TRUE, stringsAsFactors = FALSE)

# Copy value into two helper columns, fill one downward and one upward,
# then keep the filled value only where both directions agree
df %>%
  mutate(prev_val = value, next_val = value) %>%
  fill(prev_val, .direction = "down") %>%
  fill(next_val, .direction = "up") %>%
  mutate(value = ifelse(prev_val == next_val, prev_val, value)) %>%
  select(-prev_val, -next_val)
Result:
   sl date_time value
1 1 2017-01-30 08:00:00 NA
2 2 2017-01-30 08:00:01 NA
3 3 2017-01-30 08:00:02 1
4 4 2017-01-30 08:00:03 1
5 5 2017-01-30 08:00:04 1
6 6 2017-01-30 08:00:05 1
7 7 2017-01-30 08:00:06 1
8 8 2017-01-30 08:00:07 1
9 9 2017-01-30 08:00:08 NA
10 10 2017-01-30 08:00:09 NA
11 11 2017-01-30 08:00:10 2
12 12 2017-01-30 08:00:11 2
13 13 2017-01-30 08:00:12 2
14 14 2017-01-30 08:00:13 2
15 15 2017-01-30 08:00:14 2
16 16 2017-01-30 08:00:15 NA
17 17 2017-01-30 08:00:16 3
18 18 2017-01-30 08:00:17 3
19 19 2017-01-30 08:00:18 3
20 20 2017-01-30 08:00:19 NA
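To see why comparing the two helper columns works, here is a minimal sketch that keeps prev_val and next_val visible instead of dropping them. The toy data frame and the names toy, helpers and filled are made up for illustration; only mutate, fill and pull from dplyr/tidyr are used.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two marker pairs (1..1 and 2..2) with a gap between them
toy <- data.frame(value = c(NA, 1, NA, 1, NA, 2, NA, NA, 2, NA))

helpers <- toy %>%
  mutate(prev_val = value, next_val = value) %>%
  fill(prev_val, .direction = "down") %>%  # last non-NA value seen so far
  fill(next_val, .direction = "up")        # next non-NA value still to come

print(helpers)

# Keep the fill only where both directions agree; NA comparisons stay NA
filled <- helpers %>%
  mutate(value = ifelse(prev_val == next_val, prev_val, value)) %>%
  pull(value)

filled
```

In the gap between the two pairs, prev_val is 1 and next_val is 2, so the row stays NA. One caveat worth checking on real data: if two consecutive pairs happened to carry the same ID, the gap between them would be filled as well, since both helper columns would agree there.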
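Here is a base R option. We split the sequence of row numbers by 'trx_id' (using the data the OP showed as input), build the full sequence between each pair of positions (seq), stack the result into a two-column dataset, and then, using the 'values' column of 'd1' as the index, assign the 'ind' column of 'd1' to 'trx_id'.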
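# df1 is the data frame from the question's dput() output
d1 <- stack(lapply(split(seq_len(nrow(df1)), df1$trx_id), function(x) seq(x[1], x[2])))
# 'ind' is a factor, so convert its labels back to numbers before assigning
df1$trx_id[d1$values] <- as.numeric(as.character(d1$ind))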
df1$trx_id
#[1] NA NA 1 1 1 1 1 1 NA NA 2 2 2 2 2 NA 3 3 3 NA
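To make the intermediate objects visible, here is a minimal sketch of the same split/seq/stack steps on a small made-up vector. The IDs 10 and 25 and the names ids, pos and runs are hypothetical. The sketch converts the factor column 'ind' back to numbers explicitly, which matters whenever the IDs are not simply 1, 2, 3, because assigning a factor into a numeric vector stores its integer codes rather than its labels.

```r
# Hypothetical toy ID vector: each ID marks the start and the end of a run
ids <- c(NA, 10, NA, NA, 10, NA, 25, NA, 25)

# Row positions grouped by ID (rows with NA are dropped by split)
pos <- split(seq_along(ids), ids)

# Expand each start/end pair into the full run of row positions
runs <- lapply(pos, function(p) seq(p[1], p[2]))

# stack() gives a data frame: 'values' = row position, 'ind' = ID as a factor
d1 <- stack(runs)

# Convert the factor labels back to numbers before assigning
ids[d1$values] <- as.numeric(as.character(d1$ind))
ids
```

This assumes every ID occurs exactly twice and that runs do not overlap; with a third occurrence, only the first two positions of that ID would be used.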
A non-tidyr approach, where x is your ID column:
x <- c(NA, NA, 1, NA, NA, 1, NA, NA, 2, NA, NA, 2, NA, 3, NA, NA, 3, NA)
# Positions of the non-NA markers, arranged as one (start, end) row per ID
timestamps <- which(!is.na(x))
timestamps <- matrix(timestamps, ncol = 2, byrow = TRUE)
# For each pair, return a copy of x with that pair's range filled in
xmatrix <- apply(timestamps, MARGIN = 1, FUN = function(i) {
  y <- x[i[1]:i[2]]
  y[is.na(y)] <- x[i][1]
  x[i[1]:i[2]] <- y
  return(x)
})
# Merge the copies: take the maximum non-NA value at each position (NA if none)
(x <- apply(xmatrix, 1, FUN = function(z) {
  ifelse(all(is.na(z)), NA, max(z, na.rm = TRUE))
}))
## [1] NA NA 1 1 1 1 NA NA 2 2 2 2 NA 3 3 3 3 NA
HTH
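Since the question mentions run time on a large dataset, here is a loop-free sketch in base R. It is not taken from any of the answers above; it is an alternative based on cumsum, and it assumes that the markers strictly alternate between opening and closing a run (each ID appears exactly twice, and runs do not overlap). The names x, seen, inside, last_val and res are made up for illustration.

```r
# Hypothetical ID vector shaped like the question's data
x <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, 2, NA, NA, NA, 2, NA, 3, NA, 3, NA)

# Number of markers seen up to and including each row
seen <- cumsum(!is.na(x))

# An odd count means a run has been opened but not yet closed
inside <- seen %% 2 == 1

# Last observed ID at each row (NA before the first marker)
last_val <- c(NA, x[!is.na(x)])[seen + 1]

# Keep the carried-forward ID inside runs and on the markers themselves
res <- ifelse(inside | !is.na(x), last_val, NA)
res
```

Everything here is a single pass over the vector, so it may scale better than per-ID loops, but it is worth verifying the alternation assumption first, for example with all(table(x) == 2).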