延长一定长度的运行
Extend runs of certain length
我有一个 640 x 2500 的数据框,其中包含数值和几个 NA
值。我的目标是在每一行中找到至少 75 个连续的 NA
值。对于每个这样的 运行,我也想用 NA
值替换之前的 和 下面的 50 个单元格。
这是一行的缩小示例:
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA: ^ ^ ^ ^
我想检测连续四个NA
的运行,然后将运行前三个值和后三个值替换为NA
:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
# ^ ^ ^ ^ ^ ^
我试图先用rle
识别连续的NA
,但是运行宁rle(is.na(df))
给出了错误'x' must be a vector of an atomic type
。即使我 select 单行也会发生这种情况。
不幸的是,我不知道接下来要采取什么步骤将前后 50 个单元格转换为 NA。
在此先感谢您对此提供的任何帮助。
类型强制对我有用:
rle(as.logical(is.na(x[MyRow, ])))
这是我的解决方案。不过,我想知道是否有比我更简洁的解决方案。
library(data.table)
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
as.data.table() # dataset created
# A function to do what you need
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths == 4)]))] # Locate the position of those NAs that are repeated exactly 4 times
if(length(NAs == 4)){ # Check if there are a stretch of 4 WAs
Vector[seq(NAs[1]-3,
NAs[1]-1,1)] <- NA # this part deals with the 3 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+3,1)] <- NA # this part deals with the 3 positions occuring after the last NA
}
Vector
}
> df # the original dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 5 4 3 NA NA NA NA 6 9 3 2 4 3
# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
顺便说一句,对于大小为 640*2500 的假设数据帧,速度非常好,其中必须定位 75 个或更多 NA,并且必须用 NA 替换前后的 50 个值。
df <- matrix(nrow = 640,ncol = 2500)
for(i in 1:nrow(df)){
df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths >= 75)]))] # Locate the position of those NAs that are repeated exactly 75 times or more than 75 times
if(length(NAs >= 75)){ # Check if the condition is met
Vector[seq(NAs[1]-50,
NAs[1]-1,1)] <- NA # this part deals with the 50 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+50,1)] <- NA # this part deals with the 50 positions occuring after the last NA
}
Vector
}
# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
as_tibble() %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 75
df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 175
system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose())
user system elapsed
0.216 0.002 0.220
因为您在数据中评论说“一些[行]以几个NA
s开头和结尾”,希望这能更好地代表真实数据:
A B C D E F G H I J
1 1 2 3 NA NA 6 7 8 NA 10
2 1 NA NA NA 5 6 7 NA NA NA
3 1 2 3 4 NA NA NA 8 9 10
假设用NA
扩展的NA
的最小运行长度为2,并且运行之前和之后的两个值应该替换为 NA
。在此示例中,第 2 行代表您在评论中提到的案例。
首先是一些数据争论。我更喜欢使用 long 格式的 data.table
。使用 data.table
我们可以访问有用的常量 .I
和 .N
,并且可以使用 rleid
轻松创建 运行 ID。
# convert data.frame to data.table
library(data.table)
setDT(d)
# set minimum length of runs to be expanded
len = 2L
# set number of values to replace on each side of run
n = 2L
# number of columns of original data (for truncation of indices)
nc = ncol(d)
# create a row index to keep track of the original rows in the long format
d[ , ri := 1:.N]
# melt from wide to long format
d2 = melt(d, id.vars = "ri")
# order by row index
setorder(d2, ri)
现在 运行 及其索引的实际计算:
d2[
# check if the run is an "NA run" and has sufficient length
d2[ , if(anyNA(value) & .N >= len){
# get indices before and after run, where values should be changed to NA
ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
# truncate indices to keep them within (original) rows
ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
# perform the calculation by row index and run index
# grab replacement indices
by = .(ri, rleid(is.na(value)))]$V1,
# at replacement indices, set value to NA
value := NA]
如果需要,转换回宽格式
dcast(d2, ri ~ variable, value.vars = "value")
# ri A B C D E F G H I J
# 1: 1 1 NA NA NA NA NA NA 8 NA 10
# 2: 2 NA NA NA NA NA NA NA NA NA NA
# 3: 3 1 2 NA NA NA NA NA NA NA 10
我有一个 640 x 2500 的数据框,其中包含数值和几个 NA
值。我的目标是在每一行中找到至少 75 个连续的 NA
值。对于每个这样的 运行,我也想用 NA
值替换之前的 和 下面的 50 个单元格。
这是一行的缩小示例:
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA: ^ ^ ^ ^
我想检测连续四个NA
的运行,然后将运行前三个值和后三个值替换为NA
:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
# ^ ^ ^ ^ ^ ^
我试图先用rle
识别连续的NA
,但是运行宁rle(is.na(df))
给出了错误'x' must be a vector of an atomic type
。即使我 select 单行也会发生这种情况。
不幸的是,我不知道接下来要采取什么步骤将前后 50 个单元格转换为 NA。
在此先感谢您对此提供的任何帮助。
类型强制对我有用:
rle(as.logical(is.na(x[MyRow, ])))
这是我的解决方案。不过,我想知道是否有比我更简洁的解决方案。
library(data.table)
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
as.data.table() # dataset created
# A function to do what you need
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths == 4)]))] # Locate the position of those NAs that are repeated exactly 4 times
if(length(NAs == 4)){ # Check if there are a stretch of 4 WAs
Vector[seq(NAs[1]-3,
NAs[1]-1,1)] <- NA # this part deals with the 3 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+3,1)] <- NA # this part deals with the 3 positions occuring after the last NA
}
Vector
}
> df # the original dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 5 4 3 NA NA NA NA 6 9 3 2 4 3
# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
顺便说一句,对于大小为 640*2500 的假设数据帧,速度非常好,其中必须定位 75 个或更多 NA,并且必须用 NA 替换前后的 50 个值。
df <- matrix(nrow = 640,ncol = 2500)
for(i in 1:nrow(df)){
df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths >= 75)]))] # Locate the position of those NAs that are repeated exactly 75 times or more than 75 times
if(length(NAs >= 75)){ # Check if the condition is met
Vector[seq(NAs[1]-50,
NAs[1]-1,1)] <- NA # this part deals with the 50 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+50,1)] <- NA # this part deals with the 50 positions occuring after the last NA
}
Vector
}
# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
as_tibble() %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 75
df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 175
system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose())
user system elapsed
0.216 0.002 0.220
因为您在数据中评论说“一些[行]以几个NA
s开头和结尾”,希望这能更好地代表真实数据:
A B C D E F G H I J
1 1 2 3 NA NA 6 7 8 NA 10
2 1 NA NA NA 5 6 7 NA NA NA
3 1 2 3 4 NA NA NA 8 9 10
假设用NA
扩展的NA
的最小运行长度为2,并且运行之前和之后的两个值应该替换为 NA
。在此示例中,第 2 行代表您在评论中提到的案例。
首先是一些数据争论。我更喜欢使用 long 格式的 data.table
。使用 data.table
我们可以访问有用的常量 .I
和 .N
,并且可以使用 rleid
轻松创建 运行 ID。
# convert data.frame to data.table
library(data.table)
setDT(d)
# set minimum length of runs to be expanded
len = 2L
# set number of values to replace on each side of run
n = 2L
# number of columns of original data (for truncation of indices)
nc = ncol(d)
# create a row index to keep track of the original rows in the long format
d[ , ri := 1:.N]
# melt from wide to long format
d2 = melt(d, id.vars = "ri")
# order by row index
setorder(d2, ri)
现在 运行 及其索引的实际计算:
d2[
# check if the run is an "NA run" and has sufficient length
d2[ , if(anyNA(value) & .N >= len){
# get indices before and after run, where values should be changed to NA
ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
# truncate indices to keep them within (original) rows
ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
# perform the calculation by row index and run index
# grab replacement indices
by = .(ri, rleid(is.na(value)))]$V1,
# at replacement indices, set value to NA
value := NA]
如果需要,转换回宽格式
dcast(d2, ri ~ variable, value.vars = "value")
# ri A B C D E F G H I J
# 1: 1 1 NA NA NA NA NA NA 8 NA 10
# 2: 2 NA NA NA NA NA NA NA NA NA NA
# 3: 3 1 2 NA NA NA NA NA NA NA 10