R data.frame manipulation: convert to NA after specific column
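I have a big data.frame and I need some row-based conversions. My aim is to convert every value in a row to NA once a specific character has appeared in one of the columns.
For example, here is a small sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
Taking sample_df as the input, I want to convert all values that come after the first "I" in each row to NA, which gives result_df.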
I tried base, dplyr and purrr but could not come up with an algorithm.
Thanks for your help.
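Try this:
Find the "I" values:
I_true <- sample_df == "I"
I_true
         a     b     c     d
[1,] FALSE  TRUE FALSE FALSE
[2,]  TRUE FALSE FALSE FALSE
[3,] FALSE FALSE  TRUE  TRUE
[4,] FALSE FALSE FALSE FALSE
Count the "I" values seen so far along each row, so that every cell from the first "I" onward is at least 1:
out <- t(apply(I_true, 1, cumsum))
out
     a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace the values as needed:
output <- out
output[out >= 1] <- NA        # everything from the first "I" onward
output[output == 0] <- "V"    # cells before the first "I"
output[I_true] <- "I"         # restore the "I" cells
output[out >= 2] <- NA        # drop any "I" after the first one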
Your output:
output
     a   b   c   d
[1,] "V" "I" NA  NA
[2,] "I" NA  NA  NA
[3,] "V" "V" "I" NA
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
  a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
Running the same steps on this data gives:
output
     a   b   c   d
[1,] "V" "I" NA  NA
[2,] "I" NA  NA  NA
[3,] "I" NA  NA  NA
[4,] "V" "V" "V" "V"
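The steps above rebuild the result from the literals "V" and "I", so they only fit data that contains exactly those two values. As a minimal sketch of the same cumulative-sum idea that leaves the original values untouched, the helper below masks cells in place. The name mask_after_first and its marker argument are hypothetical, and the sketch assumes a data frame with at least two columns.

```r
# Hypothetical helper: set every cell after the first `marker` in its row to NA.
# Assumes `df` has at least two columns (apply() drops dimensions otherwise).
mask_after_first <- function(df, marker = "I") {
  hit <- as.matrix(df) == marker            # TRUE where the marker occurs
  hit[is.na(hit)] <- FALSE                  # existing NAs never count as the marker
  seen <- t(apply(hit, 1, cumsum))          # running count of markers along each row
  after <- seen >= 1 & !(hit & seen == 1)   # from the first marker on, minus the first marker itself
  df[after] <- NA
  df
}

sample_df <- data.frame(a = c("V","I","V","V"), b = c("I","V","V","V"),
                        c = c("V","V","I","V"), d = c("V","V","I","V"),
                        stringsAsFactors = FALSE)
masked_df <- mask_after_first(sample_df)
```

Because only the logical mask is computed from the marker, any other values in the rows are carried through unchanged.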
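Here is a brute-force approach, which is probably the easiest to think of but the least elegant. Anyway, here it is:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  # Column positions of "I" in the current row
  hits <- which(as.character(df[i, ]) == "I")
  # Skip rows without "I", and rows whose first "I" is already in the last column
  if (length(hits) > 0 && hits[1] < rowlength) {
    df[i, (hits[1] + 1):rowlength] <- NA
  }
}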
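Here is a possible answer using ddply from the plyr package:
library(plyr)
# Grouping by every column means each group holds identical rows.
# Note that ddply may return the rows in a different order than the input.
ddply(sample_df, .(a, b, c, d), function(x) {
  idx <- which(x[1, ] == 'I')[1] + 1   # column after the first 'I' (rows in a group are identical)
  if (!is.na(idx)) {                   # check that an 'I' was found
    if (idx <= ncol(x)) {              # prevent out of bounds
      x[, idx:ncol(x)] <- NA
    }
  }
  x
})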
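A plyr approach:
plyr::adply(sample_df, 1L, function(x) {
  if (all(x != "I"))
    return(x)
  x[1L:min(which(x == "I"))]
})
The function keeps only the columns up to the first "I"; adply() fills the dropped columns with NA when it combines the rows. You have to use the if because which(x == "I") is empty for rows without at least one "I", so min() has no valid column position to return.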
My solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x))   # which can be generalized for any string
  lt <- length(x)
  loc <- min(temp, 100) + 1            # 100 is an arbitrary number bigger than the actual column count
  # print(lt)                          # debug purposes
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}
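At first I tried the plyr::adply() and purrrlyr::by_row() functions to apply my toNA() function; my data.frame has more than 3 million rows.
Both were slow (for 1000 rows they took 9 seconds and 6 seconds respectively). These methods are slow even with a trivial function(x) x, and I am not sure where the overhead comes from.
So I tried the base::apply() function (result is my data set):
tibble::as_tibble(t(apply(result, 1, toNA)))
It takes only 0.2 seconds for 1000 rows.
I am not sure about the programming style, but for now this solution works for me.
Thanks for all your suggestions.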
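The toNA() function above hard-codes the "INVALID" string and relies on the arbitrary 100 as a sentinel. As a hedged sketch of how it might be generalized, the variant below takes the string as an argument and checks for a match explicitly. The name to_na_after, the marker parameter and the small result data frame are all made up for illustration. Note one deliberate difference: it uses exact equality rather than grep(), so it does not match substrings.

```r
# Hypothetical generalisation of toNA(); `marker` is an illustrative parameter name.
to_na_after <- function(x, marker = "INVALID") {
  hits <- which(unlist(x) == marker)   # exact match; NA entries are ignored by which()
  if (length(hits) > 0 && hits[1] < length(x)) {
    x[(hits[1] + 1):length(x)] <- NA   # blank everything to the right of the first hit
  }
  x
}

# Made-up stand-in for the real data set
result <- data.frame(a = c("OK", "INVALID"), b = c("INVALID", "OK"),
                     c = c("OK", "OK"), stringsAsFactors = FALSE)

# apply() returns one column per input row, hence the transpose
masked <- as.data.frame(t(apply(result, 1, to_na_after)), stringsAsFactors = FALSE)
```

The explicit length check replaces the min(temp, 100) trick, so the function no longer depends on the column count staying below a fixed number.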
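A pure base solution: we build a boolean matrix of `== "I"` or not, then a double cumulative sum along each row tells us where the NAs must go. The first cumsum is at least 1 from the first "I" onward; the second one only exceeds 1 strictly after that first "I":
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I", 1, function(x) cumsum(cumsum(x)))) > 1
result_df
#   a    b    c    d
# 1 V    I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V    V    I <NA>
# 4 V    V    V    V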
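Since the real data has millions of rows, it may be worth avoiding the per-row apply() call altogether. The following is only a sketch of one possible vectorized variant in base R, not something that has been benchmarked here: it finds the column of the first "I" in each row with max.col() and compares it against the column index matrix from col(). The variable names hit and first are illustrative.

```r
sample_df <- data.frame(a = c("V","I","V","V"), b = c("I","V","V","V"),
                        c = c("V","V","I","V"), d = c("V","V","I","V"),
                        stringsAsFactors = FALSE)

hit <- as.matrix(sample_df) == "I"     # TRUE where "I" occurs
hit[is.na(hit)] <- FALSE               # existing NAs never count as "I"

# Column index of the first "I" per row (hit * 1 turns the logicals into 0/1)
first <- max.col(hit * 1, ties.method = "first")
# Rows without any "I" are all ties at 0; point them at the last column so nothing is masked
first[rowSums(hit) == 0] <- ncol(hit)

result_df <- sample_df
# `first` has one entry per row, so it recycles down the rows of col(hit)
result_df[col(hit) > first] <- NA
```

Whether this is actually faster than the apply()-based versions depends on the data, so it would need timing on the real data set.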