R column comparison and filter
I have a data frame that looks like this, with dates as the column names:
2013_11 | 2013_12 | 2014_01 | 2014_02 | 2014_03 |
NA | NA | 3 | 3 | N |
2 | 2 | 3 | NA | NA |
NA | NA | NA | NA | NA |
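I need to write some kind of logical function to filter out the rows I'm looking for. I only need to pull the rows that have no number in any 2013 month (the first two columns) but DO have at least one number in any of the 2014 columns.
So the code should return only the first row: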
NA | NA | 3 | 3 | N |
I can't work out the most efficient way to do this, since I have about 8 million rows.
You could try
indx1 <- grep('2013', colnames(df))
indx2 <- grep('2014', colnames(df))
df[!rowSums(!is.na(df[indx1]))&!!rowSums(!is.na(df[indx2])),]
# 2013_11 2013_12 2014_01 2014_02 2014_03
#1 NA NA 3 3 N
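To make the one-liner above easier to follow, here is a minimal sketch that unpacks it into named steps. The intermediate names `n2013`, `n2014` and `keep` are my own and not part of the original answer, and the sample data is rebuilt inline so the snippet runs on its own.

```r
# Sample data matching the Data section (check.names = FALSE keeps the date-like names)
df <- data.frame(`2013_11` = c(NA, 2L, NA), `2013_12` = c(NA, 2L, NA),
                 `2014_01` = c(3L, 3L, NA), `2014_02` = c(3L, NA, NA),
                 `2014_03` = c("N", NA, NA),
                 check.names = FALSE, stringsAsFactors = FALSE)

indx1 <- grep('2013', colnames(df))
indx2 <- grep('2014', colnames(df))

# Count the non-NA cells per row within each group of columns
n2013 <- rowSums(!is.na(df[indx1]))
n2014 <- rowSums(!is.na(df[indx2]))

# !count is TRUE when the count is 0, and !!count is TRUE when it is above 0,
# so the one-liner is equivalent to this explicit comparison
keep <- n2013 == 0 & n2014 > 0
df[keep, ]
```

Since `rowSums` and `is.na` are vectorised, this form should scale to millions of rows far better than a row-by-row loop, though I have not benchmarked it on data of that size.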
Or you could use
i1 <- Reduce(`&`, lapply(df[indx1], function(x) is.na(x)))
# Use `|` here: a row qualifies if at least one 2014 column is non-NA
i2 <- Reduce(`|`, lapply(df[indx2], function(x) !is.na(x)))
df[i1 & i2, ]
# 2013_11 2013_12 2014_01 2014_02 2014_03
#1 NA NA 3 3 N
Data
df <- structure(list(`2013_11` = c(NA, 2L, NA), `2013_12` = c(NA, 2L,
NA), `2014_01` = c(3L, 3L, NA), `2014_02` = c(3L, NA, NA), `2014_03` = c("N",
NA, NA)), .Names = c("2013_11", "2013_12", "2014_01", "2014_02",
"2014_03"), class = "data.frame", row.names = c(NA, -3L))
Have you considered using grep? I would create a function to do this, as shown below, using R's any, all, is.na and an if statement inside a for loop.
grep_function <- function(src, condition1, condition2) {
  # Loop over the rows of the data frame
  for (i in seq_len(nrow(src))) {
    data_condition1 <- src[i, grepl(condition1, names(src)), drop = FALSE]
    data_condition2 <- src[i, grepl(condition2, names(src)), drop = FALSE]
    # all(...) and any(...) each return a single value, so && is safe here
    if (all(is.na(data_condition1)) && any(!is.na(data_condition2))) {
      # do something here to each individual observation
    } else {
      # do something for those that do not meet your criteria
    }
  }
}
Example: grep_function(your-data-here, "2013", "2014")
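As one possible way to fill in the placeholders, the sketch below records whether each row meets the condition and returns a logical vector that can be used to subset the data. The name `flag_rows` is hypothetical, and with around 8 million rows a row-by-row loop like this will probably be much slower than a vectorised approach.

```r
# Hypothetical helper: TRUE for rows where every column matching `condition1`
# is NA and at least one column matching `condition2` is not NA
flag_rows <- function(src, condition1, condition2) {
  cols1 <- grepl(condition1, names(src))
  cols2 <- grepl(condition2, names(src))
  keep <- logical(nrow(src))
  for (i in seq_len(nrow(src))) {
    data_condition1 <- src[i, cols1, drop = FALSE]
    data_condition2 <- src[i, cols2, drop = FALSE]
    keep[i] <- all(is.na(data_condition1)) && any(!is.na(data_condition2))
  }
  keep
}

# Sample data with the same shape as the question's example
df <- data.frame(`2013_11` = c(NA, 2L, NA), `2013_12` = c(NA, 2L, NA),
                 `2014_01` = c(3L, 3L, NA), `2014_02` = c(3L, NA, NA),
                 `2014_03` = c("N", NA, NA),
                 check.names = FALSE, stringsAsFactors = FALSE)

df[flag_rows(df, "2013", "2014"), ]
```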
Or you could use SQL (it's a bit verbose, but may be more readable for some people):
require('sqldf')
a=data.frame("2013_11"=c(NA,2,NA), "2013_12"=c(NA,2,NA), "2014_01" =c(3,3,NA),
"2014_02" =c(3,NA,NA) ,"2014_03" =c(NA,NA,NA))
sqldf("select * from a where
case when X2013_11 is null then 0 else 1 end +
case when X2013_12 is null then 0 else 1 end = 0
and
case when X2014_01 is null then 0 else 1 end +
case when X2014_02 is null then 0 else 1 end +
case when X2014_03 is null then 0 else 1 end > 0
")
X2013_11 X2013_12 X2014_01 X2014_02 X2014_03
NA NA 3 3 NA
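Because the query grows with every extra month, one option is to build the `where` clause from the column names instead of typing it out. This is only a sketch: `null_count_sql` is a hypothetical helper of my own, and it assumes the column names are valid SQL identifiers, such as the `X2013_11` style names that `data.frame` produces by default.

```r
# Hypothetical helper: SQL expression that counts the non-null cells among `cols`
null_count_sql <- function(cols) {
  paste(sprintf("case when %s is null then 0 else 1 end", cols), collapse = " + ")
}

a <- data.frame("2013_11" = c(NA, 2, NA), "2013_12" = c(NA, 2, NA),
                "2014_01" = c(3, 3, NA), "2014_02" = c(3, NA, NA),
                "2014_03" = c(NA, NA, NA))

# data.frame() prefixes the date-like names with "X", e.g. X2013_11
cols_2013 <- grep("2013", names(a), value = TRUE)
cols_2014 <- grep("2014", names(a), value = TRUE)

query <- sprintf("select * from a where %s = 0 and %s > 0",
                 null_count_sql(cols_2013), null_count_sql(cols_2014))
cat(query, "\n")
```

The resulting string can then be passed to `sqldf(query)` in place of the hand-written statement.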