R: extract the first cell in each row that meets a criterion
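The columns in my data frame are ordered by relevance, with the leftmost column being the most relevant. I am trying to extract the most relevant item that starts with 'D'.
Here is an example: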
df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L,
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269",
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L,
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431",
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L,
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333",
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L,
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703",
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L,
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"),
TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1",
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA,
10L), class = "data.frame")
> df
   TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1    K704    R18   F102   E834
2    D127   D123   Z720
3    K701   K703   K658   K766   F102   E877
4    K922   D649   K296   K703   F101   Z720
5   E1133   H431   H333
6    D123   D123
7    H269   H570
8    H269   H570
9   R0989
10   E611
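The result vector should report NA when a row has no match, and the first (leftmost) matching item when it has one. I can find the items I am interested in, but I am struggling to extract the first (leftmost) one for each row.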
> sapply(df, function (x) grepl("D", x))
TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
The result should be:
c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)
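Edit: How would this extend if I wanted the pattern to be c("D", "K")? I get a message saying that only the first element is used. (Answer: change the pattern to "D|K".)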
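As a minimal sketch of that answer: since only the first element of a pattern vector is used, the alternatives can be collapsed into a single regular expression. The small data frame `codes` and the names `prefixes` and `hits_dk` are made up for illustration, and anchoring with `^` is an added assumption that the letter must be the first character of the code.

```r
# Hypothetical two-row excerpt in the same shape as df
codes <- data.frame(TDIAG1 = c("K704", "E611"),
                    TDIAG2 = c("D123", ""),
                    stringsAsFactors = FALSE)

# Collapse the alternatives into one regular expression
prefixes <- c("D", "K")
pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")  # "^(D|K)"

# One logical column per column of codes
hits_dk <- sapply(codes, function(x) grepl(pattern, x))
```

Dropping the `^(` and `)` parts gives the plain "D|K" pattern, which matches either letter anywhere in a code.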
Edit 2: Also, how can I find the leftmost "D" code in each row while excluding codes from a pre-specified list (e.g. excluding c("D123", "D090", "D111"))?
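One possible way to express the exclusion, sketched on a made-up three-row data frame: a cell counts as a hit only if it starts with "D" and is not on the exclusion list. The names `codes`, `excluded`, `hits_excl` and `first_d` are illustrative, not part of the original question.

```r
# Hypothetical data in the same shape as df
codes <- data.frame(TDIAG1 = c("D123", "K704", "D090"),
                    TDIAG2 = c("D649", "D127", ""),
                    stringsAsFactors = FALSE)
excluded <- c("D123", "D090", "D111")

# TRUE where a cell starts with "D" and is not excluded
hits_excl <- vapply(codes,
                    function(x) grepl("^D", x) & !(x %in% excluded),
                    logical(nrow(codes)))

# Leftmost hit per row; rows without any hit get NA
first_col <- replace(max.col(hits_excl, "first"), rowSums(hits_excl) == 0, NA)
first_d <- as.matrix(codes)[cbind(seq_len(nrow(codes)), first_col)]
```

Here the first row skips the excluded "D123" and picks "D649" instead, and the third row has no remaining "D" code, so it yields NA.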
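Edit 3: I wrote a small function that incorporates all of the answers. It works for what I am doing, and it may be of use to someone else at some stage.
The function: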
FLAG <- function(data, tomatch, Exact.tomatch = TRUE, Exclude = NA,
                 Exact.excl = TRUE, Return = c("01", "FirstValue", "Count")) {
  # Anchor the patterns when whole codes must match, then join them with "|"
  if (Exact.tomatch) tomatch <- paste0("^", tomatch, "$")
  if (length(tomatch) > 1) tomatch <- paste(tomatch, collapse = "|")
  n <- nrow(data)
  has.excl <- !is.na(Exclude[1])
  if (!has.excl) {
    hits <- vapply(data, grepl, logical(n), pattern = tomatch)
  } else if (Exact.excl) {
    # Blank out codes that are exactly in Exclude before matching
    kept <- lapply(data, function(x) replace(x, x %in% Exclude, NA))
    hits <- vapply(kept, grepl, logical(n), pattern = tomatch)
  } else {
    # Blank out codes that match any Exclude pattern before matching
    excl <- vapply(data, grepl, logical(n), pattern = paste(Exclude, collapse = "|"))
    hits <- vapply(replace(data, excl, NA), grepl, logical(n), pattern = tomatch)
  }
  out <- NA
  if (Return[1] == "01") out <- replace(rowSums(hits), rowSums(hits) > 1, 1)
  if (Return[1] == "Count") out <- rowSums(hits)
  if (Return[1] == "FirstValue") {
    first <- replace(max.col(hits, "first"), rowSums(hits) == 0, NA)
    out <- data[cbind(seq_len(n), first)]
  }
  out
}
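The function takes a data frame as input, followed by vectors of what to look for and what to exclude, and flags saying whether each of these should be matched exactly. Finally, it can return the first (leftmost) match, a count of all matches, or just a flag indicating whether any match was found.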
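The count and the flag both fall out of the logical hit matrix. A small sketch with a hand-written matrix (`toy_hits`, `match_count` and `match_flag` are illustrative names only):

```r
# Hypothetical hit matrix: one row per record, one column per code column
toy_hits <- matrix(c(TRUE,  TRUE,
                     FALSE, FALSE,
                     TRUE,  FALSE),
                   nrow = 3, byrow = TRUE)

# "Count": number of matching cells in each row
match_count <- rowSums(toy_hits)

# "01": 1 if the row has at least one match, 0 otherwise
match_flag <- as.integer(match_count > 0)
```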
Example 1. Look in df for any code starting with D or K (not limited to exact matches), but exclude K701, K703 and D127 (matching these exactly), and return the first (leftmost) value:
FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"), Exact.excl=T, Return="FirstValue")
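Example 2. Look in df for any code starting with D or H (not limited to exact matches), but exclude any code containing H3 (not matched exactly), and return the first (leftmost) value: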
FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"), Exact.excl=F, Return="FirstValue")
Since you are operating by row, you need apply, not sapply. Also, using the value = TRUE argument in grep will return the actual string you are looking for.
> apply(df, 1, function(x) grep("D", x, value=T)[1])
1 2 3 4 5 6 7 8 9 10
NA "D127" NA "D649" NA "D123" NA NA NA NA
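Note that the pattern "D" matches a D anywhere in a code. A hedged sketch of the difference an anchor makes, using a made-up data frame in which "ED12" is a hypothetical code containing D in second position:

```r
# Hypothetical data; "ED12" is invented to show the effect of the anchor
codes <- data.frame(TDIAG1 = c("K704", "D127", "ED12"),
                    TDIAG2 = c("R18", "D123", ""),
                    stringsAsFactors = FALSE)

# Unanchored: a "D" anywhere in the code counts, so "ED12" is returned
anywhere <- apply(codes, 1, function(x) grep("D", x, value = TRUE)[1])

# Anchored: only codes whose first character is "D" count
leading <- apply(codes, 1, function(x) grep("^D", x, value = TRUE)[1])
```

For the codes in the question the two patterns give the same result, because D only ever appears as the first character.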
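There is no need to run across every row. Running grepl nrow(df) times will likely be much slower than running it ncol(df) times with vapply (which is itself even quicker than its cousin sapply). For example: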
hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]
#[1] NA "D127" NA "D649" NA "D123" NA NA NA NA
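To see why this works, the steps can be traced on a tiny matrix. This is only an illustrative sketch; `toy_codes`, `toy_hits`, `first_col` and `picked` are made-up names.

```r
# Hypothetical 3 x 3 character matrix of codes
toy_codes <- matrix(c("K704", "D127", "F102",
                      "D123", "D649", "Z720",
                      "E611", "",     ""),
                    nrow = 3, byrow = TRUE)

# Logical matrix of cells that start with "D"
toy_hits <- matrix(grepl("^D", toy_codes), nrow = nrow(toy_codes))

# Column of the first TRUE in each row; an all-FALSE row is a tie,
# and ties.method = "first" then reports column 1
first_col <- max.col(toy_hits, ties.method = "first")

# Rows without any hit are set to NA so that they select nothing
first_col <- replace(first_col, rowSums(toy_hits) == 0, NA)

# A two-column (row, column) index matrix picks one cell per row
picked <- toy_codes[cbind(seq_len(nrow(toy_codes)), first_col)]
```

The replace() step matters: without it, the third row would wrongly return its first cell, "E611".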
Benchmarking on a data.frame with one million rows:
df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]})
# user system elapsed
# 1.904 0.120 2.024
system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
# user system elapsed
# 23.141 0.172 23.317
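Before comparing timings it is worth confirming that both approaches return the same values. A rough sketch on a smaller resampled data frame; `by_row` and `by_col` are hypothetical wrapper names, and the seed and the sample size of 1e4 are arbitrary choices:

```r
set.seed(1)  # arbitrary seed, only to make the resampling repeatable

# Hypothetical data in the same shape as df
codes <- data.frame(TDIAG1 = c("K704", "D127", "E611"),
                    TDIAG2 = c("R18", "D123", ""),
                    stringsAsFactors = FALSE)
big <- codes[sample(nrow(codes), 1e4, replace = TRUE), ]

# Row-wise approach: one grep() call per row
by_row <- function(d) {
  unname(apply(d, 1, function(x) grep("D", x, value = TRUE)[1]))
}

# Column-wise approach: one grepl() call per column
by_col <- function(d) {
  hits <- vapply(d, grepl, logical(nrow(d)), pattern = "D")
  first <- replace(max.col(hits, "first"), rowSums(hits) == 0, NA)
  d[cbind(seq_len(nrow(d)), first)]
}

same <- identical(by_row(big), by_col(big))
```

Each function can then be wrapped in system.time() to repeat the comparison on your own machine.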