R: extract the first cell in each row that meets a criterion

Question

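The columns of my data frame are ordered by relevance, with the leftmost column being the most relevant. I am trying to extract, for each row, the most relevant item that starts with 'D'.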

Here is an example:

df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L, 
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269", 
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L, 
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431", 
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L, 
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333", 
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L, 
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703", 
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"), 
    TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1", 
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA, 
10L), class = "data.frame")


    > df
   TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1    K704    R18   F102   E834              
2    D127   D123   Z720                     
3    K701   K703   K658   K766   F102   E877
4    K922   D649   K296   K703   F101   Z720
5   E1133   H431   H333                     
6    D123   D123                            
7    H269   H570                            
8    H269   H570                            
9   R0989                                   
10   E611

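The result vector should report NA when there is no match and the first (leftmost) item when there is one. I can find the items I am interested in, but I am having trouble extracting the first (leftmost) one for each row.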

> sapply(df,  function (x) grepl("D", x))
      TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
 [1,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [2,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [3,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [4,]  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
 [5,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [6,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [7,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [8,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [9,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
[10,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE

The result should be:

c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)

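Edit: How does this extend if I want the pattern to be c("D", "K")? I get a message saying that only the first element is used. (Answer: change the pattern to "D|K".)

Edit 2: Also, how does this extend when I want to find the leftmost "D" code in each row but exclude codes from a pre-specified list (e.g. exclude c("D123", "D090", "D111"))?

Edit 3: I wrote a small function that incorporates all the answers. It works for what I am doing. Maybe it will benefit someone else at some stage.

Function: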

FLAG <- function(data, tomatch, Exact.tomatch = TRUE, Exclude = NA, Exact.excl = TRUE,
                 Return = c("01", "FirstValue", "Count")) {
  # Anchor each pattern for exact matching, then join alternatives with "|"
  if (Exact.tomatch) tomatch <- paste0("^", tomatch, "$")
  if (length(tomatch) > 1) tomatch <- paste(tomatch, collapse = "|")
  # Check for exclusions before collapsing them into a single regex
  no_exclude <- is.na(Exclude[1])
  if (!Exact.excl) Exclude <- paste(Exclude, collapse = "|")
  n <- nrow(data)

  if (no_exclude) {
    hits <- vapply(data, grepl, logical(n), pattern = tomatch)
  } else if (Exact.excl) {
    # Set exactly excluded codes to NA, then search
    kept <- lapply(data, function(x) replace(x, x %in% Exclude, NA))
    hits <- vapply(kept, grepl, logical(n), pattern = tomatch)
  } else {
    # Set codes matching the exclusion regex to NA, then search
    excluded <- vapply(data, grepl, logical(n), pattern = Exclude)
    hits <- vapply(replace(data, excluded, NA), grepl, logical(n), pattern = tomatch)
  }

  out <- NA
  if (Return[1] == "01") out <- replace(rowSums(hits), rowSums(hits) > 1, 1)
  if (Return[1] == "Count") out <- rowSums(hits)
  if (Return[1] == "FirstValue") out <- data[cbind(seq_len(n), replace(max.col(hits, "first"), rowSums(hits) == 0, NA))]
  out
}

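The function takes a data frame or list as input, followed by a vector of what to look for, a vector of what to exclude, and whether each of these should be matched exactly. Finally, it can return the first (leftmost) match, the count of all matches, or just a flag indicating whether any match was found.

Example 1. Look in df for any code matching D or K (the pattern is unanchored, so it need not be an exact match), but exclude K701, K703 and D127 (exact matches on these), and return the first (leftmost) value: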

FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"),  Exact.excl=T, Return="FirstValue")

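Example 2. Look in df for any code matching D or H (not restricted to exact matches), but exclude any code containing H3 (not an exact match), and return the first (leftmost) value: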

FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"),  Exact.excl=F, Return="FirstValue")

Answer 1

Since you are operating by row, you want apply, not sapply.

Also, using the value = TRUE argument in grep will return the actual strings you are looking for.

> apply(df, 1, function(x) grep("D", x, value=T)[1])
     1      2      3      4      5      6      7      8      9     10 
    NA "D127"     NA "D649"     NA "D123"     NA     NA     NA     NA
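
The question asks for codes that start with 'D', whereas the pattern "D" matches a D anywhere in the string. A minimal sketch of the difference, using a small made-up data frame (`codes` and its values are hypothetical, not the question's data), with the pattern anchored by `^`:

```r
# Small illustrative data frame (hypothetical codes, not the question's data)
codes <- data.frame(
  c1 = c("K704", "AD12", "D127"),
  c2 = c("D649", "E611", "D123"),
  stringsAsFactors = FALSE
)

# Unanchored: "D" matches anywhere in the string, so "AD12" counts as a hit
unanchored <- apply(codes, 1, function(x) grep("D", x, value = TRUE)[1])
# values: "D649" "AD12" "D127"

# Anchored: "^D" only matches codes that start with D
anchored <- apply(codes, 1, function(x) grep("^D", x, value = TRUE)[1])
# values: "D649" NA "D127"

# Several prefixes can be combined with alternation inside a group
d_or_k <- apply(codes, 1, function(x) grep("^(D|K)", x, value = TRUE)[1])
# values: "K704" NA "D127"
```

With the data in the question every code begins with its letter, so both patterns give the same result there; the anchor only matters if a D can appear later in a code.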

Answer 2

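There is no need to loop over every row. Running grep nrow(df) times is potentially much slower than running grepl ncol(df) times through vapply (which is itself quicker than its cousin sapply). For example: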

hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]
#[1] NA     "D127" NA     "D649" NA     "D123" NA     NA     NA     NA
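
The one-liner above packs several steps together. As a rough step-by-step sketch, here is the same idea on a reduced copy of the data (`df_small` is an assumption made for brevity: four rows, three columns, stored as character rather than factor):

```r
# Reduced copy of the question's data (hypothetical subset, character columns)
df_small <- data.frame(
  TDIAG1 = c("K704", "D127", "K922", "D123"),
  TDIAG2 = c("R18",  "D123", "D649", "D123"),
  TDIAG3 = c("F102", "Z720", "K296", ""),
  stringsAsFactors = FALSE
)

# Step 1: one grepl call per column gives a logical matrix (rows x columns)
hits <- vapply(df_small, grepl, logical(nrow(df_small)), pattern = "D")

# Step 2: column index of the first TRUE in each row
# (ties.method = "first" picks the leftmost of the tied maxima)
first_col <- max.col(hits, ties.method = "first")

# Step 3: a row with no TRUE at all would report column 1, so set it to NA
first_col[rowSums(hits) == 0] <- NA

# Step 4: index with a two-column (row, column) matrix to pull one cell per row
result <- df_small[cbind(seq_len(nrow(df_small)), first_col)]
# result: NA "D127" "D649" "D123"
```

Step 3 is the part that is easy to miss: in an all-FALSE row every column ties for the maximum, so max.col with "first" returns 1 rather than signalling that nothing matched.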

Benchmarking on a one-million-row data.frame:

df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]})
#   user  system elapsed 
#  1.904   0.120   2.024 

system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
#   user  system elapsed 
# 23.141   0.172  23.317
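
The same column-wise approach can probably be stretched to cover the two follow-up edits in the question (several prefixes, and a list of codes to exclude). The helper below is only a sketch: `first_match`, its arguments and the small `codes` data frame are hypothetical names introduced here, and exclusions are assumed to be exact matches.

```r
# Hypothetical helper (not from the answers): leftmost match per row,
# with optional exact-match exclusions
first_match <- function(data, prefixes, exclude = character(0)) {
  # Build one anchored pattern such as "^(D|K)" from the prefixes
  pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
  n <- nrow(data)
  # A cell is a hit if it matches the pattern and is not on the exclusion list
  hits <- vapply(
    data,
    function(col) grepl(pattern, col) & !(col %in% exclude),
    logical(n)
  )
  first_col <- max.col(hits, ties.method = "first")
  first_col[rowSums(hits) == 0] <- NA  # rows without any hit
  data[cbind(seq_len(n), first_col)]
}

# Reduced, illustrative copy of the question's data
codes <- data.frame(
  TDIAG1 = c("K704", "D127", "K701", "D123"),
  TDIAG2 = c("R18",  "D123", "K703", "D123"),
  stringsAsFactors = FALSE
)

only_d   <- first_match(codes, "D", exclude = "D123")
# only_d: NA "D127" NA NA
d_or_k   <- first_match(codes, c("D", "K"))
# d_or_k: "K704" "D127" "K701" "D123"
filtered <- first_match(codes, c("D", "K"), exclude = c("K701", "D127"))
# filtered: "K704" "D123" "K703" "D123"
```

Folding the exclusion into the logical matrix avoids modifying the data, so the value returned for each row is always taken from the original cells.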
