Highly selective filtering based on row contents
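I have a dataset (rather untidy, but not my work; I'm helping a colleague) with several rows of values. Some rows are duplicated in one column, but differ in the other columns because a "*" has been added to some elements. Reproduced below: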
a <- c("2020", "Rose", "r","r","s","s","i","i","r")
b <- c("2020", "Rose","r*","r*","s*","s*","s*","s*","s*")
c <- c("2020", "Lily","r","r","s","s","i","i","r")
d <- c("2020", "Tulip","r*","r*","r*","r*","s*","r*","r*")
e <- c("2020", "Tulip","s","s","r","s","s","r","r")
data <- rbind(a,b,c,d,e)
So my data frame looks like this...
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
a "2020" "Rose" "r" "r" "s" "s" "i" "i" "r"
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
e "2020" "Tulip" "s" "s" "r" "s" "s" "r" "r"
I need to remove the rows that are duplicated in column 2 ("Rose", "Lily", etc.) and selectively keep the rows that carry a *, so that it looks like this...
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
I feel that a function wrapped in lapply might be the right way to go, but I don't know how to proceed! Any ideas?
You can try this. For the second condition (the *s), it only checks column 3, since the stars seem to be all or none within a row.
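# Count how often each name in column 2 occurs
tbl <- table( data[,2] )
# Names that occur more than once
rmv <- names( tbl[ tbl > 1 ] )
# Drop rows whose name is duplicated and whose column 3 has no "*"
data[ !( data[,2] %in% rmv & !grepl("\\*", data[,3]) ), ]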
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
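A note on matching the literal asterisk: in a regular expression `*` is a quantifier, and inside an R string the backslash itself has to be doubled, so the pattern is written "\\*". The following is a minimal sketch of three ways to test for a star that should agree with each other; the vector x is made up for illustration and is not part of the question's data.

```r
# x is a hypothetical example vector, not taken from the question
x <- c("r", "r*", "s*", "i")

# Regular expression with the asterisk escaped
escaped <- grepl("\\*", x)

# Literal (non-regex) match
literal <- grepl("*", x, fixed = TRUE)

# Character class, which also treats the asterisk literally
bracket <- grepl("[*]", x)

print(escaped)
```

Using fixed = TRUE avoids the escaping question altogether, which may be easier to read when no other regex features are needed.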
In case it has to select based on any * in the row (at least one), use this:
data[ !( data[,2] %in% rmv & apply( data[,3:9], 1, function(x)
  !any(grepl("\\*", x)) )), ]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
b "2020" "Rose" "r*" "r*" "s*" "s*" "s*" "s*" "s*"
c "2020" "Lily" "r" "r" "s" "s" "i" "i" "r"
d "2020" "Tulip" "r*" "r*" "r*" "r*" "s*" "r*" "r*"
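If this filter is needed more than once, the "at least one star" rule could be wrapped in a small function. This is only a sketch: keep_starred() and its arguments key_col and value_cols are hypothetical names introduced here, not part of the answer above, and it assumes the input is a character matrix like the one in the question.

```r
# Rebuild the matrix from the question
a <- c("2020", "Rose", "r", "r", "s", "s", "i", "i", "r")
b <- c("2020", "Rose", "r*", "r*", "s*", "s*", "s*", "s*", "s*")
c <- c("2020", "Lily", "r", "r", "s", "s", "i", "i", "r")
d <- c("2020", "Tulip", "r*", "r*", "r*", "r*", "s*", "r*", "r*")
e <- c("2020", "Tulip", "s", "s", "r", "s", "s", "r", "r")
data <- rbind(a, b, c, d, e)

# keep_starred() is a hypothetical helper (an assumption for illustration)
keep_starred <- function(m, key_col = 2, value_cols = 3:ncol(m)) {
  # Rows whose key occurs more than once in the key column
  counts <- table(m[, key_col])
  dup <- m[, key_col] %in% names(counts[counts > 1])
  # TRUE for rows with at least one "*" in the value columns
  starred <- apply(m[, value_cols, drop = FALSE], 1,
                   function(x) any(grepl("*", x, fixed = TRUE)))
  # Drop only rows that are duplicated AND carry no star
  m[!(dup & !starred), , drop = FALSE]
}

res <- keep_starred(data)
print(rownames(res))
```

The drop = FALSE arguments keep the result a matrix even if only one row or one value column remains.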
First, you talk about a data frame, but so far you are using a matrix. So let's make a data frame first:
df <- as.data.frame(data)
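Second, we can use by(), which works essentially like lapply(split(x, g), FUN). As the splitting factor we use the first two columns, 1:2, and apply grepl() to each slice. Finally, we rbind() the pieces.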
df <- by(df, df[1:2], \(x) {
  # In groups with more than one row, keep only the starred rows
  if (nrow(x) > 1) {
    x[grepl('\\*', x$V3), ]
  } else x}) |> (\(.) do.call(rbind, .))()
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# c 2020 Lily r r s s i i r
# b 2020 Rose r* r* s* s* s* s* s*
# d 2020 Tulip r* r* r* r* s* r* r*
To clean up the row names, append:
|> `rownames<-`(NULL)
Note: R version 4.1.2 (2021-11-01).
Data:
data <- structure(c("2020", "2020", "2020", "2020", "2020", "Rose", "Rose",
"Lily", "Tulip", "Tulip", "r", "r*", "r", "r*", "s", "r", "r*",
"r", "r*", "s", "s", "s*", "s", "r*", "r", "s", "s*", "s", "r*",
"s", "i", "s*", "i", "s*", "s", "i", "s*", "i", "r*", "r", "r",
"s*", "r", "r*", "r"), .Dim = c(5L, 9L), .Dimnames = list(c("a",
"b", "c", "d", "e"), NULL))
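Since the question asked about lapply, the same split-apply-combine step could also be written with lapply(split(...)) directly. This is a rough sketch rather than the answer's own code: it splits on the name column V2 only (an assumption made to keep it short) and uses function(x) instead of \(x) and |>, so it should also run on R versions before 4.1.

```r
# Rebuild the question's matrix and convert it to a data frame
a <- c("2020", "Rose", "r", "r", "s", "s", "i", "i", "r")
b <- c("2020", "Rose", "r*", "r*", "s*", "s*", "s*", "s*", "s*")
c <- c("2020", "Lily", "r", "r", "s", "s", "i", "i", "r")
d <- c("2020", "Tulip", "r*", "r*", "r*", "r*", "s*", "r*", "r*")
e <- c("2020", "Tulip", "s", "s", "r", "s", "s", "r", "r")
df <- as.data.frame(rbind(a, b, c, d, e))

# Split by name; in groups with more than one row keep only starred rows
pieces <- lapply(split(df, df$V2), function(x) {
  if (nrow(x) > 1) x[grepl("*", x$V3, fixed = TRUE), ] else x
})

# Combine the pieces again and reset the row names
res <- do.call(rbind, pieces)
rownames(res) <- NULL
print(res)
```

As in the by() version, the groups come back in alphabetical order of the name (Lily, Rose, Tulip) rather than in the original row order.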