Remove a data frame row in R based on a match over multiple rows
I have a data frame like this:
content ChatPosition
This is a start line START
This is a middle line MIDDLE
This is a middle line MIDDLE
This is the last line END
This is a start line with a subsequent middle or end START
This is another start line without a middle or an end START
This is a start line START
This is a middle line MIDDLE
This is the last line END
content <- c("This is a start line" , "This is a middle line" , "This is a middle line" ,"This is the last line" ,
"This is a start line with a subsequent middle or end" , "This is another start line without a middle or an end" ,
"This is a start line" , "This is a middle line" , "This is the last line")
ChatPosition <- c("START" , "MIDDLE" , "MIDDLE" , "END" , "START" ,"START" , "START" ,"MIDDLE" , "END")
df <- data.frame(content, ChatPosition)
I want to remove rows that contain a START, but only if the next row does not contain MIDDLE or END in the ChatPosition column. The desired result is:
content ChatPosition
This is a start line START
This is a middle line MIDDLE
This is a middle line MIDDLE
This is the last line END
This is a start line START
This is a middle line MIDDLE
This is the last line END
nrow(df)
# Stop at the second-to-last row so that jjj + 1 never runs past the end.
for (jjj in seq_len(nrow(df) - 1))
{
  # Check for two consecutive START rows.
  if (df$ChatPosition[jjj] == "START" && df$ChatPosition[jjj + 1] == "START")
  {
    print(df$content[jjj])
  }
}
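With the code above I am able to print the two rows I want to delete. What is the most elegant solution for removing these rows?
Also, is a for loop with a nested if the right approach here, or is there a library that makes this kind of thing easier?
Regards,
Jonathan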
This should work for you.
df[!(as.character(df$ChatPosition) == "START" &
c(tail(as.character(df$ChatPosition), -1), "END") == "START"), ]
content ChatPosition
1 This is a start line START
2 This is a middle line MIDDLE
3 This is a middle line MIDDLE
4 This is the last line END
7 This is a start line START
8 This is a middle line MIDDLE
9 This is the last line END
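The first argument inside [] is a logical vector that tells R which rows to keep. I use tail(, -1) to get the next observation of df$ChatPosition for the comparison. Note that df$ChatPosition has to be converted to character in the second line so that "END" can be appended in the final position, because df$ChatPosition is a factor variable (the default for strings in data.frame() before R 4.0.0).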
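To make the indexing step concrete, here is a minimal sketch that builds the same logical vector piece by piece. The names pos, next_pos, drop and result are introduced only for illustration, the content column is shortened to placeholder text, and stringsAsFactors = FALSE is an assumption so that the column is plain character.

```r
# Rebuild the example positions; the content column is a shortened placeholder.
ChatPosition <- c("START", "MIDDLE", "MIDDLE", "END", "START", "START",
                  "START", "MIDDLE", "END")
df <- data.frame(content = paste("line", seq_along(ChatPosition)),
                 ChatPosition, stringsAsFactors = FALSE)

pos <- as.character(df$ChatPosition)

# Shift the column up by one and pad the last slot with "END" so lengths match.
next_pos <- c(tail(pos, -1), "END")

# TRUE where a START is immediately followed by another START.
drop <- pos == "START" & next_pos == "START"
which(drop)
# [1] 5 6

# Keep every row that is not flagged.
result <- df[!drop, ]
rownames(result)
# [1] "1" "2" "3" "4" "7" "8" "9"
```

Padding with "END" means a START in the very last row would be kept; padding with "START" instead would remove it, so the choice depends on how a trailing START should be treated.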
Using grep. You can compare this solution with the for loop on your real data set to see which is faster.
start_indices = grep("START",ChatPosition)
end_indices = grep("END",ChatPosition)
match_indices = sapply(end_indices,function(x) tail(start_indices[(start_indices-x)<0],1) )
match_indices
# [1] 1 7
del_indices = setdiff(start_indices,match_indices)
del_indices
# [1] 5 6
df_subset = df[-del_indices,]
df_subset
# content ChatPosition
# 1 This is a start line START
# 2 This is a middle line MIDDLE
# 3 This is a middle line MIDDLE
# 4 This is the last line END
# 7 This is a start line START
# 8 This is a middle line MIDDLE
# 9 This is the last line END
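The index-matching steps above can be wrapped in a function and checked against the shift-and-compare approach. This is only a sketch: drop_by_index and drop_by_shift are hypothetical names, and which(pos == ...) is swapped in for grep so that only exact labels match rather than any string containing the pattern.

```r
# Hypothetical helper: indices of START rows that no END row pairs with.
drop_by_index <- function(pos) {
  start_idx <- which(pos == "START")   # exact match instead of a regex match
  end_idx   <- which(pos == "END")
  # For each END, take the nearest START that precedes it.
  matched <- unlist(lapply(end_idx, function(x) tail(start_idx[start_idx < x], 1)))
  setdiff(start_idx, matched)
}

# Hypothetical helper: START rows directly followed by another START.
drop_by_shift <- function(pos) {
  which(pos == "START" & c(tail(pos, -1), "END") == "START")
}

pos <- c("START", "MIDDLE", "MIDDLE", "END", "START", "START",
         "START", "MIDDLE", "END")
drop_by_index(pos)
# [1] 5 6
identical(drop_by_index(pos), drop_by_shift(pos))
# [1] TRUE
```

The two helpers agree on the example data, but they encode different rules: the first removes every START that is not the nearest START before some END, while the second removes only a START that is directly followed by another START. On real data with incomplete chats (for example START, MIDDLE, START) they can give different answers.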
Another option:
library(dplyr)
filter(df, !(ChatPosition == "START" & lead(ChatPosition) == "START"))
which gives:
# content ChatPosition
#1 This is a start line START
#2 This is a middle line MIDDLE
#3 This is a middle line MIDDLE
#4 This is the last line END
#5 This is a start line START
#6 This is a middle line MIDDLE
#7 This is the last line END