How to delete only consecutive duplicate rows?
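I need to delete duplicates in my data frame, but only when they occur in consecutive rows. I tried the distinct() function, but it removes all duplicates, so I need different code that lets me specify that a duplicate should be removed only when it is consecutive, and only with respect to a specific column.
Here is a sample of my data: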
Subject Trial Event_type Code Time
23 VP02_RP 15 Picture face01_n 887969
24 VP02_RP 15 Sound mpossound_test5 888260
25 VP02_RP 15 Picture pospic_test5 906623
26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
27 VP02_RP 15 Response 15 958962
28 VP02_RP 18 Picture face01_p 987666
29 VP02_RP 18 Sound mpossound_test6 987668
30 VP02_RP 18 Picture negpic_test6 1006031
31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
32 VP02_RP 18 Response 15 1076642
33 VP02_RP 19 Response 13 1680887
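As you can see in rows 32 and 33, I have two consecutive Response events, and I only want to keep the first one. So I want to delete every row whose Event_type repeats that of the row directly above it.
How can I do this?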
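You can use the rleid function from data.table, which assigns a unique number to each run of consecutive identical Event_type values, and then use duplicated to keep only the first row of each run.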
res <- df[!duplicated(data.table::rleid(df$Event_type)), ]
res
# Subject Trial Event_type Code Time
#23 VP02_RP 15 Picture face01_n 887969
#24 VP02_RP 15 Sound mpossound_test5 888260
#25 VP02_RP 15 Picture pospic_test5 906623
#26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
#27 VP02_RP 15 Response 15 958962
#28 VP02_RP 18 Picture face01_p 987666
#29 VP02_RP 18 Sound mpossound_test6 987668
#30 VP02_RP 18 Picture negpic_test6 1006031
#31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
#32 VP02_RP 18 Response 15 1076642
In base R, the same run ids that rleid produces can be built with rle:
res <- df[!duplicated(with(rle(df$Event_type),rep(seq_along(values), lengths))),]
res
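To make the run-id idea concrete, here is a minimal base R sketch on a toy vector. The vector x and the names run_id and first_of_run are made up for illustration and are not part of the original answer.

```r
# Toy vector (hypothetical data, only for illustration)
x <- c("Picture", "Sound", "Sound", "Picture", "Response", "Response")

# rle() collapses x into runs of identical consecutive values
runs <- rle(x)

# Give every element the index of the run it belongs to
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 2 2 3 4 4

# The first element of each run is the one whose run id has not been seen yet
first_of_run <- !duplicated(run_id)
x[first_of_run]
# "Picture" "Sound" "Picture" "Response"
```

Note that the second "Picture" survives because it is not adjacent to the first one; only repeats inside the same run are dropped.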
A possible tidyverse solution:
library(tidyverse)
df1 <- data.frame(
stringsAsFactors = FALSE,
row.names = c("23","24","25","26","27",
"28","29","30","31","32","33"),
Subject = c("VP02_RP","VP02_RP","VP02_RP",
"VP02_RP","VP02_RP","VP02_RP","VP02_RP","VP02_RP",
"VP02_RP","VP02_RP","VP02_RP"),
Trial = c(15L, 15L, 15L, 15L, 15L, 18L, 18L, 18L, 18L, 18L, 19L),
Event_type = c("Picture","Sound","Picture",
"Nothing","Response","Picture","Sound","Picture",
"Nothing","Response","Response"),
Code = c("face01_n","mpossound_test5",
"pospic_test5","ev_mnegpos_adj_onset","15","face01_p",
"mpossound_test6","negpic_test6",
"ev_mposnegpos_adj_onset","15","13"),
Time = c(887969L,888260L,906623L,
928623L,958962L,987666L,987668L,1006031L,1028031L,
1076642L,1680887L)
)
# lag() returns NA for the first row, so keep that row explicitly;
# otherwise filter() would silently drop it
df1 %>%
filter(is.na(lag(Event_type)) | Event_type != lag(Event_type))
#> Subject Trial Event_type Code Time
#> 23 VP02_RP 15 Picture face01_n 887969
#> 24 VP02_RP 15 Sound mpossound_test5 888260
#> 25 VP02_RP 15 Picture pospic_test5 906623
#> 26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
#> 27 VP02_RP 15 Response 15 958962
#> 28 VP02_RP 18 Picture face01_p 987666
#> 29 VP02_RP 18 Sound mpossound_test6 987668
#> 30 VP02_RP 18 Picture negpic_test6 1006031
#> 31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
#> 32 VP02_RP 18 Response 15 1076642
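When comparing each value with the previous one, the first row needs care because it has no predecessor. The following base R sketch on a toy vector (x, prev, changed and keep are illustrative names, not from the original answer) shows what a one-step lag looks like and why the first comparison comes out as NA.

```r
# Toy vector (hypothetical data, only for illustration)
x <- c("Picture", "Sound", "Sound", "Picture", "Response", "Response")
n <- length(x)

# A one-step lag: every value shifted down by one, NA in the first slot
prev <- c(NA, x[-n])

# Comparing against NA gives NA, so the first element is neither TRUE nor FALSE
changed <- x != prev
changed
# NA TRUE FALSE TRUE TRUE FALSE

# Keep the first element explicitly, plus every element that differs from its predecessor
keep <- is.na(prev) | changed
x[keep]
# "Picture" "Sound" "Picture" "Response"
```

This sketch assumes x itself contains no missing values; if it might, the is.na(prev) test would also keep any row that follows an NA.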
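A data.table option:
library(data.table)
# shift() returns NA for the first row, so keep that row explicitly
setDT(df1)[is.na(shift(Event_type)) | Event_type != shift(Event_type)]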
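Since the question asks for a way to choose which columns count toward a duplicate, one way to wrap the idea in base R is sketched below. The helper drop_consecutive_dups and the toy data frame are hypothetical names introduced here for illustration; they do not come from the answers above or from any package.

```r
# Hypothetical helper: drop rows whose values in `cols` are identical
# to those of the row immediately above.
drop_consecutive_dups <- function(df, cols) {
  n <- nrow(df)
  if (n < 2L) return(df)
  # Build one key per row from the chosen columns. This assumes the values
  # never contain the separator "\r"; NA is pasted as the string "NA".
  key <- do.call(paste, c(df[cols], sep = "\r"))
  # Always keep the first row, then every row whose key differs from the previous one
  keep <- c(TRUE, key[-1L] != key[-n])
  df[keep, , drop = FALSE]
}

# Toy data (hypothetical, only for illustration)
toy <- data.frame(
  Trial = c(15L, 18L, 18L, 19L),
  Event_type = c("Response", "Picture", "Response", "Response"),
  stringsAsFactors = FALSE
)

# Only Event_type matters: the fourth row repeats the third and is dropped
by_event <- drop_consecutive_dups(toy, "Event_type")

# Trial and Event_type together: no two adjacent rows match on both, so all rows stay
by_both <- drop_consecutive_dups(toy, c("Trial", "Event_type"))
```

Passing more columns makes the rule stricter, because a row is dropped only when all listed columns match the row above.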