Remove first occurrence of duplicate column names in data.table

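What is the most concise way to remove columns with duplicate names while keeping the second occurrence (or, put another way, dropping the first occurrence)?

Given: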

library(data.table)
dt <- structure(
  list(
    CERT_NUMBER = c(999, NA, NA),
    FORENAME    = c("JOHN", NA, NA),
    SURNAME     = c("JOHNSON", NA, NA),
    START_DATE  = structure(c(16801L, NA, NA), class = c("IDate", "Date")),
    EXPIRY_DATE = structure(c(17166L, NA, NA), class = c("IDate", "Date")),
    ID          = c(1, 2, 3),
    FORENAME    = c("JOHN", "JACK", "ROB"),
    SURNAME     = c("JOHNSON", "JACKSON", "ROBINSON"),
    MONTH       = structure(c(16953L, 16953L, 16953L), class = c("IDate", "Date"))
  ),
  row.names = c(NA, -3L),
  class = c("data.table", "data.frame")
)
dt
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

I would like to keep the second occurrence of each duplicated column name:

#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

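If we didn't care which of the duplicates is kept, we could do the following, but it keeps the first occurrence of each duplicate, which is not what I want: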

dt[, .SD, .SDcols = unique(names(dt))]
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3 2016-06-01

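Thanks

If each duplicated column name appears only twice, you can use duplicated() with the fromLast = TRUE argument. It flags every occurrence except the last one, so negating it keeps the last (here, the second) occurrence: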

dt[, .SD, .SDcols = !duplicated(colnames(dt), fromLast = TRUE)]
#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01
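
To see why this only matches "the second occurrence" when names repeat at most twice, it may help to look at duplicated() on a plain character vector. The names below are made up for illustration and are not from the question; this is a small base R sketch, not a different method:

```r
# Hypothetical column names: "A" appears three times, "B" twice, "C" once.
nms <- c("A", "B", "A", "C", "B", "A")

# TRUE for every element that has an identical element later in the vector.
drop <- duplicated(nms, fromLast = TRUE)
drop
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

# Positions that survive: only the LAST occurrence of each name.
keep_idx <- which(!drop)
keep_idx
# [1] 4 5 6
nms[keep_idx]
# [1] "C" "B" "A"
```

With three copies of "A", the first and the second are both dropped, so a name that occurs more than twice needs a different approach if the second occurrence specifically is wanted.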

Here is a more flexible approach:

g <- as.integer(ave(names(dt), names(dt), FUN = length))

# for duplicated column names, keep the 1st occurrence
dt[, g == 1 | (rowid(names(dt)) == 1), with = FALSE]

# keep the 2nd occurrence
dt[, g == 1 | (rowid(names(dt)) == 2), with = FALSE]

# keep the 2nd and 3rd occurrences
dt[, g == 1 | (rowid(names(dt)) %in% c(2, 3)), with = FALSE]

# keep the last occurrence
dt[, g == rowid(names(dt)), with = FALSE]
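
If this pattern is needed in several places, the same logic could be wrapped in a small function. The name keep_occurrence and its keep argument are hypothetical (they are not part of data.table), and the example table is made up. This is only a sketch of one way to package the idea above, using table() instead of ave() to count the names:

```r
library(data.table)

# Hypothetical helper: for column names that appear more than once, keep only
# the occurrences whose index is listed in `keep`. Columns with a unique name
# are always kept.
keep_occurrence <- function(x, keep) {
  nms <- names(x)
  occ <- rowid(nms)                    # 1, 2, ... within each repeated name
  n   <- as.integer(table(nms)[nms])   # how many times each name appears
  x[, n == 1L | occ %in% keep, with = FALSE]
}

# Small made-up table in which the name "a" is used twice.
x <- data.table(a = 1:2, b = 3:4, a = 5:6, check.names = FALSE)

# Keep the 2nd occurrence of "a"; the unique column "b" is kept as well.
res <- keep_occurrence(x, 2)
names(res)
# [1] "b" "a"
```

Calling keep_occurrence(dt, 2) on the table from the question should then give the same columns as the "keep the 2nd occurrence" line above.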