Remove the first occurrence of duplicate column names in a data.table
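What is the most concise way to remove columns with duplicate names while keeping the second occurrence of each (in other words, dropping the first occurrence)?
Given: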
library(data.table)
dt <- structure(list(CERT_NUMBER = c(999, NA, NA), FORENAME = c("JOHN",
NA, NA), SURNAME = c("JOHNSON", NA, NA), START_DATE = structure(c(16801L,
NA, NA), class = c("IDate", "Date")), EXPIRY_DATE = structure(c(17166L,
NA, NA), class = c("IDate", "Date")), ID = c(1, 2, 3), FORENAME = c("JOHN",
"JACK", "ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"),
MONTH = structure(c(16953L, 16953L, 16953L), class = c("IDate",
"Date"))), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
dt
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3      ROB ROBINSON 2016-06-01
I want to keep the second occurrence of each duplicated column name:
#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01
If we did not care which of the duplicates is kept, we could do the following, but it keeps the first occurrence, which is not what I want:
dt[, .SD, .SDcols = unique(names(dt))]
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3 2016-06-01
Thanks.
If each duplicated column name appears only twice, you can use duplicated() with the fromLast = TRUE argument:
dt[, .SD, .SDcols = ! duplicated(colnames(dt),fromLast=TRUE)]
#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01
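To see why this works, and why it only matches "keep the second occurrence" when a name appears exactly twice, it may help to look at the logical mask on its own. The sketch below uses base R only; the vector nms copies the column names from the question, while three_times is a made-up vector added purely for illustration.

```r
# Column names from the question; FORENAME and SURNAME each appear twice.
nms <- c("CERT_NUMBER", "FORENAME", "SURNAME", "START_DATE", "EXPIRY_DATE",
         "ID", "FORENAME", "SURNAME", "MONTH")

# With fromLast = TRUE the scan runs right to left, so TRUE marks every
# name that occurs again further to the right.
dup_from_last <- duplicated(nms, fromLast = TRUE)

# Negating the mask keeps the last occurrence of each name.
keep <- !dup_from_last
which(keep)   # positions of the columns that survive
nms[keep]     # their names, in the original order

# Hypothetical names where "A" appears three times: only the LAST "A"
# (position 4) survives, not the second one (position 3).
three_times <- c("A", "B", "A", "A")
which(!duplicated(three_times, fromLast = TRUE))
```

So with exactly two occurrences, "last" and "second" coincide; with three or more, this approach keeps the last occurrence.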
Here is a more flexible approach:
g <- as.integer(ave(names(dt), names(dt), FUN = length))
# for duplicated column names, keep the 1st occurrence
dt[, g == 1 | (rowid(names(dt)) == 1), with = FALSE]
# keep the 2nd occurrence
dt[, g == 1 | (rowid(names(dt)) == 2), with = FALSE]
# keep the 2nd and 3rd occurrences
dt[, g == 1 | (rowid(names(dt)) %in% c(2, 3)), with = FALSE]
# keep the last occurrence
dt[, g == rowid(names(dt)), with = FALSE]
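The selectors above combine two per-column quantities: g, how many times a name occurs in total, and the occurrence index of each column within its name. The following sketch prints both for a small, hypothetical name vector (not from the question) in which "A" appears three times, so that "second occurrence" and "last occurrence" give different results. It assumes that rowid(names(dt)) returns the running index within each name, and reproduces that index with base R's ave() so the snippet runs without data.table.

```r
# Hypothetical column names: "A" three times, "B" twice, "C" once.
nms <- c("A", "B", "A", "C", "A", "B")

# Total number of occurrences of each name, aligned with nms.
g <- as.integer(ave(nms, nms, FUN = length))

# Running occurrence index within each name; a base R stand-in for
# data.table's rowid(nms), which is assumed to give the same values.
occ <- ave(seq_along(nms), nms, FUN = seq_along)

g     # 3 2 3 1 3 2
occ   # 1 1 2 1 3 2

# Keep unique names plus the 2nd occurrence of duplicated ones.
keep_second <- g == 1 | occ == 2
which(keep_second)   # 3 4 6

# Keep the last occurrence: the index equals the total count.
keep_last <- g == occ
which(keep_last)     # 4 5 6
```

Either logical vector can then be used as the column selector in dt[, ..., with = FALSE], as in the answer's code.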