Removing duplicate columns in a table when there is more than one duplicate set of columns
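I know how to remove duplicate columns when there are only two duplicated blocks, but my real data has 3 or more. I have put together some toy example datasets that contain one extra set of duplicated column names, which I want to collapse. Is there a similarly simple way to solve these with dplyr and tidyr?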
Easier case:
structure(list(x = c("a", "a", NA, "a", "a", NA, "a"), y = c(1,
5, NA, 15, 19, NA, 27), z = c(2, 6, NA, 16, 20, NA, 28), x.1 = c("b",
"b", "b", "b", "b", "b", "b"), y.1 = c(3, 7, 11, 17, 21, 23,
29), z.1 = c(4, 8, 12, 18, 22, 24, 30), x.2 = c(NA, NA, "a",
NA, NA, "a", NA), y.2 = c(NA, NA, 13, NA, NA, 25, NA), z.2 = c(NA,
NA, 14, NA, NA, 26, NA)), .Names = c("x", "y", "z", "x.1", "y.1",
"z.1", "x.2", "y.2", "z.2"), row.names = c(NA, -7L), class = "data.frame")
In R this looks like:
x y z x.1 y.1 z.1 x.2 y.2 z.2
1 a 1 2 b 3 4 <NA> NA NA
2 a 5 6 b 7 8 <NA> NA NA
3 <NA> NA NA b 11 12 a 13 14
4 a 15 16 b 17 18 <NA> NA NA
5 a 19 20 b 21 22 <NA> NA NA
6 <NA> NA NA b 23 24 a 25 26
7 a 27 28 b 29 30 <NA> NA NA
What it should look like after dplyr:
x y z x.1 y.1 z.1
1 a 1 2 b 3 4
2 a 5 6 b 7 8
3 a 13 14 b 11 12
4 a 15 16 b 17 18
5 a 19 20 b 21 22
6 a 25 26 b 23 24
7 a 27 28 b 29 30
Harder case:
structure(list(x = c("a", "b", NA, "a", "a", NA, "a"), y = c(1,
7, 9, 15, 19, NA, 27), z = c(2, 8, 10, 16, 20, NA, 28), x.1 = c("b",
NA, "b", "b", "b", "b", "b"), y.1 = c(3, NA, 11, 17, 21, 23,
29), z.1 = c(4, NA, 12, 18, 22, 24, 30), x.2 = c(NA, "a", "a",
NA, NA, "a", NA), y.2 = c(NA, 5, 13, NA, NA, 25, NA), z.2 = c(NA,
6, 14, NA, NA, 26, NA)), .Names = c("x", "y", "z", "x.1", "y.1",
"z.1", "x.2", "y.2", "z.2"), row.names = c(NA, -7L), class = "data.frame")
In R this looks like:
x y z x.1 y.1 z.1 x.2 y.2 z.2
1 a 1 2 b 3 4 <NA> NA NA
2 b 7 8 <NA> NA NA a 5 6
3 <NA> 9 10 b 11 12 a 13 14
4 a 15 16 b 17 18 <NA> NA NA
5 a 19 20 b 21 22 <NA> NA NA
6 <NA> NA NA b 23 24 a 25 26
7 a 27 28 b 29 30 <NA> NA NA
What it should look like after dplyr:
x y z x.1 y.1 z.1
1 a 1 2 b 3 4
2 a 5 6 b 7 8
3 a 13 14 b 11 12
4 a 15 16 b 17 18
5 a 19 20 b 21 22
6 a 25 26 b 23 24
7 a 27 28 b 29 30
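In both cases the output data frame should have two sets of columns, the first set (x, y, z) and the second set (x.1, y.1, z.1).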
Thanks for your help!
This works as a static fix, but depending on the number of duplicates you could turn it into a function to make it more dynamic.
# Method One (works when you have true duplicates, e.g. from some join methods)
# Keep only the last column of each set of identically named columns,
# which is what dropping the first match does when a name occurs twice.
# This avoids deleting columns while looping over their positions.
df <- df[!duplicated(colnames(df), fromLast = TRUE)]
# Method Two
# Drop every column whose name ends in ".2" (the third copy of each block).
# The values in those columns are discarded, not merged into the other blocks.
df <- df[!grepl("\\.2$", colnames(df))]
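One way the suffix-based removal could be wrapped in a function, as suggested above. This is only a sketch: the helper name drop_suffixed and its keep_up_to parameter are made up for illustration, and it assumes duplicate copies are marked by a numeric suffix after the final dot (x.1, x.2, ...), as in the question's data. Like Method Two, it discards the extra columns and does not merge their values.

```r
# Hypothetical helper: keep columns with no numeric suffix, or with a
# suffix no larger than keep_up_to; drop all higher-numbered copies.
drop_suffixed <- function(d, keep_up_to = 1) {
  # Text after the final dot, converted to integer (NA when not a number)
  suffix <- suppressWarnings(as.integer(sub(".*\\.", "", colnames(d))))
  keep <- is.na(suffix) | suffix <= keep_up_to
  d[keep]
}

# Tiny demo data frame (illustrative values only)
demo <- data.frame(x = 1, x.1 = 2, x.2 = 3, x.3 = 4)
trimmed <- drop_suffixed(demo)
colnames(trimmed)
```

Raising keep_up_to keeps more copies, so the same call works whether the data has three duplicated blocks or more.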
Both cases are simple indexing problems.
First case (the simplest)
indx <- is.na(df$x)
df[indx, 1:3] <- df[indx, 7:9]
df[1:6]
# x y z x.1 y.1 z.1
# 1 a 1 2 b 3 4
# 2 a 5 6 b 7 8
# 3 a 13 14 b 11 12
# 4 a 15 16 b 17 18
# 5 a 19 20 b 21 22
# 6 a 25 26 b 23 24
# 7 a 27 28 b 29 30
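The same fill can be done by column name instead of by position, which may be easier to reuse when the column order differs. This is a sketch for the easier case only: the helper name fill_from_suffix and its arguments are invented here, and it fills each column independently wherever that column is NA, so it is not suitable for the harder case, where a row can be partly filled.

```r
# Easier-case data from the question, rebuilt with data.frame()
df <- data.frame(
  x   = c("a", "a", NA, "a", "a", NA, "a"),
  y   = c(1, 5, NA, 15, 19, NA, 27),
  z   = c(2, 6, NA, 16, 20, NA, 28),
  x.1 = rep("b", 7),
  y.1 = c(3, 7, 11, 17, 21, 23, 29),
  z.1 = c(4, 8, 12, 18, 22, 24, 30),
  x.2 = c(NA, NA, "a", NA, NA, "a", NA),
  y.2 = c(NA, NA, 13, NA, NA, 25, NA),
  z.2 = c(NA, NA, 14, NA, NA, 26, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NAs in the base columns from the columns that
# carry the given suffix, then drop the suffixed columns.
fill_from_suffix <- function(d, base = c("x", "y", "z"), suffix = ".2") {
  for (nm in base) {
    alt <- d[[paste0(nm, suffix)]]
    missing <- is.na(d[[nm]])
    d[[nm]][missing] <- alt[missing]
  }
  d[!endsWith(colnames(d), suffix)]
}

res <- fill_from_suffix(df)
res
```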
Second case (a bit harder)
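indx <- 1:3
# rows whose second block (columns 4-6) has missing values
indx2 <- as.logical(rowSums(is.na(df2[indx + 3])))
# rows whose first block must come from the third block: either the first block
# has missing values, or it is about to be moved into the second block
indx3 <- indx2 | as.logical(rowSums(is.na(df2[indx])))
df2[indx2, indx + 3] <- df2[indx2, indx]
df2[indx3, indx] <- df2[indx3, indx + 6]
df2[1:6]
# x y z x.1 y.1 z.1
# 1 a 1 2 b 3 4
# 2 a 5 6 b 7 8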
# 3 a 13 14 b 11 12
# 4 a 15 16 b 17 18
# 5 a 19 20 b 21 22
# 6 a 25 26 b 23 24
# 7 a 27 28 b 29 30
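With 3 or more blocks, hard-coded positions get unwieldy. Below is a rough base-R sketch of a more general approach, not the method used above: it treats the first column of every block as a key ("a" or "b" here) and routes each block to the output block for its key. The function name collapse_blocks and the block_size argument are assumptions for illustration. It also assumes all blocks have the same layout, that a block with an NA key can be discarded, and that a key appears at most once per row.

```r
# Harder-case data from the question, rebuilt with data.frame()
df2 <- data.frame(
  x   = c("a", "b", NA, "a", "a", NA, "a"),
  y   = c(1, 7, 9, 15, 19, NA, 27),
  z   = c(2, 8, 10, 16, 20, NA, 28),
  x.1 = c("b", NA, "b", "b", "b", "b", "b"),
  y.1 = c(3, NA, 11, 17, 21, 23, 29),
  z.1 = c(4, NA, 12, 18, 22, 24, 30),
  x.2 = c(NA, "a", "a", NA, NA, "a", NA),
  y.2 = c(NA, 5, 13, NA, NA, 25, NA),
  z.2 = c(NA, 6, 14, NA, NA, 26, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output block per distinct key value
collapse_blocks <- function(d, block_size = 3) {
  stopifnot(ncol(d) %% block_size == 0)
  starts <- seq(1, ncol(d), by = block_size)  # key column of each block
  keys <- sort(unique(unlist(d[starts], use.names = FALSE)))  # sort() drops NA
  blocks <- lapply(keys, function(k) {
    # Template with the first block's names and column types, all NA
    res <- d[seq_len(block_size)]
    res[] <- lapply(res, function(col) { col[] <- NA; col })
    for (s in starts) {
      cols <- s + seq_len(block_size) - 1
      hit <- !is.na(d[[s]]) & d[[s]] == k  # rows where this block holds key k
      if (any(hit)) res[hit, seq_len(block_size)] <- d[hit, cols]
    }
    res
  })
  out <- do.call(cbind, blocks)
  colnames(out) <- make.unique(colnames(out))  # x, y, z, x.1, y.1, z.1, ...
  out
}

collapsed <- collapse_blocks(df2)
collapsed
```

Because blocks are matched on the key value and not on which cells happen to be NA, the same call should handle both the easier and the harder example, and it does not depend on how many duplicated blocks there are.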