R: delete duplicate rows that have been flipped
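Rows 1 and 4 contain the same information. The only difference is that the columns in which the two counties appear are flipped.
I already know from row 1 that Yuma County and Cheyenne County are neighbors, so I do not need row 4 to restate it.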
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
4 Cheyenne County, KS 20023 Yuma County, CO 8125
5 Cheyenne County, KS 20023 Dundy County, NE 31057
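I don't mind a county appearing more than once; I only care that each row carries information that no earlier row already has. I want to keep row 1 and drop row 4, so that the result looks like this: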
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
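How can I remove rows that contain duplicated information from the dataset?
We can use interaction to generate a unique factor after finding, for each row, the "smaller" name (i.e. the one that comes first alphabetically) and the "larger" name. We can then filter the data.frame based on that factor: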
CountyList <- read.table(text="countyname fipscounty neighborname fipsneighbor
1 'Yuma County, CO' 8125 'Cheyenne County, KS' 20023
2 'Yuma County, CO' 8125 'Chase County, NE' 31029
3 'Cheyenne County, KS' 20023 'Kit Carson County, CO' 8063
4 'Cheyenne County, KS' 20023 'Yuma County, CO' 8125
5 'Cheyenne County, KS' 20023 'Dundy County, NE' 31057",
  stringsAsFactors = FALSE) # keep names as character so pmin()/pmax() work (the default since R 4.0.0)
fname <- pmin(CountyList$countyname, CountyList$neighborname) # alphabetically first name in each row
lname <- pmax(CountyList$countyname, CountyList$neighborname) # alphabetically last name in each row
duplicate.key <- as.numeric(interaction(fname, lname)) # one numeric code per unordered pair of names
CountyList[match(unique(duplicate.key), duplicate.key), ] # keep only the first occurrence of each key
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
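To see why the order-independent key works, it can help to print it. The following is only a minimal sketch of the same idea: it swaps interaction for a plain paste plus duplicated, and the names first, second, key and deduped are my own illustrative choices rather than part of the answer above.

```r
# Same example data, built directly as a data.frame
CountyList <- data.frame(
  countyname   = c("Yuma County, CO", "Yuma County, CO", "Cheyenne County, KS",
                   "Cheyenne County, KS", "Cheyenne County, KS"),
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  neighborname = c("Cheyenne County, KS", "Chase County, NE", "Kit Carson County, CO",
                   "Yuma County, CO", "Dundy County, NE"),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller name first, so (A, B) and (B, A) give the same key
first  <- pmin(CountyList$countyname, CountyList$neighborname)
second <- pmax(CountyList$countyname, CountyList$neighborname)
key    <- paste(first, second, sep = " | ")
print(key)

# duplicated() flags every repeat of a key after its first appearance
deduped <- CountyList[!duplicated(key), ]
print(deduped)
```

Rows 1 and 4 end up with an identical key, which is why only the later one is dropped.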
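Here is a tidyverse approach.
First, unite all columns into new_col (i.e. paste all the columns together). Then split new_col back into its parts and sort them, saving the result to new_col2. Next, keep only the rows with distinct values of new_col2. Finally, remove the newly created columns.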
library(tidyverse)
df %>%
  unite("new_col", everything(), sep = "_", remove = FALSE) %>%
  rowwise() %>%
  mutate(new_col2 = paste(sort(str_split(new_col, "_", simplify = TRUE)), collapse = "")) %>%
  ungroup() %>%
  distinct(new_col2, .keep_all = TRUE) %>%
  select(-starts_with("new_col"))
# A tibble: 4 × 4
countyname fipscounty neighborname fipsneighbor
<chr> <int> <chr> <int>
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
4 Cheyenne County, KS 20023 Dundy County, NE 31057
Data
df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO",
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS",
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO",
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L,
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))
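A variant that avoids pasting and re-splitting strings (which could misbehave if a value ever contained the separator) is to build the unordered pair from the two FIPS codes directly. This is a hedged sketch rather than the answer's own method: it assumes dplyr is installed, and the helper columns lo and hi as well as the name result are hypothetical.

```r
library(dplyr)

df <- data.frame(
  countyname   = c("Yuma County, CO", "Yuma County, CO", "Cheyenne County, KS",
                   "Cheyenne County, KS", "Cheyenne County, KS"),
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  neighborname = c("Cheyenne County, KS", "Chase County, NE", "Kit Carson County, CO",
                   "Yuma County, CO", "Dundy County, NE"),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L),
  stringsAsFactors = FALSE
)

result <- df %>%
  # lo/hi hold the smaller and larger FIPS code of each row, whatever the column order
  mutate(lo = pmin(fipscounty, fipsneighbor),
         hi = pmax(fipscounty, fipsneighbor)) %>%
  # keep the first row seen for each unordered pair
  distinct(lo, hi, .keep_all = TRUE) %>%
  # drop the helper columns again
  select(-lo, -hi)

print(result)
```

Keying on the numeric codes also means two counties that happen to share a name are not confused with each other.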
You could also do this:
idx <- duplicated(t(apply(CountyList[c('fipscounty', 'fipsneighbor')], 1, sort)))
CountyList[!idx, ]
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
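If this is needed for more than one table, the two lines above can be wrapped in a small function. The sketch below is one possible way to do that; the function name drop_flipped, its arguments and the edges data are all hypothetical, and it assumes the two key columns contain no missing values (sort drops NA, which would break the row-wise sort).

```r
# Hypothetical helper: drop rows whose (col_a, col_b) pair already appeared,
# in either order, in an earlier row.
drop_flipped <- function(data, col_a, col_b) {
  # Sort the two key values within each row; apply() returns them as columns,
  # so transpose back to one row per input row
  pair <- t(apply(data[c(col_a, col_b)], 1, sort))
  # duplicated() on a matrix compares whole rows
  data[!duplicated(pair), , drop = FALSE]
}

edges <- data.frame(
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L)
)

out <- drop_flipped(edges, "fipscounty", "fipsneighbor")
print(out)
```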
Here is another possible base R option:
df[!duplicated(t(apply(df, 1, sort))),]
Output
countyname fipscounty neighborname fipsneighbor
1 Yuma County, CO 8125 Cheyenne County, KS 20023
2 Yuma County, CO 8125 Chase County, NE 31029
3 Cheyenne County, KS 20023 Kit Carson County, CO 8063
5 Cheyenne County, KS 20023 Dundy County, NE 31057
Data
df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO",
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS",
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO",
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L,
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))
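Whichever approach is used, it may be worth checking that no flipped pair survives. The following is a rough sketch of such a check; the function name has_flipped and the pairs data are illustrative assumptions, not something taken from the answers.

```r
# Hypothetical check: TRUE if some row's (a, b) pair also appears as (b, a) in another row
has_flipped <- function(data, col_a, col_b) {
  forward  <- paste(data[[col_a]], data[[col_b]])
  backward <- paste(data[[col_b]], data[[col_a]])
  # Ignore rows where both values are equal, since those are their own mirror image
  any(backward %in% forward & forward != backward)
}

pairs <- data.frame(
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L)
)

before <- has_flipped(pairs, "fipscounty", "fipsneighbor")
# Remove row 4, the flipped copy of row 1, then check again
after  <- has_flipped(pairs[-4, ], "fipscounty", "fipsneighbor")
print(c(before = before, after = after))
```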