R: delete duplicate rows that have been flipped

Question

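Rows 1 and 4 contain the same information. The only difference is that the columns in which the values appear have been flipped.

I already know from row 1 that Yuma County and Cheyenne County are neighbors. I do not need that information restated in row 4.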

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
4 Cheyenne County, KS      20023       Yuma County, CO         8125
5 Cheyenne County, KS      20023      Dundy County, NE        31057

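I don't mind a county appearing more than once; I only care that the overall information in each row differs from every earlier row. I want to keep row 1 and delete row 4, so that the result looks like this: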

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

How can I delete the rows in a data set that contain duplicated information?

Answer 1

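We can use interaction to build a unique factor after finding the "smaller" (i.e., alphabetically first) and the "larger" of the two names in each row. We can then filter the data.frame based on that key: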

CountyList <- read.table(text="countyname fipscounty          neighborname fipsneighbor
1     'Yuma County, CO'       8125   'Cheyenne County, KS'        20023
2     'Yuma County, CO'       8125      'Chase County, NE'        31029
3 'Cheyenne County, KS'      20023 'Kit Carson County, CO'         8063
4 'Cheyenne County, KS'      20023       'Yuma County, CO'         8125
5 'Cheyenne County, KS'      20023      'Dundy County, NE'        31057", stringsAsFactors = FALSE)


fname <- pmin(CountyList$countyname, CountyList$neighborname) # Alphabetically first name in each row
lname <- pmax(CountyList$countyname, CountyList$neighborname) # Alphabetically last name in each row

duplicate.key <- as.numeric(interaction(fname, lname)) # Build a factor from each name pair and convert it to numeric codes

CountyList[match(unique(duplicate.key), duplicate.key), ] # Keep only the first occurrence of each key


           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057
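
The same "smaller first, larger second" idea can also be applied to the numeric FIPS codes instead of the names. The following is only a minimal sketch, assuming that fipscounty and fipsneighbor identify each county uniquely; the names lo, hi, key and result are illustrative and do not come from the answer above.

```r
# Rebuild the example data so the sketch is self-contained.
CountyList <- data.frame(
  countyname   = c("Yuma County, CO", "Yuma County, CO", "Cheyenne County, KS",
                   "Cheyenne County, KS", "Cheyenne County, KS"),
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  neighborname = c("Cheyenne County, KS", "Chase County, NE", "Kit Carson County, CO",
                   "Yuma County, CO", "Dundy County, NE"),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L)
)

# Order-independent key: smaller code first, larger code second.
lo  <- pmin(CountyList$fipscounty, CountyList$fipsneighbor)
hi  <- pmax(CountyList$fipscounty, CountyList$fipsneighbor)
key <- paste(lo, hi, sep = "-")

# duplicated() flags every repeat after the first occurrence of a key.
result <- CountyList[!duplicated(key), ]
print(result)
```

Because pmin and pmax are vectorized, this avoids a row-by-row loop, and rows 1 and 4 both receive the key "8125-20023".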

Answer 2

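Here is a tidyverse approach.

First, unite all columns into new_col (i.e., paste all columns together). Then split new_col back into its parts, sort them, paste them back together, and store the result in new_col2. Next, keep only the rows that are distinct in new_col2. Finally, drop the newly created columns.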

library(tidyverse)

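df %>% 
  # Paste all columns of a row into one string, keeping the originals
  unite("new_col", everything(), sep = "_", remove = FALSE) %>% 
  rowwise() %>% 
  # Split the string again, sort the parts, and paste them into an order-independent key
  mutate(new_col2 = paste(sort(str_split(new_col, "_", simplify = TRUE)), collapse = "")) %>% 
  ungroup() %>% 
  # Keep the first row for each key, then drop the helper columns
  distinct(new_col2, .keep_all = TRUE) %>% 
  select(-starts_with("new_col"))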

# A tibble: 4 × 4
  countyname          fipscounty neighborname          fipsneighbor
  <chr>                    <int> <chr>                        <int>
1 Yuma County, CO           8125 Cheyenne County, KS          20023
2 Yuma County, CO           8125 Chase County, NE             31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
4 Cheyenne County, KS      20023 Dundy County, NE             31057

Data

df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO", 
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS", 
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO", 
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L, 
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))
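
To see why rows 1 and 4 collapse to the same new_col2 value, the key-building step can be tried on its own. This is a minimal base R sketch rather than part of the pipeline above: make_key is a hypothetical helper, and the row values are typed in by hand as character strings.

```r
# Values of rows 1 and 4, written out as character vectors.
row1 <- c("Yuma County, CO", "8125", "Cheyenne County, KS", "20023")
row4 <- c("Cheyenne County, KS", "20023", "Yuma County, CO", "8125")

# Hypothetical helper: sort the parts, then paste them into a single key,
# mirroring the sort() and paste(collapse = "") step of the pipeline.
make_key <- function(values) paste(sort(values), collapse = "")

key1 <- make_key(row1)
key4 <- make_key(row4)

# Both rows hold the same set of values, so their keys match.
same_key <- identical(key1, key4)
print(same_key)
```

Sorting removes the effect of column order, which is what lets distinct treat the flipped row as a duplicate.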

Answer 3

You can also do it this way:

idx <- duplicated(t(apply(CountyList[c('fipscounty', 'fipsneighbor')], 1, sort)))
CountyList[!idx, ]

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057
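
It can help to look at the intermediate objects in this one-liner. The sketch below is only an illustration, assuming the two FIPS columns from the question; fips and sorted_pairs are names made up for this example.

```r
# Only the two code columns matter for this approach.
fips <- data.frame(
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L)
)

# apply(..., 1, sort) sorts each row and returns the rows as columns,
# so t() turns the result back into one sorted pair per row.
sorted_pairs <- t(apply(fips, 1, sort))
print(sorted_pairs)

# For a matrix, duplicated() compares whole rows.
idx <- duplicated(sorted_pairs)
print(idx)
```

After sorting, rows 1 and 4 both become the pair 8125, 20023, so idx is TRUE only for row 4.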

Answer 4

Here is another possible base R option:

df[!duplicated(t(apply(df, 1, sort))),]

Output

           countyname fipscounty          neighborname fipsneighbor
1     Yuma County, CO       8125   Cheyenne County, KS        20023
2     Yuma County, CO       8125      Chase County, NE        31029
3 Cheyenne County, KS      20023 Kit Carson County, CO         8063
5 Cheyenne County, KS      20023      Dundy County, NE        31057

Data

df <- structure(list(countyname = c("Yuma County, CO", "Yuma County, CO", 
"Cheyenne County, KS", "Cheyenne County, KS", "Cheyenne County, KS"
), fipscounty = c(8125L, 8125L, 20023L, 20023L, 20023L), neighborname = c("Cheyenne County, KS", 
"Chase County, NE", "Kit Carson County, CO", "Yuma County, CO", 
"Dundy County, NE"), fipsneighbor = c(20023L, 31029L, 8063L, 
8125L, 31057L)), class = "data.frame", row.names = c(NA, -5L))
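
One detail worth keeping in mind with this option: apply first converts the data frame to a matrix, and because the columns have mixed types, every value becomes a character string before it is sorted. The sketch below only illustrates that behavior on the example data; m, sorted_rows and result are illustrative names.

```r
df <- data.frame(
  countyname   = c("Yuma County, CO", "Yuma County, CO", "Cheyenne County, KS",
                   "Cheyenne County, KS", "Cheyenne County, KS"),
  fipscounty   = c(8125L, 8125L, 20023L, 20023L, 20023L),
  neighborname = c("Cheyenne County, KS", "Chase County, NE", "Kit Carson County, CO",
                   "Yuma County, CO", "Dundy County, NE"),
  fipsneighbor = c(20023L, 31029L, 8063L, 8125L, 31057L)
)

# Mixed character and integer columns give a character matrix.
m <- as.matrix(df)
print(is.character(m))

# Sort all four values of each row, then drop rows whose sorted values repeat.
sorted_rows <- t(apply(df, 1, sort))
result <- df[!duplicated(sorted_rows), ]
print(result)
```

Since all columns take part in the comparison, two rows count as duplicates whenever they hold the same set of values in any column order.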
