How to remove rows that have repeated elements?
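I have a data frame that looks like this (but covering every county in the US):
county | state | neighbor_county | neighbor_state |
---|---|---|---|
Baldwin County | AL | Clarke County | NA |
Baldwin County | AL | Escambia County | FL |
Baldwin County | AL | Mobile County | NA |
Baldwin County | AL | Monroe County | NA |
Barbour County | AL | Dale County | NA |
Barbour County | AL | Henry County | NA |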
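I am only interested in the states that neighbor each county, so I want to remove the duplicated rows to get this (step 1):
county | state | neighbor_state |
---|---|---|
Baldwin County | AL | NA |
Baldwin County | AL | FL |
Barbour County | AL | NA |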
Then reshape the data frame like this (step 2):
county | state | neighbor_state_1 | neighbor_state_2 | neighbor_state_3 |
---|---|---|---|---|
Baldwin County | AL | FL | NA | NA |
Baldwin County | AL | NA | NA | NA |
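In step 1, I dropped the "neighbor_county" column; however, I did not manage to remove the duplicates in the "neighbor_state" column within each distinct county. I tried the unique function, but I could not get it to remove duplicates only within each county.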
For the first step, you can remove the neighbor_county column and use unique():
df$neighbor_county <- NULL
unique(df)
returns
county state neighbor_state
1 Baldwin_County AL NA
2 Baldwin_County AL FL
5 Barbour_County AL NA
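A minimal sketch of why this works, using a small stand-in data frame (the values are copied from the question's table, and treating NA as a real missing value is an assumption). On the full frame, unique() keeps every row because neighbor_county differs in each one; once that column is left out, whole-row de-duplication gives one row per county and neighboring state.

```r
# Stand-in for the question's data; NA is assumed to be a real missing value
df <- data.frame(
  county          = c("Baldwin_County", "Baldwin_County", "Baldwin_County",
                      "Baldwin_County", "Barbour_County", "Barbour_County"),
  state           = "AL",
  neighbor_county = c("Clarke_County", "Escambia_County", "Mobile_County",
                      "Monroe_County", "Dale_County", "Henry_County"),
  neighbor_state  = c(NA, "FL", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# With neighbor_county present, no two rows are identical
nrow(unique(df))  # 6

# Keep only the columns of interest, then de-duplicate whole rows
step1 <- unique(df[, c("county", "state", "neighbor_state")])
step1
```

Selecting the columns by name leaves df itself unchanged, which can be convenient if neighbor_county is still needed later.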
An alternative using dplyr:
library(dplyr)
# start from the original df, with neighbor_county still present
df %>%
  select(-neighbor_county) %>%
  distinct()
For your second step, here is a suggestion:
library(tidyr)
library(dplyr)
df %>%
  group_by(county) %>%
  select(-neighbor_county) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, names_prefix = "neighbor_state_", values_from = neighbor_state) %>%
  ungroup()
returns
# A tibble: 2 x 6
county state neighbor_state_1 neighbor_state_2 neighbor_state_3 neighbor_state_4
<chr> <chr> <chr> <chr> <chr> <chr>
1 Baldwin_County AL 'NA' 'FL' 'NA' 'NA'
2 Barbour_County AL 'NA' 'NA' NA NA
But I am not sure whether this is what you are looking for.
To remove the duplicated NA values, you could use:
df %>%
  group_by(county) %>%
  select(-neighbor_county) %>%
  distinct() %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = n, names_prefix = "neighbor_state_", values_from = neighbor_state) %>%
  ungroup()
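As a self-contained sketch of the same idea, the block below rebuilds a stand-in for the sample data and runs the de-duplicate-then-widen steps end to end. Two parts are assumptions rather than something stated in the question: the quoted "'NA'" strings in the sample data are treated as missing values, and arrange() is added so that a real state such as FL lands in neighbor_state_1, as in the desired step 2 layout.

```r
library(dplyr)
library(tidyr)

# Stand-in for the posted sample data; neighbor_state keeps the literal
# quotes seen in the dput() output ("'NA'", "'FL'")
df <- data.frame(
  county          = rep(c("Baldwin_County", "Barbour_County"), times = c(4, 2)),
  state           = "AL",
  neighbor_county = c("Clarke_County", "Escambia_County", "Mobile_County",
                      "Monroe_County", "Dale_County", "Henry_County"),
  neighbor_state  = c("'NA'", "'FL'", "'NA'", "'NA'", "'NA'", "'NA'"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Assumption: the quoted "'NA'" entries are meant to be missing values
  mutate(neighbor_state = gsub("'", "", neighbor_state, fixed = TRUE),
         neighbor_state = na_if(neighbor_state, "NA")) %>%
  select(-neighbor_county) %>%
  distinct() %>%                           # one row per county / neighboring state
  arrange(county, neighbor_state) %>%      # arrange() sorts NA last
  group_by(county) %>%
  mutate(n = row_number()) %>%             # position of each state within its county
  ungroup() %>%
  pivot_wider(names_from = n, names_prefix = "neighbor_state_",
              values_from = neighbor_state)

wide
```

With this sample, the result has one row per county and two neighbor_state_ columns; counties with fewer distinct neighboring states are padded with NA by pivot_wider().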
Data
df <- structure(list(county = c("Baldwin_County", "Baldwin_County",
"Baldwin_County", "Baldwin_County", "Barbour_County", "Barbour_County"
), state = c("AL", "AL", "AL", "AL", "AL", "AL"), neighbor_county = c("Clarke_County",
"Escambia_County", "Mobile_County", "Monroe_County", "Dale_County",
"Henry_County"), neighbor_state = c("'NA'", "'FL'", "'NA'", "'NA'",
"'NA'", "'NA'")), problems = structure(list(row = 6L, col = "neighbor_state",
expected = "", actual = "embedded null", file = "literal data"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame")), class = "data.frame", row.names = c(NA,
-6L), spec = structure(list(cols = list(county = structure(list(), class = c("collector_character",
"collector")), state = structure(list(), class = c("collector_character",
"collector")), neighbor_county = structure(list(), class = c("collector_character",
"collector")), neighbor_state = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))