Coalescing/merging rows but retaining "dominant" values
My question seems trivial, but apparently I have not come up with the right search terms.
My data looks like this:
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("A","B","A","B","C"),
V2 = c("C","B",NA,"B","A"),
V3 = c("A","B","C","B",NA))
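I want to coalesce or merge the rows by ID, keeping only one row per ID that holds the "highest" value found in each column. In my example, C takes precedence over B, which takes precedence over A.
After the desired operation, my data would look like this: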
| ID | V1 | V2 | V3 |
| -- | -- | -- | -- |
| 1 | B | C | B |
| 2 | A | NA | C |
| 3 | C | B | B |
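Any hints are much appreciated! dplyr is preferred but not required. Thanks!
Edit: The solutions (thanks!) all rely on the fact that letters are "ordered" in R.
Let's take this example data instead: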
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("yes","no","yes","no","unsure"),
V2 = c("unsure","no",NA,"no","yes"),
V3 = c("yes","no","unsure","no",NA))
The desired result is that "yes" takes precedence over "no", which in turn takes precedence over "unsure".
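To illustrate why the alphabetical approach no longer fits this second example, here is a quick check in base R. It is only a sketch: string comparison in R depends on the locale, though for these three lowercase words the ordering should be the same in common locales.

```r
answers <- c("yes", "no", "unsure")

# Alphabetical order, lowest first
alphabetical <- sort(answers)
print(alphabetical)

# max() picks the alphabetically last string, so "unsure" beats "no",
# which is the opposite of the precedence wanted here
wrong_pick <- max(c("no", "unsure"))
print(wrong_pick)
```

So a plain `max` would rank the values as yes > unsure > no rather than yes > no > unsure.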
Edit: added a simpler dplyr-only approach
library(dplyr)
data %>%
group_by(ID) %>%
summarize(across(V1:V3, max))
# A tibble: 3 × 4
ID V1 V2 V3
<dbl> <chr> <chr> <chr>
1 1 B C B
2 2 A NA C
3 3 C B NA
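Note that `max` without `na.rm = TRUE` returns NA for any group that contains a missing value, which is why ID 3 shows NA in V3 above.
If you need ordered factors, here is one approach: specify the order, apply it to the data in V1:V3, then proceed as before.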
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("yes","no","yes","no","unsure"),
V2 = c("unsure","no",NA,"no","yes"),
V3 = c("yes","no","unsure","no",NA))
var_order <- c("yes", "no", "unsure")
# Note addition of `ordered = TRUE` to make the `min` work
data %>%
mutate(across(V1:V3, ~factor(.x, levels = var_order, ordered = TRUE))) %>%
group_by(ID) %>%
summarize(across(V1:V3, ~min(., na.rm = TRUE)))
# A tibble: 3 × 4
ID V1 V2 V3
<dbl> <ord> <ord> <ord>
1 1 yes no yes
2 2 yes NA unsure
3 3 no yes no
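As a small illustration of why `ordered = TRUE` matters, the sketch below compares `min` on an unordered and an ordered factor in base R. The vector `x` is made-up sample input, not taken from the data above.

```r
var_order <- c("yes", "no", "unsure")   # highest priority first
x <- c("no", "unsure", NA)              # illustrative input only

# Unordered factor: min() is not meaningful for plain factors and raises an error
unordered_x <- factor(x, levels = var_order)
res <- try(min(unordered_x, na.rm = TRUE), silent = TRUE)
failed <- inherits(res, "try-error")
print(failed)

# Ordered factor: min() returns the earliest level that is present
ordered_x <- factor(x, levels = var_order, ordered = TRUE)
picked <- as.character(min(ordered_x, na.rm = TRUE))
print(picked)
```

Because the levels are listed from highest to lowest priority, `min` (not `max`) selects the dominant value.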
An earlier solution using tidyr reshaping. This works without setting the ordered = TRUE
flag, but it is inefficient for larger datasets.
library(dplyr); library(tidyr)
data %>%
mutate(across(V1:V3, ~factor(.x, levels = var_order))) %>%
pivot_longer(-ID) %>%
group_by(ID, name) %>%
slice_min(value, with_ties = FALSE) %>%  # keep a single row even when values tie
ungroup() %>%
pivot_wider(names_from = name)
# A tibble: 3 × 4
ID V1 V2 V3
<dbl> <fct> <fct> <fct>
1 1 yes no yes
2 2 yes NA unsure
3 3 no yes no
Since we can take the maximum of letters in alphabetical order, we can use:
library(tidyverse)
data %>%
group_by(ID) %>%
summarize(across(everything(), ~ max(., na.rm = TRUE)))
This gives:
# A tibble: 3 x 4
ID V1 V2 V3
<dbl> <chr> <chr> <chr>
1 1 B C B
2 2 A <NA> C
3 3 C B B
Here is a solution in base R:
aggregate(data[,-1], by = list(ID=data$ID), FUN = max, na.rm = TRUE)
# ID V1 V2 V3
# 1 1 B C B
# 2 2 A <NA> C
# 3 3 C B B
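For a custom precedence such as yes > no > unsure, the same base R pattern can be adapted without factors. The following is a minimal sketch, not one of the posted answers: `dominant` is a hypothetical helper written for this illustration, and it returns NA for a group with no usable values instead of relying on how `max` or `min` treat an empty input.

```r
# Second example data from the question
data <- data.frame(ID = c(1,1,2,3,3),
                   V1 = c("yes","no","yes","no","unsure"),
                   V2 = c("unsure","no",NA,"no","yes"),
                   V3 = c("yes","no","unsure","no",NA))
var_order <- c("yes", "no", "unsure")   # highest priority first

# Hypothetical helper (not part of base R or dplyr)
dominant <- function(x, priority) {
  pos <- match(x, priority)                   # position in the priority list; NA if missing or unlisted
  if (all(is.na(pos))) return(NA_character_)  # nothing usable in this group
  priority[min(pos, na.rm = TRUE)]            # smallest position = highest priority
}

# Extra arguments to aggregate() are passed on to FUN
result <- aggregate(data[-1], by = list(ID = data$ID),
                    FUN = dominant, priority = var_order)
print(result)
```

The same helper should also work inside dplyr, for example `summarize(across(V1:V3, ~dominant(.x, var_order)))` after `group_by(ID)`.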