Merge rows by key and year with multiple columns in R
I have the following dataset:
df1 <- data.frame(
  "key"  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  "year" = c(2002, 2002, 2004, 2004, 2002, 2002, 2004, 2004, 2004, 2004),
  "Var1" = c(10, NA, 5, 5, 4, NA, NA, 3, 2, 2),
  "Var2" = c(1, 1, 3, 3, 2, NA, 3, NA, 1, NA),
  "Var3" = c(NA, 2, NA, NA, 5, 5, 3, NA, 2, NA),
  "Var4" = c(NA, 4, 5, 5, 6, NA, 4, NA, NA, NA))
I now want to merge the duplicated rows by key and year to get a dataset that looks like this:
df2 <- data.frame(
  "key"  = c(1, 1, 2, 2, 3),
  "year" = c(2002, 2004, 2002, 2004, 2004),
  "Var1" = c(10, 5, 4, 3, 2),
  "Var2" = c(1, 3, 2, 3, 1),
  "Var3" = c(2, NA, 5, 3, 2),
  "Var4" = c(4, 5, 6, 4, NA))
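The problem is that I have more than 30 columns and hundreds to thousands of rows, so this solution seems rather inconvenient: Merge rows within a dataframe by a key.
Any help would be greatly appreciated!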
You can group_by(key, year) and take the maximum of each column, ignoring NAs; a group that contains only NAs yields NA:
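library(dplyr)

df1 %>%
  group_by(key, year) %>%
  # Per group and column: NA if every value is missing, otherwise the max of the non-missing values
  summarise(across(everything(), ~ if (all(is.na(.x))) NA else max(.x, na.rm = TRUE)))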
# # A tibble: 5 x 6
# # Groups:   key [3]
#     key  year  Var1  Var2  Var3  Var4
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1  2002    10     1     2     4
# 2     1  2004     5     3    NA     5
# 3     2  2002     4     2     5     6
# 4     2  2004     3     3     3     4
# 5     3  2004     2     1     2    NA
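If some of the value columns are not numeric, or you would rather avoid the dplyr dependency, the same idea can be sketched in base R by keeping the first non-NA value per group instead of the maximum. This is a swapped-in variant, not the code from the answer above: `first_non_na` and `collapsed` are names made up for illustration, and the sketch assumes the non-NA values within each key/year group agree, as they do in the example data.

```r
df1 <- data.frame(
  key  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  year = c(2002, 2002, 2004, 2004, 2002, 2002, 2004, 2004, 2004, 2004),
  Var1 = c(10, NA, 5, 5, 4, NA, NA, 3, 2, 2),
  Var2 = c(1, 1, 3, 3, 2, NA, 3, NA, 1, NA),
  Var3 = c(NA, 2, NA, NA, 5, 5, 3, NA, 2, NA),
  Var4 = c(NA, 4, 5, 5, 6, NA, 4, NA, NA, NA))

# Hypothetical helper: first non-missing value, or NA if all values are missing.
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps rows that contain NAs; the formula method's
# default (na.omit) would drop them before aggregating.
collapsed <- aggregate(. ~ key + year, data = df1,
                       FUN = first_non_na, na.action = na.pass)

# aggregate() does not return rows in key-then-year order, so reorder them.
collapsed <- collapsed[order(collapsed$key, collapsed$year), ]
rownames(collapsed) <- NULL
collapsed
```

The `.` on the left-hand side of the formula stands for every column other than key and year, so nothing needs to change when there are 30 or more value columns.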
You can use fill() to fill in the missing values within each group, then distinct() to keep only the unique rows.
library(tidyverse)

df1 %>%
  group_by(key, year) %>%
  fill(Var1:Var4, .direction = "downup") %>%
  distinct() %>%
  ungroup()
# # A tibble: 5 × 6
#     key  year  Var1  Var2  Var3  Var4
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1  2002    10     1     2     4
# 2     1  2004     5     3    NA     5
# 3     2  2002     4     2     5     6
# 4     2  2004     3     3     3     4
# 5     3  2004     2     1     2    NA
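Note that fill() followed by distinct() only leaves one row per key/year when the non-NA values within a group agree; if two rows of a group hold different values in the same column, both rows survive. A minimal base R sketch of a check you could run on the raw data first. `find_conflicts` and `df_bad` are hypothetical names introduced here for illustration, not part of the answer above:

```r
df1 <- data.frame(
  key  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  year = c(2002, 2002, 2004, 2004, 2002, 2002, 2004, 2004, 2004, 2004),
  Var1 = c(10, NA, 5, 5, 4, NA, NA, 3, 2, 2),
  Var2 = c(1, 1, 3, 3, 2, NA, 3, NA, 1, NA),
  Var3 = c(NA, 2, NA, NA, 5, 5, 3, NA, 2, NA),
  Var4 = c(NA, 4, 5, 5, 6, NA, 4, NA, NA, NA))

# Hypothetical helper: return the key combinations in which at least one
# value column holds more than one distinct non-NA value.
find_conflicts <- function(df, keys) {
  value_cols <- setdiff(names(df), keys)
  n_distinct_non_na <- function(x) length(unique(x[!is.na(x)]))
  counts <- aggregate(df[value_cols], by = df[keys], FUN = n_distinct_non_na)
  has_conflict <- rowSums(counts[value_cols] > 1) > 0
  counts[has_conflict, keys, drop = FALSE]
}

# The example data has no conflicting groups.
no_conflicts <- find_conflicts(df1, c("key", "year"))

# Made-up variation: give key 1 / year 2002 two different Var1 values.
df_bad <- df1
df_bad$Var1[2] <- 99
conflicts <- find_conflicts(df_bad, c("key", "year"))
```

An empty result means every group can be collapsed to a single row; any rows that come back identify the groups that need a tie-breaking rule such as max() or first non-NA.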