通过变量 id 列对齐不同的观察值以消除 NA 值
Align different observations by variable id columns to eliminate NA values
我有一个类似于此格式的 R 数据框:
df<-data.frame(
Fruit=c("Apples","Apples","Apples","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Cherries","Cherries","Cherries"),
Order=c("A","A","A","B","B","B","B","B","B","B","B","B","B","B","B"),
Producer_ID=c("V","V","V","W","W","W","X","X","X","Y","Y","Y","Z","Z","Z"),
x1=c( 0.7975,NA,NA,0.570975,NA,NA,0.43374375,NA,NA,0.610624,NA,NA,0.6975,NA,NA),
y1=c( 200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA),
x2=c( NA,NA,0.82,NA,NA,0.51,NA,NA,0.415,NA,NA,0.542,NA,NA,0.62),
y2=c( NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200),
x3=c( NA,NA,0.82,NA,NA,0.505,NA,NA,0.4125,NA,NA,0.538,NA,NA,0.62),
y3=c( NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200),
x4=c( NA,0.82,NA,NA,NA,0.5,NA,NA,0.41,NA,NA,0.534,NA,NA,0.62),
y4=c( NA,300,NA,NA,NA,200,NA,NA,200,NA,NA,200,NA, NA,200),
x5=c( NA,0.82,NA,NA,NA,0.495,NA,NA,0.4075,NA,NA,0.53,NA,0.62,NA),
y5=c( NA,300,NA,NA,NA,200,NA,NA,200,NA,NA,200,NA,300,NA),
x6=c( NA,0.82,NA,NA,0.49,NA,NA,0.405,NA,NA,0.52,NA,NA,0.62,NA),
y6=c( NA, 300,NA,NA,300,NA,NA,300,NA,NA,300,NA,NA,300,NA)
)
输出:
> df
Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
1 Apples A V 0.7975000 200 NA NA NA NA NA NA NA NA NA NA
2 Apples A V NA NA NA NA NA NA 0.820 300 0.8200 300 0.820 300
3 Apples A V NA NA 0.820 200 0.8200 200 NA NA NA NA NA NA
4 Bananas B W 0.5709750 200 NA NA NA NA NA NA NA NA NA NA
5 Bananas B W NA NA NA NA NA NA NA NA NA NA 0.490 300
6 Bananas B W NA NA 0.510 200 0.5050 200 0.500 200 0.4950 200 NA NA
7 Bananas B X 0.4337437 200 NA NA NA NA NA NA NA NA NA NA
8 Bananas B X NA NA NA NA NA NA NA NA NA NA 0.405 300
9 Bananas B X NA NA 0.415 200 0.4125 200 0.410 200 0.4075 200 NA NA
10 Bananas B Y 0.6106240 200 NA NA NA NA NA NA NA NA NA NA
11 Bananas B Y NA NA NA NA NA NA NA NA NA NA 0.520 300
12 Bananas B Y NA NA 0.542 200 0.5380 200 0.534 200 0.5300 200 NA NA
13 Cherries B Z 0.6975000 200 NA NA NA NA NA NA NA NA NA NA
14 Cherries B Z NA NA NA NA NA NA NA NA 0.6200 300 0.620 300
15 Cherries B Z NA NA 0.620 200 0.6200 200 0.620 200 NA NA NA NA
对于标识符列“水果”、“订单”、“生产者”的每种组合,在剩余的每一列中恰好有一个非 NA 值。我想做的是整理我的数据框以消除 NA 值,将非 NA 值与其标识符对齐,如下所示:
df_final<-data.frame(
Fruit=c("Apples","Bananas","Bananas","Bananas","Cherries"),
Order=c("A","B","B","B","B"),
Producer_ID=c("V","W","X","Y","Z"),
x1=c(0.7975,0.570975,0.43374375,0.610624,0.6975),
y1=c(200,200,200,200,200),
x2=c(0.82,0.51,0.415,0.542,0.62),
y2=c(200,200,200,200,200),
x3=c(0.82,0.505,0.4125,0.538,0.62),
y3=c(200,200,200,200,200),
x4=c(0.82,0.5,0.41,0.534,0.62),
y4=c(300,200,200,200,200),
x5=c(0.82,0.495,0.4075,0.53,0.62),
y5=c(300,200,200,200,300),
x6=c(0.82,0.49,0.405,0.526,0.62),
y6=c(300,300,300,300,300)
)
最终输出:
> df_final
Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
1 Apples A V 0.7975000 200 0.820 200 0.8200 200 0.820 300 0.8200 300 0.820 300
2 Bananas B W 0.5709750 200 0.510 200 0.5050 200 0.500 200 0.4950 200 0.490 300
3 Bananas B X 0.4337437 200 0.415 200 0.4125 200 0.410 200 0.4075 200 0.405 300
4 Bananas B Y 0.6106240 200 0.542 200 0.5380 200 0.534 200 0.5300 200 0.526 300
5 Cherries B Z 0.6975000 200 0.620 200 0.6200 200 0.620 200 0.6200 300 0.620 300
由于一组(Fruit
、Order
和 Producer_ID
)的每一列中只有一个值,因此您可以取 sum
(或 max
) 每列忽略 NA
值。
在基础 R 中 aggregate
-
aggregate(.~Fruit+Order+Producer_ID, df, sum, na.rm = TRUE, na.action = NULL)
# Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
#1 Apples A V 0.7975000 200 0.820 200 0.8200 200 0.820 300 0.8200 300 0.820 300
#2 Bananas B W 0.5709750 200 0.510 200 0.5050 200 0.500 200 0.4950 200 0.490 300
#3 Bananas B X 0.4337437 200 0.415 200 0.4125 200 0.410 200 0.4075 200 0.405 300
#4 Bananas B Y 0.6106240 200 0.542 200 0.5380 200 0.534 200 0.5300 200 0.520 300
#5 Cherries B Z 0.6975000 200 0.620 200 0.6200 200 0.620 200 0.6200 300 0.620 300
或使用dplyr
-
library(dplyr)
df %>%
group_by(Fruit, Order ,Producer_ID) %>%
summarise(across(x1:y6, sum, na.rm = TRUE))
与@Ronak Shah 的 sum
类似,这里有一个带有 sort
的可选方法,用于聚合同一行中的所有非 NA 值(灵感来自
@r2evans )
df |>
mutate(across(everything(), ~ sort(., na.last = TRUE))) |>
na.omit()
另一种选择是使用 tidyr
重塑为长格式,摆脱 NA,最后重塑回宽格式。
注意:作为中间步骤,我使用 arrange(gsub("^.*?(\d)$", "\1", name))
重新排列数据,以便 wide df 中的列按照您想要的顺序排列。
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-c(Fruit, Order, Producer_ID)) %>%
drop_na() %>%
arrange(gsub("^.*?(\d)$", "\1", name)) %>%
pivot_wider(names_from = "name", values_from = "value")
#> # A tibble: 5 × 15
#> Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Apples A V 0.798 200 0.82 200 0.82 200 0.82 300 0.82
#> 2 Banan… B W 0.571 200 0.51 200 0.505 200 0.5 200 0.495
#> 3 Banan… B X 0.434 200 0.415 200 0.412 200 0.41 200 0.408
#> 4 Banan… B Y 0.611 200 0.542 200 0.538 200 0.534 200 0.53
#> 5 Cherr… B Z 0.698 200 0.62 200 0.62 200 0.62 200 0.62
#> # … with 3 more variables: y5 <dbl>, x6 <dbl>, y6 <dbl>
我有一个类似于此格式的 R 数据框:
df<-data.frame(
Fruit=c("Apples","Apples","Apples","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Bananas","Cherries","Cherries","Cherries"),
Order=c("A","A","A","B","B","B","B","B","B","B","B","B","B","B","B"),
Producer_ID=c("V","V","V","W","W","W","X","X","X","Y","Y","Y","Z","Z","Z"),
x1=c( 0.7975,NA,NA,0.570975,NA,NA,0.43374375,NA,NA,0.610624,NA,NA,0.6975,NA,NA),
y1=c( 200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA),
x2=c( NA,NA,0.82,NA,NA,0.51,NA,NA,0.415,NA,NA,0.542,NA,NA,0.62),
y2=c( NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200),
x3=c( NA,NA,0.82,NA,NA,0.505,NA,NA,0.4125,NA,NA,0.538,NA,NA,0.62),
y3=c( NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200,NA,NA,200),
x4=c( NA,0.82,NA,NA,NA,0.5,NA,NA,0.41,NA,NA,0.534,NA,NA,0.62),
y4=c( NA,300,NA,NA,NA,200,NA,NA,200,NA,NA,200,NA, NA,200),
x5=c( NA,0.82,NA,NA,NA,0.495,NA,NA,0.4075,NA,NA,0.53,NA,0.62,NA),
y5=c( NA,300,NA,NA,NA,200,NA,NA,200,NA,NA,200,NA,300,NA),
x6=c( NA,0.82,NA,NA,0.49,NA,NA,0.405,NA,NA,0.52,NA,NA,0.62,NA),
y6=c( NA, 300,NA,NA,300,NA,NA,300,NA,NA,300,NA,NA,300,NA)
)
输出:
> df
Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
1 Apples A V 0.7975000 200 NA NA NA NA NA NA NA NA NA NA
2 Apples A V NA NA NA NA NA NA 0.820 300 0.8200 300 0.820 300
3 Apples A V NA NA 0.820 200 0.8200 200 NA NA NA NA NA NA
4 Bananas B W 0.5709750 200 NA NA NA NA NA NA NA NA NA NA
5 Bananas B W NA NA NA NA NA NA NA NA NA NA 0.490 300
6 Bananas B W NA NA 0.510 200 0.5050 200 0.500 200 0.4950 200 NA NA
7 Bananas B X 0.4337437 200 NA NA NA NA NA NA NA NA NA NA
8 Bananas B X NA NA NA NA NA NA NA NA NA NA 0.405 300
9 Bananas B X NA NA 0.415 200 0.4125 200 0.410 200 0.4075 200 NA NA
10 Bananas B Y 0.6106240 200 NA NA NA NA NA NA NA NA NA NA
11 Bananas B Y NA NA NA NA NA NA NA NA NA NA 0.520 300
12 Bananas B Y NA NA 0.542 200 0.5380 200 0.534 200 0.5300 200 NA NA
13 Cherries B Z 0.6975000 200 NA NA NA NA NA NA NA NA NA NA
14 Cherries B Z NA NA NA NA NA NA NA NA 0.6200 300 0.620 300
15 Cherries B Z NA NA 0.620 200 0.6200 200 0.620 200 NA NA NA NA
对于标识符列“水果”、“订单”、“生产者”的每种组合,在剩余的每一列中恰好有一个非 NA 值。我想做的是整理我的数据框以消除 NA 值,将非 NA 值与其标识符对齐,如下所示:
df_final<-data.frame(
Fruit=c("Apples","Bananas","Bananas","Bananas","Cherries"),
Order=c("A","B","B","B","B"),
Producer_ID=c("V","W","X","Y","Z"),
x1=c(0.7975,0.570975,0.43374375,0.610624,0.6975),
y1=c(200,200,200,200,200),
x2=c(0.82,0.51,0.415,0.542,0.62),
y2=c(200,200,200,200,200),
x3=c(0.82,0.505,0.4125,0.538,0.62),
y3=c(200,200,200,200,200),
x4=c(0.82,0.5,0.41,0.534,0.62),
y4=c(300,200,200,200,200),
x5=c(0.82,0.495,0.4075,0.53,0.62),
y5=c(300,200,200,200,300),
x6=c(0.82,0.49,0.405,0.526,0.62),
y6=c(300,300,300,300,300)
)
最终输出:
> df_final
Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
1 Apples A V 0.7975000 200 0.820 200 0.8200 200 0.820 300 0.8200 300 0.820 300
2 Bananas B W 0.5709750 200 0.510 200 0.5050 200 0.500 200 0.4950 200 0.490 300
3 Bananas B X 0.4337437 200 0.415 200 0.4125 200 0.410 200 0.4075 200 0.405 300
4 Bananas B Y 0.6106240 200 0.542 200 0.5380 200 0.534 200 0.5300 200 0.526 300
5 Cherries B Z 0.6975000 200 0.620 200 0.6200 200 0.620 200 0.6200 300 0.620 300
由于一组(Fruit
、Order
和 Producer_ID
)的每一列中只有一个值,因此您可以取 sum
(或 max
) 每列忽略 NA
值。
在基础 R 中 aggregate
-
aggregate(.~Fruit+Order+Producer_ID, df, sum, na.rm = TRUE, na.action = NULL)
# Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5 y5 x6 y6
#1 Apples A V 0.7975000 200 0.820 200 0.8200 200 0.820 300 0.8200 300 0.820 300
#2 Bananas B W 0.5709750 200 0.510 200 0.5050 200 0.500 200 0.4950 200 0.490 300
#3 Bananas B X 0.4337437 200 0.415 200 0.4125 200 0.410 200 0.4075 200 0.405 300
#4 Bananas B Y 0.6106240 200 0.542 200 0.5380 200 0.534 200 0.5300 200 0.520 300
#5 Cherries B Z 0.6975000 200 0.620 200 0.6200 200 0.620 200 0.6200 300 0.620 300
或使用dplyr
-
library(dplyr)
df %>%
group_by(Fruit, Order ,Producer_ID) %>%
summarise(across(x1:y6, sum, na.rm = TRUE))
与@Ronak Shah 的 sum
类似,这里有一个带有 sort
的可选方法,用于聚合同一行中的所有非 NA 值(灵感来自
@r2evans )
df |>
mutate(across(everything(), ~ sort(., na.last = TRUE))) |>
na.omit()
另一种选择是使用 tidyr
重塑为长格式,摆脱 NA,最后重塑回宽格式。
注意:作为中间步骤,我使用 arrange(gsub("^.*?(\d)$", "\1", name))
重新排列数据,以便 wide df 中的列按照您想要的顺序排列。
library(tidyr)
library(dplyr)
df %>%
pivot_longer(-c(Fruit, Order, Producer_ID)) %>%
drop_na() %>%
arrange(gsub("^.*?(\d)$", "\1", name)) %>%
pivot_wider(names_from = "name", values_from = "value")
#> # A tibble: 5 × 15
#> Fruit Order Producer_ID x1 y1 x2 y2 x3 y3 x4 y4 x5
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Apples A V 0.798 200 0.82 200 0.82 200 0.82 300 0.82
#> 2 Banan… B W 0.571 200 0.51 200 0.505 200 0.5 200 0.495
#> 3 Banan… B X 0.434 200 0.415 200 0.412 200 0.41 200 0.408
#> 4 Banan… B Y 0.611 200 0.542 200 0.538 200 0.534 200 0.53
#> 5 Cherr… B Z 0.698 200 0.62 200 0.62 200 0.62 200 0.62
#> # … with 3 more variables: y5 <dbl>, x6 <dbl>, y6 <dbl>