Aggregating by average on 2 conditions while accounting for NA

I have the following table:

Data = structure(list(Countries = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("China", "India", "Vietnam"), class = "factor"), Year = c(2019L, 2018L, 2018L, 2018L, 2017L,  2017L, 2019L, 2019L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L,2017L, 2017L, 2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L,  2019L, 2019L, 2018L, 2017L, 2017L), Food = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bread","Rice"), class = "factor"), Price = c(2.8, 2.8, 2.7, NA, 2.6, 2.58, 2.53, 2.5, NA, NA, 2.395, 2.9, 2.8, 2.75, 2.66, 2.5, 11.5,11.3, 11.2, 11, NA, 10.7, 10.7, NA, NA, 10.3, 10.1, 10)), class = "data.frame", row.names = c(NA, -28L))

The table looks like this:

Countries Year Food Price
China 2019 Bread 2.8
China 2018 Bread 2.8
China 2018 Bread 2.7
China 2018 Bread NA
China 2017 Bread 2.6
China 2017 Bread 2.58
India 2019 Bread 2.53
India 2019 Bread 2.5
India 2018 Bread NA
India 2018 Bread NA
India 2017 Bread 2.395
Vietnam 2018 Bread 2.9
Vietnam 2018 Bread 2.8
Vietnam 2018 Bread 2.75
Vietnam 2017 Bread 2.66
Vietnam 2017 Bread 2.5
China 2019 Rice 11.5
China 2018 Rice 11.3
China 2018 Rice 11.2
China 2018 Rice 11.0
China 2017 Rice NA
China 2017 Rice 10.7
Vietnam 2019 Rice 10.7
Vietnam 2019 Rice NA
Vietnam 2019 Rice NA
Vietnam 2018 Rice 10.3
Vietnam 2017 Rice 10.1
Vietnam 2017 Rice 10.0

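Does anyone know how to aggregate the price of each food by country and year using dplyr and/or the tidyverse, while accounting for NA? (The actual dataset has more countries, years, and foods, but the same format.) That is:

  1. If the bread prices for 2018 are 2.8, 2.7, and NA, the average should be (2.8 + 2.7)/2 rather than (2.8 + 2.7 + 0)/3.
  2. If the bread prices are NA for an entire year, we can discard that group and not print it in the output table at all.

Expected output table: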

Countries Year Food Price
China 2019 Bread 2.8
China 2018 Bread 2.8
China 2017 Bread 2.6
India 2019 Bread 2.5
India 2017 Bread 2.4
Vietnam 2018 Bread 2.8
Vietnam 2017 Bread 2.6
China 2019 Rice 11.5
China 2018 Rice 11.2
China 2017 Rice 10.7
Vietnam 2019 Rice 10.7
Vietnam 2018 Rice 10.3
Vietnam 2017 Rice 10.1

Also, out of genuine curiosity, can this even be done in base R?

dplyr:

library(dplyr)

Data %>%
  group_by(Countries, Year, Food) %>%
  summarise(Price = mean(Price, na.rm = TRUE), .groups = 'drop') %>%
  filter(!is.na(Price)) %>%
  arrange(Food, Countries, desc(Year))

#> # A tibble: 13 × 4
#>    Countries  Year Food  Price
#>    <fct>     <int> <fct> <dbl>
#>  1 China      2019 Bread  2.8 
#>  2 China      2018 Bread  2.75
#>  3 China      2017 Bread  2.59
#>  4 India      2019 Bread  2.51
#>  5 India      2017 Bread  2.40
#>  6 Vietnam    2018 Bread  2.82
#>  7 Vietnam    2017 Bread  2.58
#>  8 China      2019 Rice  11.5 
#>  9 China      2018 Rice  11.2 
#> 10 China      2017 Rice  10.7 
#> 11 Vietnam    2019 Rice  10.7 
#> 12 Vietnam    2018 Rice  10.3 
#> 13 Vietnam    2017 Rice  10.0
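
The filter step above relies on how mean() behaves when every value in a group is missing. A minimal base R check, using values copied from the question's data (the variable names are only illustrative):

```r
# Values copied from the question's data; the names are hypothetical.
partly_missing <- c(2.8, 2.7, NA)          # China, 2018, Bread
all_missing    <- c(NA_real_, NA_real_)    # India, 2018, Bread

default_mean <- mean(partly_missing)                # NA: missing values propagate by default
partial_mean <- mean(partly_missing, na.rm = TRUE)  # averages only 2.8 and 2.7
empty_mean   <- mean(all_missing, na.rm = TRUE)     # NaN: nothing is left to average

# is.na() is TRUE for NaN as well as NA, so filter(!is.na(Price))
# also removes groups whose prices were all missing.
is.na(empty_mean)
```

So with na.rm = TRUE an all-NA group comes back as NaN rather than NA, and either is.na() or is.nan() can be used to drop it.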

Try this:

library(tidyverse)


Data <- structure(list(Countries = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("China", "India", "Vietnam"), class = "factor"), Year = c(2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2019L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2019L, 2019L, 2018L, 2017L, 2017L), Food = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bread", "Rice"), class = "factor"), Price = c(2.8, 2.8, 2.7, NA, 2.6, 2.58, 2.53, 2.5, NA, NA, 2.395, 2.9, 2.8, 2.75, 2.66, 2.5, 11.5, 11.3, 11.2, 11, NA, 10.7, 10.7, NA, NA, 10.3, 10.1, 10)), class = "data.frame", row.names = c(NA, -28L))

Data |> 
  group_by(Countries, Year, Food) |> 
  summarise(Price = round(mean(Price, na.rm = TRUE), 1)) |> 
  arrange(Food, Countries, desc(Year)) |> 
  filter(!is.nan(Price))
#> # A tibble: 13 × 4
#> # Groups:   Countries, Year [8]
#>    Countries  Year Food  Price
#>    <fct>     <int> <fct> <dbl>
#>  1 China      2019 Bread   2.8
#>  2 China      2018 Bread   2.8
#>  3 China      2017 Bread   2.6
#>  4 India      2019 Bread   2.5
#>  5 India      2017 Bread   2.4
#>  6 Vietnam    2018 Bread   2.8
#>  7 Vietnam    2017 Bread   2.6
#>  8 China      2019 Rice   11.5
#>  9 China      2018 Rice   11.2
#> 10 China      2017 Rice   10.7
#> 11 Vietnam    2019 Rice   10.7
#> 12 Vietnam    2018 Rice   10.3
#> 13 Vietnam    2017 Rice   10.1

Created on 2022-05-11 by the reprex package (v2.0.1)

data.table:

library(data.table)
setDT(Data)[!is.na(Price), .(Price = mean(Price, na.rm = TRUE)), by = .(Countries, Year, Food)]

Output:

    Countries Year  Food     Price
 1:     China 2019 Bread  2.800000
 2:     China 2018 Bread  2.750000
 3:     China 2017 Bread  2.590000
 4:     India 2019 Bread  2.515000
 5:     India 2017 Bread  2.395000
 6:   Vietnam 2018 Bread  2.816667
 7:   Vietnam 2017 Bread  2.580000
 8:     China 2019  Rice 11.500000
 9:     China 2018  Rice 11.166667
10:     China 2017  Rice 10.700000
11:   Vietnam 2019  Rice 10.700000
12:   Vietnam 2018  Rice 10.300000
13:   Vietnam 2017  Rice 10.050000

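For a base R answer, aggregate() with a formula automatically drops rows with missing values (its default na.action is na.omit; you can change this behavior with the na.action argument if needed). So:

aggregate(Price ~ Food + Year + Countries, data = Data, FUN = mean)

gives you: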

    Food Year Countries     Price
1  Bread 2017     China  2.590000
2   Rice 2017     China 10.700000
3  Bread 2018     China  2.750000
4   Rice 2018     China 11.166667
5  Bread 2019     China  2.800000
6   Rice 2019     China 11.500000
7  Bread 2017     India  2.395000
8  Bread 2019     India  2.515000
9  Bread 2017   Vietnam  2.580000
10  Rice 2017   Vietnam 10.050000
11 Bread 2018   Vietnam  2.816667
12  Rice 2018   Vietnam 10.300000
13  Rice 2019   Vietnam 10.700000

If you want them in a different order, just rearrange the RHS of the formula.
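
As a rough sketch of that idea, the snippet below rearranges the RHS and then sorts with order() to mimic the layout of the expected table in the question. The small data frame `prices` is a hypothetical subset re-typed from the question's data so the example runs on its own; the descending-year sort is an extra step added here, not part of the original answer.

```r
# Hypothetical subset of the question's data (Bread rows for China and India, 2018-2019).
prices <- data.frame(
  Countries = c("China", "China", "China", "China", "India", "India", "India", "India"),
  Year      = c(2019L, 2018L, 2018L, 2018L, 2019L, 2019L, 2018L, 2018L),
  Food      = "Bread",
  Price     = c(2.8, 2.8, 2.7, NA, 2.53, 2.5, NA, NA)
)

# Columns come out in the order the RHS terms are written.
# Rows with NA Price are dropped first, so India 2018 (all NA) never forms a group.
avg <- aggregate(Price ~ Countries + Year + Food, data = prices, FUN = mean)

# Sort like the expected table: Food, then Countries, then Year descending.
avg <- avg[order(avg$Food, avg$Countries, -avg$Year), ]
rownames(avg) <- NULL
avg
```

If one-decimal prices are wanted, as in the expected table, `avg$Price <- round(avg$Price, 1)` can be applied afterwards.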