How to find the mean of multiple columns based on a second dataset?

Question

Problem

I need to use a dictionary dataset to determine which columns of a different dataset I should average.

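Data

I will use the iris dataset (built into R) to illustrate my case.

I have two datasets:

The actual data, like iris (but with column names that are basically a1, a2, a3, a4...).
A dictionary for the former that states what each iris column represents. The feature column in dictionary_iris is the grouping variable, and it also forms part of the new variable name (e.g., the new variables will be called Sepal_mean or Petal_mean).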

Iris dataset

library(dplyr)

iris %>% as_tibble()

  # # A tibble: 150 x 5
  # Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  # <dbl>       <dbl>        <dbl>       <dbl> <fct>  
  # 1          5.1         3.5          1.4         0.2 setosa 
  # 2          4.9         3            1.4         0.2 setosa 
  # 3          4.7         3.2          1.3         0.2 setosa 
  # 4          4.6         3.1          1.5         0.2 setosa 
  # 5          5           3.6          1.4         0.2 setosa 
  # 6          5.4         3.9          1.7         0.4 setosa 
  # 7          4.6         3.4          1.4         0.3 setosa 
  # 8          5           3.4          1.5         0.2 setosa 
  # 9          4.4         2.9          1.4         0.2 setosa 
  # 10          4.9         3.1          1.5         0.1 setosa 
  # # ... with 140 more rows

Dictionary dataset

library(tidyr)  # separate() comes from tidyr

dictionary_iris <- tibble(variables = names(iris)) %>% 
  separate(variables, into = c("feature", "measure"), remove = FALSE)

dictionary_iris

# # A tibble: 5 x 3
# variables    feature measure
# <chr>        <chr>   <chr>  
# 1 Sepal.Length Sepal   Length 
# 2 Sepal.Width  Sepal   Width  
# 3 Petal.Length Petal   Length 
# 4 Petal.Width  Petal   Width  
# 5 Species      Species NA

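Expected output

I know how to do this by hand (see below), but I would like to automate the process, because I have a data frame with more than 300 columns and want to compute 23 different means from those columns.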

library(dplyr)

iris %>% 
  rowwise() %>% 
  mutate(Sepal_mean = mean(c(Sepal.Length, Sepal.Width), na.rm = TRUE),
         Petal_mean = mean(c(Petal.Length, Petal.Width), na.rm = TRUE))


# # A tibble: 150 x 7
# # Rowwise: 
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_mean Petal_mean
# <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>      <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        4.3        0.8 
# 2          4.9         3            1.4         0.2 setosa        3.95       0.8 
# 3          4.7         3.2          1.3         0.2 setosa        3.95       0.75
# 4          4.6         3.1          1.5         0.2 setosa        3.85       0.85
# 5          5           3.6          1.4         0.2 setosa        4.3        0.8 
# 6          5.4         3.9          1.7         0.4 setosa        4.65       1.05
# 7          4.6         3.4          1.4         0.3 setosa        4          0.85
# 8          5           3.4          1.5         0.2 setosa        4.2        0.85
# 9          4.4         2.9          1.4         0.2 setosa        3.65       0.8 
# 10          4.9         3.1          1.5         0.1 setosa        4          0.8 
# # ... with 140 more rows

My impression is that I could do this with dplyr::mutate() and dplyr::across(), or with some purrr::map() function, but I am lost.

Answer 1

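If the intention is to use the 'feature' column as the grouping, split the 'variables' column of 'dictionary_iris' by the 'feature' column (dropping the last row (-5), since it does not refer to a numeric column), loop over the resulting list with imap, use transmute to create a column named after each feature from the rowMeans of those columns in 'iris', and bind the result to the original data.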

library(dplyr)
library(purrr)
library(stringr)
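# Split the variable names by feature, dropping row 5 (Species, not numeric)
out <- imap_dfc(split(dictionary_iris$variables[-5], 
        dictionary_iris$feature[-5]),
   # .x = column names in the group, .y = the feature name
   ~ iris %>%
      transmute(!! str_c(.y, "_mean") := 
         rowMeans(across(all_of(.x)), na.rm = TRUE))) %>% 
    bind_cols(iris, .)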

-output

> head(out)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1          5.1         3.5          1.4         0.2  setosa       0.80       4.30
2          4.9         3.0          1.4         0.2  setosa       0.80       3.95
3          4.7         3.2          1.3         0.2  setosa       0.75       3.95
4          4.6         3.1          1.5         0.2  setosa       0.85       3.85
5          5.0         3.6          1.4         0.2  setosa       0.80       4.30
6          5.4         3.9          1.7         0.4  setosa       1.05       4.65
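The same split-then-rowMeans idea can be sketched in base R, which may make it easier to see what each step hands to the next. This is only an illustration, not the answer's method: the names vars, feature, groups, means and out_base are made up here, and the dictionary columns are rebuilt from names(iris) so that the block runs on its own.

```r
# Rebuild the two dictionary columns with base R so the sketch is self-contained.
vars    <- names(iris)[-5]             # drop Species (not numeric)
feature <- sub("\\..*", "", vars)      # text before the first dot: Sepal, Sepal, Petal, Petal

# One character vector of column names per feature.
groups <- split(vars, feature)
names(groups)                          # "Petal" "Sepal" (split() orders groups alphabetically)

# Row-wise mean over each group's columns.
means <- lapply(groups, function(cols) rowMeans(iris[cols], na.rm = TRUE))
names(means) <- paste0(names(means), "_mean")

# Bind the new columns to the original data.
out_base <- cbind(iris, as.data.frame(means))
head(out_base, 3)
```

Because the grouping comes from a plain named list, the same loop would apply unchanged to a dictionary with many more features, as long as every listed column is numeric.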

Answer 2

library(tidyr)

iris_means <- iris %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(-c(id, Species)) %>% 
  # keep only the part before the first dot, e.g. "Sepal.Length" -> "Sepal"
  mutate(name = gsub("\\..*", "", name)) %>% 
  group_by(id, name) %>% 
  summarise(val = mean(value)) %>% 
  ungroup %>%
  filter(name %in% !!dictionary_iris$feature) %>% 
  pivot_wider(id_cols = id, names_from = name, values_from = val, names_glue = "{name}_mean")

iris %>% 
  mutate(id = row_number()) %>% 
  left_join(iris_means) %>% 
  select(-id)


Joining, by = "id"
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal_mean Sepal_mean
1            5.1         3.5          1.4         0.2     setosa       0.80       4.30
2            4.9         3.0          1.4         0.2     setosa       0.80       3.95
3            4.7         3.2          1.3         0.2     setosa       0.75       3.95
4            4.6         3.1          1.5         0.2     setosa       0.85       3.85
5            5.0         3.6          1.4         0.2     setosa       0.80       4.30
6            5.4         3.9          1.7         0.4     setosa       1.05       4.65
