如何以更好的方式整理多列数据集

Question

我正在努力提高我的 R 脚本技能。

我想整理以下数据框。

discipline applications_total applications_men applications_women awards_total awards_men awards_women
1   Chemical sciences                122               83                 39           32         22           10
2   Physical sciences                174              135                 39           35         26            9
3             Physics                 76               67                  9           20         18            2
4          Humanities                396              230                166           65         33           32
5  Technical sciences                251              189                 62           43         30           13
6   Interdisciplinary                183              105                 78           29         12           17
7 Earth/life sciences                282              156                126           56         38           18
8     Social sciences                834              425                409          112         65           47
9    Medical sciences                505              245                260           75         46           29
  success_rates_total success_rates_men success_rates_women
1                26.2              26.5                25.6
2                20.1              19.3                23.1
3                26.3              26.9                22.2
4                16.4              14.3                19.3
5                17.1              15.9                21.0
6                15.8              11.4                21.8
7                19.9              24.4                14.3
8                13.4              15.3                11.5
9                14.9              18.8                11.2

我想出了以下脚本（经过几个小时在这里研究答案）


library(dplyr)
library(tidyr)
library(stringr)
library(dslabs)

data("research_funding_rates")
research_funding_rates


#using gather

dat <- research_funding_rates %>%
  select(-applications_total,-awards_total, -success_rates_total) %>%
  gather(gender1, rate, starts_with("success_rates_")) %>%
  mutate(gender1 = str_sub(gender1,15)) %>%
  gather(gender2, applications, starts_with("applications_")) %>%
  mutate(gender2 = str_sub(gender2,14)) %>%
  gather(gender, awards, starts_with("awards_")) %>%
  mutate(gender = str_sub(gender,8)) %>%
  filter(gender1 == gender2 & gender1 == gender) %>%
  select(discipline, gender, applications, awards, rate)


#using pivot_longer

research_funding_rates %>%
  select(-applications_total,-awards_total, -success_rates_total) %>%
  pivot_longer(-discipline, names_to = "Name", values_to = "Count") %>%
  mutate(gender = ifelse(row_number() %% 2 == 1, "men","women") ) %>%
  mutate(Name = str_remove_all(Name,c("_men","_women"))) %>%
  group_by(Name) %>%
  spread(Name, Count, convert = TRUE) %>%
  arrange(gender)

我想知道是否有更有效的方法来完成这项任务。谢谢。

Answer 1

我发现使用 gather 的脚本效率更高。

research_funding_rates %>%
  select(-applications_total,-awards_total, -success_rates_total) %>%
  rename( rate_men = success_rates_men, rate_women = success_rates_women) %>%
  gather(key, value, -discipline) %>% #-discipline have to come last
  separate(key, c("type", "gender")) %>% 
  spread(type, value)

separate() “知道”要使用哪个字符串字符。使用 rename 和 separate 应该会简化 pivot_longer 解决方案。

如何以更好的方式整理多列数据集

How to tidy multiple column data sets in a better way

pivot-table

tidy

tibble

gather