Adding rows of two categorical variables in a long dataset?

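I have a long-format data frame (panel data) that contains several string variables, a categorical variable, and a variable with numeric values.

The data hold the output of several industrial sectors for each country in a given year. My idea is to add up two of the industries within the same country and year, and to give the newly created industry its own name.

For example, suppose I have the following data frame (named matrix here):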

set.seed(10)

matrix <- cbind.data.frame(country = rep(c("aaa" , "bbb") , each = 6) , industry = rep(c("toys" , "paper") ,  each = 3 , times = 2) , 
                year = rep(c(2000:2002) , times = 4) , production = sample(0:100 , 12) )

which gives:

   country industry year production
1      aaa     toys 2000          8
2      aaa     toys 2001         73
3      aaa     toys 2002         75
4      aaa    paper 2000         54
5      aaa    paper 2001         71
6      aaa    paper 2002         53
7      bbb     toys 2000         38
8      bbb     toys 2001         82
9      bbb     toys 2002         87
10     bbb    paper 2000         14
11     bbb    paper 2001         91
12     bbb    paper 2002         41

I want to add the "toys" production to the "paper" production for each year and country, and call the new industry "toys_and_paper":

  year country       variable value
1 2000     aaa toys_and_paper    62
2 2000     bbb toys_and_paper    52
3 2001     aaa toys_and_paper   144
4 2001     bbb toys_and_paper   173
5 2002     aaa toys_and_paper   128
6 2002     bbb toys_and_paper   128

I know how to do this with reshape2 and tidyverse:

library(reshape2)
library(tidyverse)

# Name the value column explicitly so dcast() does not have to guess it
test <- dcast(matrix , year + country ~ industry , value.var = "production")

test <- test %>%
  mutate(toys_and_paper = paper + toys) %>%
  select(year , country , toys_and_paper)

test <- melt(test , id.vars = c("year" , "country"))

Is there a more direct way?

A base R option using aggregate:

aggregate(
    production ~ .,
    transform(
        matrix,
        industry = ave(industry,
            country,
            year,
            FUN = function(v) paste0(v, collapse = "_and_")
        )
    ), sum
)

which gives

  country       industry year production
1     aaa toys_and_paper 2000         62
2     bbb toys_and_paper 2000         52
3     aaa toys_and_paper 2001        144
4     bbb toys_and_paper 2001        173
5     aaa toys_and_paper 2002        128
6     bbb toys_and_paper 2002        128
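
To see why this works, here is a minimal sketch of the two steps taken separately. The data frame df and its four rows are only an illustration (values copied from the year-2000 rows of the question), not part of the original data. ave() replaces each row's industry with the pasted label of its country-year group, and aggregate() then sums the rows that share that label. Note that this relies on every group containing exactly the industries you want to merge.

```r
# Hypothetical four-row data frame, same shape as the question's data.
df <- data.frame(
  country    = c("aaa", "aaa", "bbb", "bbb"),
  industry   = c("toys", "paper", "toys", "paper"),
  year       = c(2000, 2000, 2000, 2000),
  production = c(8, 54, 38, 14),
  stringsAsFactors = FALSE
)

# Step 1: within each country-year group, paste the industry names together
# and write the resulting label back to every row of that group.
labelled <- transform(
  df,
  industry = ave(industry, country, year,
                 FUN = function(v) paste0(v, collapse = "_and_"))
)
unique(labelled$industry)

# Step 2: sum production over rows sharing country, industry and year.
result <- aggregate(production ~ ., labelled, sum)
result
```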

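I think the example in the original question is confusing, because a real dataset would probably contain more industries.

Here is a toy dataset with more than two industries: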

set.seed(10)

matrix <- cbind.data.frame(
   country = rep(c("aaa", "bbb"), each = 9),
   industry = rep(c("toys", "paper", "other"), each = 3, times = 2),
   year = rep(c(2000:2002), times = 6),
   production = sample(0:100, 18)
)

and a dplyr solution to the problem:

library(dplyr)  # provides the %>% pipe used below

matrix %>% 
   dplyr::mutate(
      industry = dplyr::if_else(
         industry %in% c("toys", "paper"), "toys_and_paper", industry
      )
   ) %>% 
   dplyr::group_by(
      year,
      country,
      industry
   ) %>% 
   dplyr::summarise(
      production = sum(production),
      .groups = "drop"
   )

If the industry column is a factor in the real dataset (as it should be), you can replace the if_else() call with forcats::fct_collapse().
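
A minimal sketch of that factor-based variant, assuming forcats::fct_collapse() is used to merge the two levels. The data frame df and its values are made up for illustration.

```r
library(dplyr)
library(forcats)

# Hypothetical data with industry stored as a factor.
df <- data.frame(
  country    = rep(c("aaa", "bbb"), each = 3),
  industry   = factor(rep(c("toys", "paper", "other"), times = 2)),
  year       = 2000,
  production = c(8, 54, 10, 38, 14, 20)
)

result <- df %>%
  # Merge the levels "toys" and "paper" into one level; "other" is untouched.
  mutate(industry = fct_collapse(industry, toys_and_paper = c("toys", "paper"))) %>%
  group_by(year, country, industry) %>%
  summarise(production = sum(production), .groups = "drop")

result
```

Because the levels are merged by name, industries that are not listed pass through unchanged, as they do in the if_else() version.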

You could also use this (written against the original two-industry example):

library(dplyr)
library(purrr)

matrix %>%
  group_split(country, year) %>%
  map_dfr(~ .x %>%
            add_row(country = .x$country[1], 
                    industry = paste(.x$industry[1], "and", .x$industry[2], sep = "_"),
                    year = .x$year[1], 
                    production = sum(.x$production, na.rm = TRUE))) %>%
  filter(industry == "toys_and_paper") %>%
  arrange(year)

# A tibble: 6 x 4
  country industry        year production
  <chr>   <chr>          <int>      <int>
1 aaa     toys_and_paper  2000         62
2 bbb     toys_and_paper  2000         52
3 aaa     toys_and_paper  2001        144
4 bbb     toys_and_paper  2001        173
5 aaa     toys_and_paper  2002        128
6 bbb     toys_and_paper  2002        128

Or this less verbose one:

matrix %>%
  group_by(country, year) %>%
  # country and year are grouping columns, so summarise() keeps them automatically
  summarise(industry = paste(industry[1], "and", industry[2], sep = "_"), 
            production = sum(production, na.rm = TRUE),
            .groups = "drop")