Adding rows of two categorical variables in a long dataset?
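I have a long-format dataset (panel data) with a few string variables, a categorical variable, and a variable holding numeric values.
The data contain the output of several industrial sectors for each country in a given year. My idea is to add up two of the industries within the same country and year, and to give the newly created industry a new name.
For example, suppose I have the following data: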
set.seed(10)
matrix <- cbind.data.frame(country = rep(c("aaa" , "bbb") , each = 6) , industry = rep(c("toys" , "paper") , each = 3 , times = 2) ,
year = rep(c(2000:2002) , times = 4) , production = sample(0:100 , 12) )
This gives:
country industry year production
[1,] "aaa" "toys" "2000" "8"
[2,] "aaa" "toys" "2001" "73"
[3,] "aaa" "toys" "2002" "75"
[4,] "aaa" "paper" "2000" "54"
[5,] "aaa" "paper" "2001" "71"
[6,] "aaa" "paper" "2002" "53"
[7,] "bbb" "toys" "2000" "38"
[8,] "bbb" "toys" "2001" "82"
[9,] "bbb" "toys" "2002" "87"
[10,] "bbb" "paper" "2000" "14"
[11,] "bbb" "paper" "2001" "91"
[12,] "bbb" "paper" "2002" "41"
I want to add the "toys" production to the "paper" production for each year and country, and call the new industry "toys_and_paper":
year country variable value
1 2000 aaa toys_and_paper 62
2 2000 bbb toys_and_paper 52
3 2001 aaa toys_and_paper 144
4 2001 bbb toys_and_paper 173
5 2002 aaa toys_and_paper 128
6 2002 bbb toys_and_paper 128
I know how to do this using reshape2 and the tidyverse:
library(reshape2)
library(tidyverse)
test <- dcast(matrix , year + country ~ industry)
test <- test %>%
mutate(toys_and_paper = paper + toys) %>%
select(year , country , toys_and_paper)
test <- melt(test , id.vars = c("year" , "country"))
Is there a more direct way?
A base R option using aggregate:
aggregate(
production ~ .,
transform(
matrix,
industry = ave(industry,
country,
year,
FUN = function(v) paste0(v, collapse = "_and_")
)
), sum
)
which gives
country industry year production
1 aaa toys_and_paper 2000 62
2 bbb toys_and_paper 2000 52
3 aaa toys_and_paper 2001 144
4 bbb toys_and_paper 2001 173
5 aaa toys_and_paper 2002 128
6 bbb toys_and_paper 2002 128
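Note that the ave() call above pastes together every industry found in a country-year group, so with a third industry the label would become something like "toys_and_paper_and_other". A minimal base R sketch that names the two industries explicitly is shown below. The object names dat and to_merge are hypothetical, and the production values are typed in from the question's printed table rather than sampled.

```r
# Explicit copy of the question's example values (no random sampling)
dat <- data.frame(
  country    = rep(c("aaa", "bbb"), each = 6),
  industry   = rep(c("toys", "paper"), each = 3, times = 2),
  year       = rep(2000:2002, times = 4),
  production = c(8, 73, 75, 54, 71, 53, 38, 82, 87, 14, 91, 41)
)

# Industries to merge (hypothetical helper vector)
to_merge <- c("toys", "paper")

# Keep only those industries, then sum production within country and year
combined <- aggregate(
  production ~ country + year,
  data = dat[dat$industry %in% to_merge, ],
  FUN  = sum
)

# Label the merged industry and restore the original column order
combined$industry <- paste(to_merge, collapse = "_and_")
combined <- combined[, c("country", "industry", "year", "production")]
combined
```

Because the rows are filtered before aggregating, any other industries in the data are simply left out of the sum.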
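I think the original question has a confusing example, because the real dataset probably contains more industries.
Here is a toy dataset with more than two industries: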
set.seed(10)
matrix <- cbind.data.frame(
country = rep(c("aaa", "bbb"), each = 9),
industry = rep(c("toys", "paper", "other"), each = 3, times = 2),
year = rep(c(2000:2002), times = 6),
production = sample(0:100, 18)
)
And a dplyr solution to the problem:
matrix %>%
dplyr::mutate(
industry = dplyr::if_else(
industry %in% c("toys", "paper"), "toys_and_paper", industry
)
) %>%
dplyr::group_by(
year,
country,
industry
) %>%
dplyr::summarise(
production = sum(production),
.groups = "drop"
)
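If the industry column is a factor in the real dataset (as it should be), you can replace the if_else() call with forcats::fct_collapse(), which merges the listed levels into one.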
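For a factor column, one way to merge levels is forcats::fct_collapse(). The following is only a sketch under stated assumptions: the object name dat is hypothetical, it covers a single year, and the "other" production values are made up for illustration.

```r
library(dplyr)
library(forcats)

# Small illustrative data set; industry is stored as a factor.
# The "other" values (10 and 20) are invented for this sketch.
dat <- data.frame(
  country    = rep(c("aaa", "bbb"), each = 3),
  industry   = factor(rep(c("toys", "paper", "other"), times = 2)),
  year       = 2000L,
  production = c(8, 54, 10, 38, 14, 20)
)

result <- dat %>%
  # Merge the two factor levels into one; other levels are left untouched
  mutate(industry = fct_collapse(industry, toys_and_paper = c("toys", "paper"))) %>%
  group_by(year, country, industry) %>%
  summarise(production = sum(production), .groups = "drop")

result
```

The remaining levels keep their own rows, so "other" is summed separately from "toys_and_paper".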
You can also use this:
library(dplyr)
library(purrr)
matrix %>%
group_split(country, year) %>%
map_dfr(~ .x %>%
add_row(country = .x$country[1],
industry = paste(.x$industry[1], "and", .x$industry[2], sep = "_"),
year = .x$year[1],
production = sum(.x$production, na.rm = TRUE))) %>%
filter(industry == "toys_and_paper") %>%
arrange(year)
# A tibble: 6 x 4
country industry year production
<chr> <chr> <int> <int>
1 aaa toys_and_paper 2000 62
2 bbb toys_and_paper 2000 52
3 aaa toys_and_paper 2001 144
4 bbb toys_and_paper 2001 173
5 aaa toys_and_paper 2002 128
6 bbb toys_and_paper 2002 128
Or this less verbose one:
matrix %>%
group_by(country, year) %>%
summarise(country = country[1],
industry = paste(industry[1], "and", industry[2], sep = "_"),
year = year[1],
production = sum(production, na.rm = TRUE))
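Both versions above build the label from industry[1] and industry[2], so they rely on each country-year group containing exactly those two industries in that order. If the goal is to keep the original rows and append the combined industry as extra rows, a sketch along the following lines might work. The names dat, combined and dat_with_total are hypothetical, and the values are typed in from the question's printed table.

```r
library(dplyr)

# Explicit copy of the question's example values (no random sampling)
dat <- data.frame(
  country    = rep(c("aaa", "bbb"), each = 6),
  industry   = rep(c("toys", "paper"), each = 3, times = 2),
  year       = rep(2000:2002, times = 4),
  production = c(8, 73, 75, 54, 71, 53, 38, 82, 87, 14, 91, 41)
)

# Sum the two named industries within each country and year
combined <- dat %>%
  filter(industry %in% c("toys", "paper")) %>%
  group_by(country, year) %>%
  summarise(production = sum(production), .groups = "drop") %>%
  mutate(industry = "toys_and_paper")

# Append the new rows to the original long data (columns are matched by name)
dat_with_total <- bind_rows(dat, combined)
dat_with_total
```

Filtering by name rather than by position means additional industries in the data do not affect the result.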