Sum row values and create new category
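I'm working with a population data frame that holds information for different years and age groups, split into five-year bands. Once I filter the rows for the location I'm interested in, I have this: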
Location age group total90 total95 total00 total05 total10
A 0 to 4 10428 118902 76758 967938 205472
A 5 to 9 18530 238928 260331 277635 303180
A 10 to 14 180428 208902 226758 267938 305472
A 15 to 19 185003 332089 242267 261793 135472
What I want now is to create new age groups, so that I end up with something like this:
Location age group total90 total95 total00 total05 total10
A 5 to 14 198958 447830 487089 545573 608652
A other 195431 450991 319025 1229731 340944
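where
the age group "5 to 14" is the sum of "5 to 9" + "10 to 14" for each year, and
"other" is the sum of "0 to 4" + "15 to 19" for each year.
I tried selecting the numeric columns so that I could add up the totals for each age group and create rows holding the new age groups, but I couldn't find a simple way to add the rows and I kept making things more complicated. I'm sure there is a simple way to solve this, but I'm stuck.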
See my answer below:
My first line reads in the data shown.
library(tidyverse)

# read in data
my_data <- read_csv("pop_data.csv")

# add extra tags (one label per row, in the row order shown above)
my_data1 <- my_data %>%
  mutate(Category = c("other", "5 to 14", "5 to 14", "other")) %>%
  select(-`age group`)

# find numeric columns
numeric_col <- unlist(lapply(my_data1, is.numeric))

# combine the data
my_data2 <- aggregate(my_data1[, numeric_col],
                      by = list(my_data1$Location, my_data1$Category),
                      FUN = sum)

# rename first 2 columns
colnames(my_data2)[1:2] <- c("Location", "age group")
Result:
Location age group total90 total95 total00 total05 total10
1 A 5 to 14 198958 447830 487089 545573 608652
2 A other 195431 450991 319025 1229731 340944
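The hard-coded `Category` vector above depends on the rows arriving in exactly the order shown. A minimal base R sketch that derives the label from the `age group` column instead is given below. It assumes the four rows from the question are rebuilt inline; the names `pop`, `target_groups` and `value_cols` are illustrative and not from the original post.

```r
# Sample data rebuilt inline from the question (assumption: one location, four bands)
pop <- data.frame(
  Location = "A",
  `age group` = c("0 to 4", "5 to 9", "10 to 14", "15 to 19"),
  total90 = c(10428, 18530, 180428, 185003),
  total95 = c(118902, 238928, 208902, 332089),
  total00 = c(76758, 260331, 226758, 242267),
  total05 = c(967938, 277635, 267938, 261793),
  total10 = c(205472, 303180, 305472, 135472),
  check.names = FALSE  # keep the space in the "age group" column name
)

# Label each row by membership rather than by position
target_groups <- c("5 to 9", "10 to 14")
pop$Category <- ifelse(pop[["age group"]] %in% target_groups, "5 to 14", "other")

# Sum every total* column within each Location / Category pair
value_cols <- grep("^total", names(pop), value = TRUE)
result <- aggregate(pop[value_cols],
                    by = list(pop$Location, pop$Category),
                    FUN = sum)

# aggregate() names the grouping columns Group.1 and Group.2, so rename them
names(result)[1:2] <- c("Location", "age group")
print(result)
```

Because the label is computed from the column values, the same code should keep working if the rows are reordered or if more locations are present.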
I had to alter your dummy data slightly (I only removed some spaces so the plain text is easier to read in) to make it work without further manipulation.
df <- data.table::fread("Location age_group total90 total95 total00 total05 total10
A 0_to_4 10428 118902 76758 967938 205472
A 5_to_9 18530 238928 260331 277635 303180
A 10_to_14 180428 208902 226758 267938 305472
A 15_to_19 185003 332089 242267 261793 135472")
library(tidyverse)

df %>%
  # recode the character variable age_group, reducing the problem to one ifelse clause
  dplyr::mutate(age_group = ifelse(age_group == "5_to_9" | age_group == "10_to_14", "5_to_14", "other")) %>%
  # build the grouping (I included Location, but your need may be different)
  dplyr::group_by(Location, age_group) %>%
  # sum all non-grouping columns in one call (so remove Location beforehand if you do not want it in the grouping)
  dplyr::summarize(across(everything(), ~sum(.x))) %>%
  # ungrouping prevents unwanted behaviour downstream
  dplyr::ungroup()
# A tibble: 2 x 7
Location age_group total90 total95 total00 total05 total10
<chr> <chr> <int> <int> <int> <int> <int>
1 A 5_to_14 198958 447830 487089 545573 608652
2 A other 195431 450991 319025 1229731 340944
A dplyr solution:
You can first mutate the `age group` column into the ranges of interest, then summarise across the columns of interest with the sum function.
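library(dplyr)

# df is assumed to hold the original data, with the `age group` column name containing a space
df %>%
  # age-group labels are strings, so they take quotes; backticks are only for column names
  mutate(`age group` = ifelse(`age group` %in% c("5 to 9", "10 to 14"), "5 to 14", "other")) %>%
  group_by(`age group`, Location) %>%
  summarise(across(total90:total10, sum)) %>%
  ungroup()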
For completeness: if you want to change the target start/end ages, here is a way to parameterise target_start and target_end:
library(tidyverse)

target_start <- 5
target_end <- 14

df %>%
  # split "5 to 9" into its numeric lower and upper bounds
  separate(`age group`, into = c("grp_start", "grp_end"), sep = " to ") %>%
  mutate(across(starts_with("grp"), as.numeric),
         age_group =
           if_else(grp_start >= target_start & grp_end <= target_end,
                   # as.character() keeps both if_else branches plain character vectors
                   as.character(glue::glue("{target_start} to {target_end}")),
                   "other")
  ) %>%
  group_by(age_group, Location) %>%
  summarise(across(total90:total10, sum)) %>%
  ungroup()
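The parameterised approach above can also be wrapped in a small function, which makes it easier to try different target ranges. The following is only a sketch under a few assumptions: `collapse_age_groups` and `pop` are hypothetical names introduced here, the sample rows are rebuilt inline from the question, and dplyr (1.0.0 or later) and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original answers): bands lying fully inside
# [target_start, target_end] are merged into one label, all others become "other".
collapse_age_groups <- function(data, target_start, target_end) {
  target_label <- paste(target_start, "to", target_end)
  data %>%
    # convert = TRUE turns the split pieces into integers
    separate(`age group`, into = c("grp_start", "grp_end"),
             sep = " to ", convert = TRUE) %>%
    mutate(`age group` = if_else(grp_start >= target_start & grp_end <= target_end,
                                 target_label, "other")) %>%
    group_by(Location, `age group`) %>%
    # only the total* columns are summed, so grp_start and grp_end are dropped
    summarise(across(starts_with("total"), sum), .groups = "drop")
}

# Sample data rebuilt inline from the question
pop <- tibble(
  Location = "A",
  `age group` = c("0 to 4", "5 to 9", "10 to 14", "15 to 19"),
  total90 = c(10428, 18530, 180428, 185003),
  total95 = c(118902, 238928, 208902, 332089),
  total00 = c(76758, 260331, 226758, 242267),
  total05 = c(967938, 277635, 267938, 261793),
  total10 = c(205472, 303180, 305472, 135472)
)

res_5_14 <- collapse_age_groups(pop, 5, 14)
res_0_9  <- collapse_age_groups(pop, 0, 9)
print(res_5_14)
```

Calling it with `5, 14` should reproduce the table asked for in the question, while `0, 9` merges the two youngest bands instead.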