How to classify groups based on the individuals who compose them?
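Here is my problem: I have a database of individuals (one individual per row). Each individual belongs to a household (identified by the variable ID_household) and has an age (variable age). I would like to create a new column, type, that defines the household type from the composition of the individuals in the same household:
- if there are 2 adults (two people over 18), type takes the value "couple";
- if there is 1 adult and at least 1 minor, with a minimum age difference of 15 years = "single parent family";
- if there are 2 adults and at least 1 minor, with a minimum age difference of 15 years = "couple with children";
- if there is only one person = "single person".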
Here is the script to import the data. ID_household and age are the original columns. type is the column I would like to create, but I don't know how to do it:
data <- data.frame(ID_household = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 11, 11),
age = c(31, 29, 36, 24, 34, 42, 19, 39, 6, 9, 42, 4, 6, 29, 34, 41, 12, 51, 26, 27, 1, 3),
type = c("couple", "couple", "single person", "couple", "couple", "single person", "single person",
"single parent family", "single parent family", "single parent family", "single person",
"couple with children", "couple with children", "couple with children", "couple with children",
"single parent family", "single parent family", "single person", "couple with children",
"couple with children", "couple with children", "couple with children"))
data
ID_household age type
1 1 31 couple
2 1 29 couple
3 2 36 single person
4 3 24 couple
5 3 34 couple
6 4 42 single person
7 5 19 single person
8 6 39 single parent family
9 6 6 single parent family
10 6 9 single parent family
11 7 42 single person
12 8 4 couple with children
13 8 6 couple with children
14 8 29 couple with children
15 8 34 couple with children
16 9 41 single parent family
17 9 12 single parent family
18 10 51 single person
19 11 26 couple with children
20 11 27 couple with children
21 11 1 couple with children
22 11 3 couple with children
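I would do this by creating variables for the number of children, the number of adults, and the age difference, and then using case_when(). In the code below, I compare type2 with the type variable already in the dataset: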
library(dplyr)

data <- data %>%
  group_by(ID_household) %>%
  mutate(n_adult = sum(age > 18),        # adults per household
         n_kids = sum(age <= 18),        # minors per household
         min_adult_age = min(age[which(age > 18)]),
         max_kid_age = ifelse(n_kids > 0, max(age[which(age <= 18)]), 0),
         # gap between the youngest adult and the oldest minor
         age_diff = min_adult_age - max_kid_age,
         type2 = case_when(
           n_adult == 2 & n_kids > 0 & age_diff >= 15 ~ "couple with children",
           n_adult == 1 & n_kids > 0 & age_diff >= 15 ~ "single parent family",
           n_adult == 2 & n_kids == 0 ~ "couple",
           n_adult == 1 & n_kids == 0 ~ "single person",
           TRUE ~ NA_character_)) %>%    # anything else stays unclassified
  select(-(n_adult:age_diff))
all(data$type == data$type2)
#[1] TRUE
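The sample data only contains households that match one of the four rules, so the final NA branch of case_when() is never exercised. The sketch below wraps the same rules in a hypothetical helper, classify_households() (my name, not part of the answer above), and runs it on a small made-up data frame to show which households fall through. The guards around min() and max() are an assumption of mine to avoid calling them on empty subsets.

```r
library(dplyr)

# Hypothetical helper: same rules as above, with guards for empty subsets.
classify_households <- function(df) {
  df %>%
    group_by(ID_household) %>%
    mutate(
      n_adult = sum(age > 18),
      n_kids  = sum(age <= 18),
      # Only call min()/max() when the subset is non-empty
      min_adult_age = if (any(age > 18)) min(age[age > 18]) else NA_real_,
      max_kid_age   = if (any(age <= 18)) max(age[age <= 18]) else NA_real_,
      age_diff = min_adult_age - max_kid_age,
      type2 = case_when(
        n_adult == 2 & n_kids > 0 & age_diff >= 15 ~ "couple with children",
        n_adult == 1 & n_kids > 0 & age_diff >= 15 ~ "single parent family",
        n_adult == 2 & n_kids == 0 ~ "couple",
        n_adult == 1 & n_kids == 0 ~ "single person",
        TRUE ~ NA_character_
      )
    ) %>%
    ungroup() %>%
    select(-(n_adult:age_diff))
}

# Made-up edge cases (not from the question's data):
#   household 1: three adults
#   household 2: one adult and one minor only 11 years apart
#   household 3: one adult and one minor 28 years apart
edge <- data.frame(
  ID_household = c(1, 1, 1, 2, 2, 3, 3),
  age          = c(40, 38, 35, 25, 14, 30, 2)
)

res <- classify_households(edge)
res
```

Households 1 and 2 match none of the rules and get NA in type2, while household 3 is labelled "single parent family". Whether NA is the right outcome for such households depends on how the question's rules are meant to extend, which the original post does not say.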
Here is a base R way with ave.
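# Note: \(x) is the lambda shorthand available from R 4.1.0
type <- with(data, ave(age, ID_household, FUN = \(x){
  if(length(x) < 2) {
    "single person"               # one member
  } else if(length(x) == 2L && all(x >= 18)) {
    "couple"                      # exactly two members, both 18 or older
  } else if(sum(x >= 18) == 1){
    "single parent family"        # exactly one member aged 18 or older
  } else "couple with children"   # everything else; the 15-year gap is not checked
}))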
identical(data$type, type)
#[1] TRUE
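The ave() version above matches the sample data but does not test the 15-year age difference from the question. As a rough sketch of how that rule could be added in base R, the hypothetical function household_type() below classifies one household from its ages, and tapply() applies it per household. The function name, the adult_age and min_gap parameters, and treating "over 18" as age > 18 are all my assumptions.

```r
# Hypothetical helper: classify one household from its vector of ages.
household_type <- function(ages, adult_age = 18, min_gap = 15) {
  adults <- ages[ages > adult_age]
  kids   <- ages[ages <= adult_age]

  # Households without minors
  if (length(kids) == 0) {
    if (length(adults) == 1) return("single person")
    if (length(adults) == 2) return("couple")
    return(NA_character_)
  }

  # Households with minors need 1 or 2 adults and a large enough age gap
  if (length(adults) == 0 || length(adults) > 2) return(NA_character_)
  if (min(adults) - max(kids) < min_gap) return(NA_character_)
  if (length(adults) == 1) "single parent family" else "couple with children"
}

# Made-up example data (household 4 has an 11-year gap only)
df <- data.frame(
  ID_household = c(1, 1, 2, 3, 3, 3, 4, 4),
  age          = c(31, 29, 36, 39, 6, 9, 25, 14)
)

# One label per household, then expand back to one label per row
labels  <- tapply(df$age, df$ID_household, household_type)
df$type <- as.vector(labels[as.character(df$ID_household)])
df
```

Computing one label per household and then looking it up by ID keeps the classification logic separate from the row-level expansion, so the rules can be tested on plain age vectors.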