How to classify groups based on the individuals who compose them?

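Here is my problem: I have a database of individuals (one individual per row). Each individual belongs to a household (identified by the variable ID_household) and has an age (variable age). What I want to do is create a new column, type, that defines the household type based on the composition of the individuals in the same household: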

Here is the script to import the data. ID_household and age are the original columns. type is the column I want to create, but I don't know how to do it:

data <- data.frame(ID_household = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 11, 11),
           age = c(31, 29, 36, 24, 34, 42, 19, 39, 6, 9, 42, 4, 6, 29, 34, 41, 12, 51, 26, 27, 1, 3),
           type = c("couple", "couple", "single person", "couple", "couple", "single person", "single person",
                    "single parent family", "single parent family", "single parent family", "single person",
                    "couple with children", "couple with children", "couple with children", "couple with children", 
                    "single parent family", "single parent family", "single person", "couple with children",
                    "couple with children", "couple with children", "couple with children"))

data
   ID_household age                 type
1             1  31               couple
2             1  29               couple
3             2  36        single person
4             3  24               couple
5             3  34               couple
6             4  42        single person
7             5  19        single person
8             6  39 single parent family
9             6   6 single parent family
10            6   9 single parent family
11            7  42        single person
12            8   4 couple with children
13            8   6 couple with children
14            8  29 couple with children
15            8  34 couple with children
16            9  41 single parent family
17            9  12 single parent family
18           10  51        single person
19           11  26 couple with children
20           11  27 couple with children
21           11   1 couple with children
22           11   3 couple with children

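I would do this by creating variables for the number of children, the number of adults, and the age gap between the youngest adult and the oldest child, and then using case_when(). In the code below, I compare type2 with the type variable in the dataset: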

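library(dplyr)  # provides %>%, group_by(), mutate(), case_when(), select()

data <- data %>% 
  group_by(ID_household) %>% 
  # Per-household helpers: adults are older than 18, children are 18 or younger
  mutate(n_adult = sum(age > 18), 
         n_kids = sum(age <= 18),
         min_adult_age  = min(age[which(age > 18)]), 
         max_kid_age = ifelse(n_kids > 0, max(age[which(age <= 18)]), 0),  
         # Gap between the youngest adult and the oldest child
         age_diff = min_adult_age - max_kid_age, 
         type2 = case_when(
            n_adult == 2 & n_kids > 0 & age_diff >= 15 ~ "couple with children", 
            n_adult == 1 & n_kids > 0 & age_diff >= 15 ~ "single parent family", 
            n_adult == 2 & n_kids == 0 ~ "couple",
            n_adult == 1 & n_kids == 0 ~ "single person", 
            TRUE ~ NA_character_)) %>%   # anything else stays unclassified
  ungroup() %>% 
  select(-(n_adult:age_diff))  # drop the helper columns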

all(data$type == data$type2)           
#[1] TRUE
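
To see what the case_when() conditions are testing, it can help to look at the per-household counts on their own. The following is only a sketch in base R; the names hh and counts are made up for illustration, and the data frame is a copy of the ID_household and age columns from the question.

```r
# Illustrative copy of the question's two original columns (hh is a hypothetical name)
hh <- data.frame(
  ID_household = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 11, 11),
  age = c(31, 29, 36, 24, 34, 42, 19, 39, 6, 9, 42, 4, 6, 29, 34, 41, 12, 51, 26, 27, 1, 3)
)

# One row per household: number of adults (> 18) and of children (<= 18),
# using the same age cut-off as the dplyr code above
counts <- data.frame(
  ID_household = sort(unique(hh$ID_household)),
  n_adult = as.vector(tapply(hh$age > 18, hh$ID_household, sum)),
  n_kids  = as.vector(tapply(hh$age <= 18, hh$ID_household, sum))
)

counts
```

In this data every household has either one or two adults, so each one matches exactly one of the four labelled branches as long as the age_diff condition holds.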

Here is a base R way with ave.

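# The \(x) lambda shorthand needs R >= 4.1; x holds the ages of one household.
# Note that adults are defined here as age >= 18.
type <- with(data, ave(age, ID_household, FUN = \(x){
  if(length(x) < 2) {
    "single person"
  } else if(length(x) == 2L && all(x >= 18)) {
    "couple"
  } else if(sum(x >= 18) == 1){
    "single parent family"
  } else "couple with children"
}))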

identical(data$type, type)
#[1] TRUE
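
The branching inside the ave() call can also be pulled out into a named function, which makes it easier to try on individual age vectors. This is only a sketch: classify_household is a hypothetical name, and the logic is meant to mirror the anonymous function above, not to replace it.

```r
# Hypothetical helper with the same branching as the anonymous function above
classify_household <- function(ages) {
  n_adults <- sum(ages >= 18)  # adults are 18 or older, as in the ave() answer
  if (length(ages) < 2) {
    "single person"
  } else if (length(ages) == 2L && n_adults == 2L) {
    "couple"
  } else if (n_adults == 1L) {
    "single parent family"
  } else {
    "couple with children"
  }
}

# Try it on a few households taken from the question's data
classify_household(c(31, 29))        # household 1
classify_household(c(39, 6, 9))      # household 6
classify_household(c(4, 6, 29, 34))  # household 8
classify_household(42)               # household 4
```

One thing to keep in mind is that the final branch is a catch-all: any household that does not match the first three conditions, for example three adults living together, would also be labelled "couple with children".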