Create household head based on age and member id

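I have a data frame of household members with 3 integer columns: 'hid', 'sub' and 'age'. I want to add a new logical variable named 'hh' to the data frame that flags the household head, defined as follows:

  1. If the household has only 1 member, the value is TRUE.
  2. If the household has 2 or more members, the head is the person with the smallest subject id ('sub') among those aged 18 to 65 (inclusive).
  3. If no member of the household is aged 18-65, the head is the person with the smallest subject id.

Every household must have exactly 1 head.

My data look like this: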

# A tibble: 10 x 3
     hid   sub   age
   <dbl> <dbl> <dbl>
 1     1     1    75
 2     1     2    55
 3     2     1    35
 4     3     1    69
 5     3     2    72
 6     4     1    69
 7     5     1    15
 8     5     2    17
 9     5     3    42
10     5     4    62

I would like the result to look like this:

> result
# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE  # Not 18-65 & there is another aged 18-65 within this household.
 2     1     2    55 TRUE   # Aged 18-65 and the smallest sub id within this household.
 3     2     1    35 TRUE   # Only 1 in this household.
 4     3     1    69 TRUE   # Not aged 18-65, but no other member is and smallest sub id.
 5     3     2    72 FALSE  # Not aged 18-65, and not the smallest sub id.
 6     4     1    69 TRUE   # Only 1 in this household.
 7     5     1    15 FALSE  # Not aged 18-65 and others in this household qualify.
 8     5     2    17 FALSE  # Not aged 18-65 and others in this household qualify.
 9     5     3    42 TRUE   # Aged 18-65 and the smallest sub id among those aged 18-65 within this household.
10     5     4    62 FALSE  # Aged 18-65 but not the smallest sub id among those aged 18-65 within this household.

Thanks!


d <- structure(list(hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5), 
                      sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
                      age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62)), 
                 row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

Here is one option:

library(dplyr)
d %>% 
    group_by(hid) %>%
    # compare on sub, not age: the head is the member with the smallest subject id
    mutate(hh = if (n() == 1) TRUE else
                if (!any(between(age, 18, 65))) sub == min(sub) else
                sub == min(sub[between(age, 18, 65)])) %>%
    ungroup()

-output

# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE

Or another, simpler option is:

d %>% 
    mutate(rn = row_number()) %>%
    arrange(hid, sub, age) %>%
    group_by(hid) %>% 
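    # rows are sorted by sub within each household, so [1] and first()
    # pick the smallest subject id; compare on sub rather than age
    mutate(hh = sub == coalesce(sub[between(age, 18, 65)][1], 
           first(sub))) %>% 
    ungroup() %>%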
    arrange(rn) %>%
    select(-rn)

-output

# A tibble: 10 x 4
     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE

You can arrange the data so that the first row of each household is the head you are looking for.

library(dplyr)

d %>%
  arrange(hid, !between(age, 18, 65), sub) %>%
  mutate(hh = !duplicated(hid)) 

#     hid   sub   age hh   
#   <dbl> <dbl> <dbl> <lgl>
# 1     1     2    55 TRUE 
# 2     1     1    75 FALSE
# 3     2     1    35 TRUE 
# 4     3     1    69 TRUE 
# 5     3     2    72 FALSE
# 6     4     1    69 TRUE 
# 7     5     3    42 TRUE 
# 8     5     4    62 FALSE
# 9     5     1    15 FALSE
#10     5     2    17 FALSE          

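Sorting on !between(age, 18, 65) uses a logical key: FALSE sorts before TRUE, so members aged 18-65 come first within each household, ahead of those outside the range. Note that the result is returned in this sorted order, not in the original row order.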
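
If the original row order needs to be kept, the same ordering idea can be applied inside each group without rearranging the rows. This is only a sketch under a few assumptions: the example data are rebuilt as a plain data.frame, and the name res is purely illustrative. order() returns the position of the row that would sort first within the household, and row_number() is compared against it.

```r
library(dplyr)

# Example data from the question, rebuilt as a plain data.frame
d <- data.frame(hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
                sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
                age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62))

res <- d %>%
  group_by(hid) %>%
  # order() sorts in-range members first (FALSE before TRUE), then by sub;
  # its first element is the within-group position of the household head
  mutate(hh = row_number() == order(!between(age, 18, 65), sub)[1]) %>%
  ungroup()

res
```

Because nothing is sorted, the rows of res stay in the same order as in d.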

An option with case_when, where each case translates one of your conditions 1 to 3 into code:

library(dplyr)

d %>% 
    group_by(hid) %>% 
    mutate(hh = case_when(n() == 1 ~ TRUE,                                   # rule 1
                          between(age, 18, 65) &
                              sub == min(sub[between(age, 18, 65)]) ~ TRUE,  # rule 2
                          !any(between(age, 18, 65)) & 
                              sub == min(sub) ~ TRUE,                        # rule 3
                          TRUE ~ FALSE)) %>%
    ungroup()

Output:

     hid   sub   age hh   
   <dbl> <dbl> <dbl> <lgl>
 1     1     1    75 FALSE
 2     1     2    55 TRUE 
 3     2     1    35 TRUE 
 4     3     1    69 TRUE 
 5     3     2    72 FALSE
 6     4     1    69 TRUE 
 7     5     1    15 FALSE
 8     5     2    17 FALSE
 9     5     3    42 TRUE 
10     5     4    62 FALSE
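
Since the question requires exactly one head per household, a result like the one above can be sanity-checked by counting the TRUE values per hid. This is a minimal sketch; the names result and heads are illustrative, and the hh column is typed in by hand from the expected output rather than computed.

```r
library(dplyr)

# Expected result from the question, entered by hand for the check
result <- data.frame(hid = c(1, 1, 2, 3, 3, 4, 5, 5, 5, 5),
                     sub = c(1, 2, 1, 1, 2, 1, 1, 2, 3, 4),
                     age = c(75, 55, 35, 69, 72, 69, 15, 17, 42, 62),
                     hh  = c(FALSE, TRUE, TRUE, TRUE, FALSE,
                             TRUE, FALSE, FALSE, TRUE, FALSE))

# Count how many members are flagged as head in each household
heads <- result %>%
  group_by(hid) %>%
  summarise(n_heads = sum(hh), .groups = "drop")

# Stops with an error if any household has zero or several heads
stopifnot(all(heads$n_heads == 1))
```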