捕获列中使用 "to" 指示的范围 - tidycensus (U.S.Census API)

Capturing the range indicated by the use of "to" in a column - tidycensus (U.S. Census API)

如何在一个列中捕获所有年龄,其中一个组的值为“20 到 24 岁”,另一组的值为“22 到 24 岁”?这将使我能够确认我在 tidycensus(R 包)U.S 中捕获了所有工作年龄 (18-64) 变量名称。人口普查API查询。

目标

我想要的是,对于这个例子中的 20-24 岁,一个从标签条目中提取年龄的数据框,如 "22 to 24 years":

MEN  WOMEN ETHNORACE
18   18    BLACK
19   19    BLACK
20   20    BLACK
21   21    BLACK
22   22    BLACK
23   23    BLACK
24         BLACK

然后我可以轻松地创建一个包含所有年龄的数据框并进行比较以查看是否遗漏任何内容。

人口普查变量 (tidycensus)

可以在 https://api.census.gov/data/2019/acs/acs5/variables.html 看到至少 U.S 的美国社区调查 (ACS)。人口普查的年龄范围字段具有不同的语法(例如“20 岁”和“22 至 24 岁”):

来自 tidycensus 包的 load_variables 函数的示例行

tidycensus R 包版本 1.1

## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)

v19     <- load_variables(2019, "acs5", cache = TRUE)
v19 %>% 
  dplyr::filter(
    str_detect(label, "18|20|24") & 
               concept %in% c("SEX BY AGE",
                              "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
               grepl('FEMALE', toupper(label))
                )

v19_Total_AndBlack_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, "18|20|24") & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE', toupper(label)))

 print(v19_Total_AndBlack_Age18_24)

  name        label                                      concept                                     
  <chr>       <chr>                                      <chr>                                       
1 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
2 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
3 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...

在这个例子中,我想确保 18-24 的每个年龄 TotalBlack populations 存在于如下数据框中 - 注意使用来自上述 v19_Total_AndBlack_Age18_24.

的人口普查 API 名称
v19_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
 )

请注意 Total "22 to 24 years" 与 Black "20 到 24 岁".

让我们关注上面的数据框 v19_Total_AndBlack_Age18_24,它列出了 18 - 24 岁的人口普查 API 姓名和标签,旨在确认所有年份都存在。

我可以通过正则表达式获取所有年龄段的数字:

unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"

但是我尝试按类别分组的尝试失败了,当单词“to”出现在“20 到 24”中时,我仍然需要获得一个跨越年龄范围的向量。

v19_Total_AndBlack_Age18_24_grp <- 
  v19_Total_AndBlack_Age18_24 %>% 
   mutate(EthnoRace = case_when(
   grepl('BLACK', concept) ~ "BLACK",
   TRUE ~ "TOTAL"))

v19_Total_AndBlack_Age18_24_grp %>% 
  group_by(EthnoRace) %>% 
  mutate(ages = str_extract_all(label, "\d{2"))

错误

Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".

我倾向于分两步进行。第一步指定元数据文件中的一些特征。第二步将元数据应用于问题。

您似乎需要近似一些,因为级别不完全符合您的界限。例如,“15 至 19 岁”跨越 18.

(为了简单起见,我假设您永远不想保留“20 到 24 岁”级别,但排除“20 到 24 岁 " 级别。

# Step 1a: create a list of poential age labels
v19 |> 
  dplyr::mutate(
    concept_age = grepl(pattern = "AGE$", concept),       # Concept must end with "AGE"
  ) |> 
  dplyr::filter(concept_age) |> 
  tidyr::separate_rows(label, sep = "!!")  |>             # Isolate the different dimensions of a variable
  dplyr::rename(level = label) |> 
  dplyr::mutate(
    level_year  = grepl(pattern = "\byears?\b", level), # Label must contain "year" or "years"
  ) |> 
  dplyr::filter(level_year) |> 
  dplyr::count(level, name = "variable_count") |>         # Reduce to the unique (overlapping) age levels
  dplyr::mutate(
    desired = TRUE                                        # Create variable to manually toggle in Step 2
  ) |> 
  dplyr::arrange(level) |>                                # Careful this is still a string, so "26" precedes "3"
  # View()
  readr::write_csv(path_metadata_age_label)

# Step 1b: Manual edit the 78 `desired` values in the csv & save.

第一步输出:

level,variable_count,desired
10 to 14 years,2,TRUE
12 to 14 years,5,TRUE
12 to 17 years,5,TRUE
15 to 17 years,45,TRUE
15 to 19 years old,6,TRUE
...
# Step 2a: Read your metadata, retaining only the desired age levels.
pattern_age <-
  path_metadata_age_label |> 
  readr::read_csv() |> 
  dplyr::filter(desired) |> 
  dplyr::mutate(
    level = paste0("\b", level, "\b")   # Text starts & stops with a word boundary
  ) |> 
  dplyr::pull(level) |> 
  paste(collapse = "|")

# Step 2b: Apply the age levels to `v19`
v19 |> 
  dplyr::mutate(
     keep = grepl(pattern_age, label, perl = T)
  ) |> 
  dplyr::filter(keep)

第 2 步输出:

   name       label                                    concept    keep 
   <chr>      <chr>                                    <chr>      <lgl>
 1 B01001_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE TRUE 
 2 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE TRUE 
 3 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE TRUE 
 4 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE TRUE 
...

首先,人们需要哪些年龄、性别和种族群体的数据?这可以修改为只选择一种性别。 Gender_var 需要出现在 https://api.census.gov/data/2019/acs/acs5/variables.html 处的标签列中(或者可以在创建数据框 v19 时像下面那样使用 load_variables())。

参数

只需将代码设置为 运行。

min_age_desired <- 18
max_age_desired <- 24
Gender_var = c("MALE", "FEMALE")
EthnoRace_var = c("BLACK","TOTAL")

现在,让我们创建一个 QC 数据框,其中包含我们需要的所有民族和年龄组。

加载 R 包

library(arsenal)
library(dplyr)
library(stringr)
library(tidyr)
library(tidycensus)

options(scipen = 8)

验证数据框

拥有人口普查中想要的所有年龄和种族群体API

AGE_var  = as.numeric(seq(min_age_desired, max_age_desired, 1))
all_grp_qc_frm <- 
  data.frame(
    ## dupli
    expand.grid(EthnoRace_var,
                Gender_var,
                AGE_var
          )
  )

colnames(all_grp_qc_frm) <- 
  c("EthnoRace", "Gender", "AGE")

all_grp_qc_frm$AGE <- as.numeric(
  all_grp_qc_frm$AGE)
all_grp_qc_frm$EthnoRace <- as.character(
  all_grp_qc_frm$EthnoRace)
all_grp_qc_frm$Gender <- as.character(
  all_grp_qc_frm$Gender)

all_grp_qc_frm <- all_grp_qc_frm %>% 
  arrange(EthnoRace,Gender,AGE)

print(all_grp_qc_frm)

   EthnoRace Gender AGE
1      BLACK   MALE  18
2      BLACK   MALE  19
3      BLACK   MALE  20
4      BLACK   MALE  21
5      BLACK   MALE  22
6      BLACK   MALE  23
7      BLACK   MALE  24
8      BLACK FEMALE  18
9      BLACK FEMALE  19
10     BLACK FEMALE  20
11     BLACK FEMALE  21
12     BLACK FEMALE  22
13     BLACK FEMALE  23
14     BLACK FEMALE  24
15     TOTAL   MALE  18
16     TOTAL   MALE  19
17     TOTAL   MALE  20
18     TOTAL   MALE  21
19     TOTAL   MALE  22
20     TOTAL   MALE  23
21     TOTAL   MALE  24
22     TOTAL FEMALE  18
23     TOTAL FEMALE  19
24     TOTAL FEMALE  20
25     TOTAL FEMALE  21
26     TOTAL FEMALE  22
27     TOTAL FEMALE  23
28     TOTAL FEMALE  24

使用 tidycensus 加载人口普查变量

这是 2019 年美国社区调查的 5 年估计值

v19     <- load_variables(2019, "acs5", cache = TRUE)

将这些变量子集化为需要的变量

通过人口普查有很多变量可用 API。

要进行子集化,我们首先获取一个向量,每个年龄从 18 岁到 24 岁,用竖线分隔。

working_age_vec <- paste0(seq(18,24,1), collapse = "|")

请注意,我需要使用正确的概念值来获取 Black 和跨种族群体的总人口。

v19_Total_And_EthnoRace_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, working_age_vec) & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE|MALE', toupper(label)))

print(v19_Total_And_EthnoRace_Age18_24)

# A tibble: 12 x 3
   name        label                                      concept                                     
   <chr>       <chr>                                      <chr>                                       
 1 B01001_007  Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE                                  
 2 B01001_008  Estimate!!Total:!!Male:!!20 years          SEX BY AGE                                  
 3 B01001_009  Estimate!!Total:!!Male:!!21 years          SEX BY AGE                                  
 4 B01001_010  Estimate!!Total:!!Male:!!22 to 24 years    SEX BY AGE                                  
 5 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
 6 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
 7 B01001_033  Estimate!!Total:!!Female:!!21 years        SEX BY AGE                                  
 8 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
 9 B01001B_007 Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
10 B01001B_008 Estimate!!Total:!!Male:!!20 to 24 years    SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
11 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
12 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)

使用人口普查提取这些变量API

Census_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
  )

获取类似 SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) 的概念值 和标签值,如 Estimate!!Total:!!Male:!!10 to 14 years.

Census_Total_AndBlack_Age18_24 <- left_join(
  Census_Total_AndBlack_Age18_24, 
  v19 %>% 
    select(name, concept, label) %>% 
    rename(variable = name)
 )

正则表达式抽取+序列

用于提取年龄并创建表示每个范围从最低年龄到最高年龄序列的向量的正则表达式。

Census_Total_AndBlack_Age18_24_grp <- 
  Census_Total_AndBlack_Age18_24 %>%
  distinct(label, concept) %>% 
    ## regular expression to extract all the numbers in labels like
    ## Estimate!!Total:!!Male:!!5 to 9 years
    mutate(ages = sapply(str_extract_all(label,"\d{2}"),
                            function(x) paste(x,collapse=""))) %>% 
    mutate(start = str_sub(ages, 1, 2),
             end = str_sub(ages, 3, 4)) %>% 
    mutate(
      start = case_when(
            is.na(start) ~ "99",
            TRUE ~ start),
      end = case_when(
            is.na(end) ~ "99",
            TRUE ~ end)) %>% 
    dplyr::filter(grepl('Female|Male', label)) %>% 
    mutate(Gender = case_when(
      grepl('Female', label) ~ "FEMALE",
      grepl('Male', label) ~ "MALE",
      TRUE ~ "MISSING")) %>% 
    mutate(EthnoRace = case_when(
      grepl('BLACK', concept) ~ "BLACK",
      TRUE ~ "TOTAL")) %>% 
    mutate(end = case_when(
      is.na(end) | end == "" ~ start,
      TRUE ~ end))
    
    Census_Total_AndBlack_Age18_24_grp_sum <- Census_Total_AndBlack_Age18_24_grp %>% 
      group_by(EthnoRace, Gender) %>% 
      summarize(AGE = as.numeric(unlist(purrr::map2(start, end, `:`)))) %>% 
      ungroup() %>% 
      distinct(EthnoRace, Gender, AGE)

质控比较

请记住,所需的种族群体、性别群体和年龄打印在该解决方案的顶部。

arsenal::comparedf(all_grp_qc_frm, Census_Total_AndBlack_Age18_24_grp_sum)

[...]

Not shared: 0 variables and 0 observations.

Differences found in 0/3 variables compared.
0 variables compared have non-identical attributes.