How to recode multiple variables with different values for each variable

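I have a survey dataset with 100+ variables, and almost all of them take between 1 and 10 coded values. The code values for each column are provided in another df.

Sample data: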


survey_df = structure(list(resp_id = 1:5, gender = c("1", "2", "2", "1", 
"1"), state = c("1", "2", "3", "1", "4"), education = c("1", 
"1", "1", "2", "2")), class = "data.frame", row.names = c(NA, 
-5L))

coded_df = structure(list(col = c("state", "gender", "education"), col_values = c("1-CA,2-TX,3-AZ,4-CO", 
"1-Male,2-Female", "1-High School,2-Bachelor")), class = "data.frame", row.names = c(NA, 
-3L))

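Because the survey columns change over time and across products, I want to avoid hard-coding any recodes, so I have a function that takes a column name and returns a named vector built from coded_df.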

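library(dplyr)
library(tidyr)
library(stringr)

# Build a named lookup vector (names = codes, values = labels) for column `x`
get_named_vec <- function(x) {
  coded_df %>%
    filter(col == x) %>%
    # drop stray line breaks, then put one "code-label" pair on each row
    mutate(col_values = str_replace_all(col_values, "\n", "")) %>%
    separate_rows(col_values, sep = ",") %>%
    separate(col_values, into = c("var1", "var2"), sep = "-") %>%
    # normalise codes such as "01" to "1" and trim whitespace from labels
    mutate(var1 = as.character(as.numeric(var1)),
           var2 = str_trim(var2)) %>%
    # values come from var2, names from var1
    pull(var2, var1)
}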
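To make the helper's return value concrete, here is a self-contained sketch that runs it on the sample coded_df for one column. It assumes dplyr (1.0.0 or later, for the `name` argument of `pull()`), tidyr and stringr are installed; the name `gender_lookup` is only illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Lookup table from the sample data
coded_df <- data.frame(
  col = c("state", "gender", "education"),
  col_values = c("1-CA,2-TX,3-AZ,4-CO", "1-Male,2-Female", "1-High School,2-Bachelor"),
  stringsAsFactors = FALSE
)

# Same helper as above: names are the codes, values are the labels
get_named_vec <- function(x) {
  coded_df %>%
    filter(col == x) %>%
    mutate(col_values = str_replace_all(col_values, "\n", "")) %>%
    separate_rows(col_values, sep = ",") %>%
    separate(col_values, into = c("var1", "var2"), sep = "-") %>%
    mutate(var1 = as.character(as.numeric(var1)),
           var2 = str_trim(var2)) %>%
    pull(var2, var1)
}

gender_lookup <- get_named_vec("gender")
print(gender_lookup)  # named character vector: "1" maps to Male, "2" to Female
```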

I then use the named vector as shown below to update survey_df.

survey_df %>%
  mutate(gender = recode(gender, !!!get_named_vec("gender"), .default = NA_character_))

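So far this works one column at a time, which means running it 100+ times!

But how can I make this run through mutate_at, so that I can selectively recode certain variables in a single execution?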

# This does not work.
to_update_col <- c("state", "gender")
survey_df %>%
  mutate_at(.vars = all_of(to_update_col), .funs = function(x) recode(x, !!!get_named_vec(x)))

Any help is much appreciated!

Thanks,

Vinay

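I'd expect it to be simpler and more performant to turn this into a pivot-join-pivot operation: convert both the source and the lookup tables to long format, join them, and then reshape back to wide.

Given this survey data: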

survey_df = structure(list(resp_id = 1:5, 
                           gender = c(1L, 2L, 2L, 1L, 1L), 
                           state = c(1, 2, 3, 1, 4), 
                           education = c(1L, 1L, 1L, 2L, 2L)), class = "data.frame", row.names = c(NA, -5L)) %>%
 mutate(across(-resp_id, as.character))

We can convert the lookup table to long format:

coded_df_long <- coded_df %>%
  separate_rows(col_values, sep = ",") %>%
  separate(col_values, c("old", "new"), extra = "merge")
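As a self-contained sketch of that step on the question's coded_df (assuming dplyr and tidyr are installed): with no `sep` given, `separate()` splits on runs of non-alphanumeric characters, and `extra = "merge"` keeps everything after the first split in `new`, so a label such as "High School" stays in one piece.

```r
library(dplyr)
library(tidyr)

# Lookup table from the question
coded_df <- data.frame(
  col = c("state", "gender", "education"),
  col_values = c("1-CA,2-TX,3-AZ,4-CO", "1-Male,2-Female", "1-High School,2-Bachelor"),
  stringsAsFactors = FALSE
)

coded_df_long <- coded_df %>%
  separate_rows(col_values, sep = ",") %>%                  # one "code-label" pair per row
  separate(col_values, c("old", "new"), extra = "merge")    # split only at the first separator

# One row per (col, old, new) combination; `old` stays character,
# which matches the character-typed survey columns when joining
print(coded_df_long)
```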

Then pivot the survey to long format, join it to the codes, and pivot back to wide.

survey_df %>%
  pivot_longer(-resp_id) %>%
  left_join(coded_df_long, by = c("name" = "col", "value" = "old")) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = new)

Result

# A tibble: 5 x 4
  resp_id gender state education  
    <int> <chr>  <chr> <chr>      
1       1 Male   CA    High School
2       2 Female TX    High School
3       3 Female AZ    High School
4       4 Male   CA    Bachelor   
5       5 Male   CO    Bachelor
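
If you would rather keep the question's get_named_vec() helper and recode only selected columns in one pass, one possible sketch uses `across()` with `cur_column()` (dplyr 1.0.0 or later) and plain named-vector indexing instead of `recode()` with `!!!`. This is not part of the pivot-join-pivot approach above. The wrapper `recode_current()` and the result name `recoded_df` are hypothetical names introduced for illustration, and codes missing from the lookup come back as NA.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data from the question
survey_df <- data.frame(
  resp_id = 1:5,
  gender = c("1", "2", "2", "1", "1"),
  state = c("1", "2", "3", "1", "4"),
  education = c("1", "1", "1", "2", "2"),
  stringsAsFactors = FALSE
)
coded_df <- data.frame(
  col = c("state", "gender", "education"),
  col_values = c("1-CA,2-TX,3-AZ,4-CO", "1-Male,2-Female", "1-High School,2-Bachelor"),
  stringsAsFactors = FALSE
)

# Helper from the question: names are the codes, values are the labels
get_named_vec <- function(x) {
  coded_df %>%
    filter(col == x) %>%
    mutate(col_values = str_replace_all(col_values, "\n", "")) %>%
    separate_rows(col_values, sep = ",") %>%
    separate(col_values, into = c("var1", "var2"), sep = "-") %>%
    mutate(var1 = as.character(as.numeric(var1)),
           var2 = str_trim(var2)) %>%
    pull(var2, var1)
}

# Hypothetical wrapper: look up each value of one column by its code
recode_current <- function(values, col_name) {
  force(col_name)                  # resolve cur_column() before any nested dplyr call
  lookup <- get_named_vec(col_name)
  unname(lookup[values])           # index by name; unmatched codes give NA
}

to_update_col <- c("state", "gender")

recoded_df <- survey_df %>%
  mutate(across(all_of(to_update_col), ~ recode_current(.x, cur_column())))

print(recoded_df)  # state and gender hold labels; education keeps its codes
```

The lookup is rebuilt once per column rather than once per value, so for 100+ columns this calls get_named_vec() 100+ times inside a single mutate().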