How to recode multiple variables with different values for each variable
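I have a survey dataset with 100+ variables, and almost all of them have 1-10 code values. The code values for each column are provided in a separate df.
Sample data: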
survey_df = structure(list(resp_id = 1:5, gender = c("1", "2", "2", "1",
"1"), state = c("1", "2", "3", "1", "4"), education = c("1",
"1", "1", "2", "2")), class = "data.frame", row.names = c(NA,
-5L))
coded_df = structure(list(col = c("state", "gender", "education"), col_values = c("1-CA,2-TX,3-AZ,4-CO",
"1-Male,2-Female", "1-High School,2-Bachelor")), class = "data.frame", row.names = c(NA,
-3L))
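Since the survey columns change over time/product, I want to avoid any hard-coded recoding, so I have a function that takes a column name as input and returns a named vector from coded_df.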
get_named_vec <- function(x) {
  tmp_chr <- coded_df %>%
    filter(col == x) %>%
    mutate(col_values = str_replace_all(col_values, "\n", "")) %>%
    separate_rows(col_values, sep = ",") %>%
    separate(col_values, into = c("var1", "var2"), sep = "-") %>%
    mutate(var1 = as.character(as.numeric(var1)),
           var2 = str_trim(var2)) %>%
    pull(var2, var1)
  return(tmp_chr)
}
Then I use the named vector as below to update survey_df.
survey_df %>%
  mutate(gender = recode(gender, !!!get_named_vec("gender"), .default = NA_character_))
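So far this works on a per-column basis, which means executing it 100+ times!
But how do I make this run through mutate_at, so that I can selectively recode certain variables in a single execution?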
# This does not work.
to_update_col <- c("state", "gender")
survey_df %>%
  mutate_at(.vars = all_of(to_update_col), .funs = function(x) recode(x, !!!get_named_vec(x)))
Any help is much appreciated!
Thanks,
Vinay
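I expect it would be simpler and more performant to turn this into a pivot-join-pivot operation: convert the source and lookup tables to long format, join them, and then reshape back to wide.
Given this survey data: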
survey_df = structure(list(resp_id = 1:5,
gender = c(1L, 2L, 2L, 1L, 1L),
state = c(1, 2, 3, 1, 4),
education = c(1L, 1L, 1L, 2L, 2L)), class = "data.frame", row.names = c(NA, -5L)) %>%
mutate(across(-resp_id, as.character))
We can convert the lookup table to long format:
coded_df_long <- coded_df %>%
  separate_rows(col_values, sep = ",") %>%
  separate(col_values, c("old", "new"), extra = "merge")
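To check the parsing before joining, the same step can be written with the separator spelled out. This is only a sketch: it assumes every entry has the form code-label and that the first hyphen is the delimiter. With extra = "merge", everything after that first hyphen stays in the label, spaces and further hyphens included.

```r
library(dplyr)
library(tidyr)

# Lookup table from the question
coded_df <- data.frame(
  col = c("state", "gender", "education"),
  col_values = c("1-CA,2-TX,3-AZ,4-CO",
                 "1-Male,2-Female",
                 "1-High School,2-Bachelor")
)

coded_df_long <- coded_df %>%
  # one row per "code-label" pair
  separate_rows(col_values, sep = ",") %>%
  # split on the first hyphen only; the rest is kept as the label
  separate(col_values, into = c("old", "new"), sep = "-", extra = "merge")

print(coded_df_long)
```

Printing coded_df_long should show one row per column/code pair, which is the shape the join needs.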
Then pivot the survey longer, join it to the codes, and pivot wider again.
survey_df %>%
  pivot_longer(-resp_id) %>%
  left_join(coded_df_long, by = c("name" = "col", "value" = "old")) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = new)
Result
# A tibble: 5 x 4
  resp_id gender state education
    <int> <chr>  <chr> <chr>
1       1 Male   CA    High School
2       2 Female TX    High School
3       3 Female AZ    High School
4       4 Male   CA    Bachelor
5       5 Male   CO    Bachelor
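If you would rather keep a column-by-column recode and only touch a chosen subset of columns, one possible sketch uses across() with cur_column(). The mutate_at attempt in the question passes the column's values to the function, not its name, so get_named_vec(x) has no column name to look up. Here the lookups are built once up front and indexed by name inside across(). The names parse_codes and lookups are hypothetical helpers introduced for this illustration; they are not part of the original post.

```r
library(dplyr)

# Sample data from the question
survey_df <- data.frame(
  resp_id = 1:5,
  gender = c("1", "2", "2", "1", "1"),
  state = c("1", "2", "3", "1", "4"),
  education = c("1", "1", "1", "2", "2")
)
coded_df <- data.frame(
  col = c("state", "gender", "education"),
  col_values = c("1-CA,2-TX,3-AZ,4-CO",
                 "1-Male,2-Female",
                 "1-High School,2-Bachelor")
)

# Hypothetical helper: turn "1-CA,2-TX" into c("1" = "CA", "2" = "TX").
# Assumes the first hyphen in each pair separates code from label.
parse_codes <- function(spec) {
  pairs <- strsplit(spec, ",", fixed = TRUE)[[1]]
  old <- sub("-.*$", "", pairs)      # text before the first hyphen
  new <- sub("^[^-]*-", "", pairs)   # text after the first hyphen
  setNames(trimws(new), trimws(old))
}

# One named vector per survey column, keyed by column name
lookups <- setNames(lapply(coded_df$col_values, parse_codes), coded_df$col)

to_update_col <- c("state", "gender")

recoded <- survey_df %>%
  mutate(across(all_of(to_update_col),
                # cur_column() gives the name of the column being processed;
                # codes with no entry in the lookup become NA
                ~ unname(lookups[[cur_column()]][.x])))

print(recoded)
```

Only state and gender are recoded here; education keeps its original codes because it is not listed in to_update_col.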