Data cleaning - conversion to tidyverse
I'm curious whether the following code can be converted to tidyverse code. I tried dplyr::mutate but couldn't get it to work.
df$Gender[df$Gender == "M"] <- "Man"
df$Gender[df$Gender == "Male"] <- "Man"
df$Gender[df$Gender == "F"] <- "Woman"
df$Gender[df$Gender == "Female"] <- "Woman"
df$Gender[df$Gender == "M & F"] <- "Man and Woman"
df$Gender[df$Gender == "Male & Female"] <- "Man and Woman"
Here is one way, using dplyr::case_when():
df$Gender <- dplyr::case_when(
df$Gender %in% c("M", "Male") ~ "Man",
df$Gender %in% c("F", "Female") ~ "Woman",
df$Gender %in% c("M & F", "Male & Female") ~ "Man and Woman",
TRUE ~ NA_character_)
Or, if you prefer the typical dplyr/magrittr pipe-chain approach (this assumes library(dplyr) has been called):
df <- df %>% mutate(Gender = case_when(
Gender %in% c("M", "Male") ~ "Man",
Gender %in% c("F", "Female") ~ "Woman",
Gender %in% c("M & F", "Male & Female") ~ "Man and Woman",
TRUE ~ NA_character_))
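Note that the TRUE ~ NA_character_ fallback turns every value not listed into NA. If you would rather keep unlisted values unchanged, one option is to fall back to the column itself. A minimal sketch, assuming a made-up toy df (the "Non-binary" value is only there to show the fallback):

```r
library(dplyr)

# toy data standing in for the real data frame (hypothetical values)
df <- data.frame(Gender = c("M", "Female", "Non-binary"),
                 stringsAsFactors = FALSE)

df <- df %>% mutate(Gender = case_when(
  Gender %in% c("M", "Male") ~ "Man",
  Gender %in% c("F", "Female") ~ "Woman",
  Gender %in% c("M & F", "Male & Female") ~ "Man and Woman",
  # anything not matched above keeps its original value instead of NA
  TRUE ~ Gender))

df
```

This only works as written when Gender is a character column, since every branch of case_when() has to return the same type.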
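Finally, a tip: when you need to group a large number of unique values, case_when() (or nested ifelse(), or subset assignment, etc.) gets very tedious. One way to avoid most of the pain is to use a named vector as a dictionary-style "lookup table" (an informal term; see the Wikipedia article on "associative array" for some background) and replace each value through it. In my experience this usually feels cleanest: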
# the unique values
gender_values <- c("M","Man","Male","F","Woman","Female","MF","male-female")
# associate unique values with our new labels: "m", "f", and "b"
gender_lkup <- setNames(c("m","m","m","f","f","f","b","b"), gender_values)
# suppose this is a column of a df
raw_column <- sample(gender_values, 10, replace=TRUE)
# create a clean one with `gender_lkup`
clean_column <- gender_lkup[raw_column]
# inspect the two vectors side-by-side
data.frame(original=raw_column, cleaned=clean_column)
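Applied to the recoding in the question, the same idea might look like the sketch below. The toy df and the name gender_map are assumptions for illustration, not part of the original code. Values missing from the lookup come back as NA, which matches the TRUE ~ NA_character_ fallback above.

```r
# toy data standing in for the real data frame (hypothetical values)
df <- data.frame(Gender = c("M", "Female", "M & F", "Male", "unknown"),
                 stringsAsFactors = FALSE)

# named vector: names are the raw values, elements are the clean labels
gender_map <- c(
  "M" = "Man", "Male" = "Man",
  "F" = "Woman", "Female" = "Woman",
  "M & F" = "Man and Woman", "Male & Female" = "Man and Woman"
)

# index the lookup by name; unname() drops the names it carries over
df$Gender <- unname(gender_map[df$Gender])

df
```

Adding a new spelling later means adding one entry to gender_map rather than another condition.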