How do I convert a categorical variable into multiple dummy variables in R?

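I have a column named Age with the values (24 or under, 25 to 34, 35 to 44, 45 to 54, 25 to 34, 24 or under, 35 to 44, 25 to 34, 45 to 54).

I need to recode the values of the categorical variable "Age" as follows: 24 or under becomes 1, 25 to 34 becomes 2, 35 to 44 becomes 3, and 45 to 54 becomes 4.

Can anyone help me with this?

Thank you very much.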

You can use nested ifelse statements:

set.seed(12)
df <- data.frame(Age = sample(c("24 or under", "25 to 34", "35 to 44", "45 to 54"), 20, replace = TRUE))
df$Age_new <- ifelse(df$Age == "24 or under", 1,
                     ifelse(df$Age == "25 to 34", 2,
                            ifelse(df$Age == "35 to 44", 3, 4)))

Result:

df
           Age Age_new
1     25 to 34       2
2     35 to 44       3
3  24 or under       1
4     45 to 54       4
5  24 or under       1
6     35 to 44       3
7     45 to 54       4
8     25 to 34       2
9     45 to 54       4
10    35 to 44       3
11 24 or under       1
12    35 to 44       3
13    25 to 34       2
14 24 or under       1
15    25 to 34       2
16    35 to 44       3
17    25 to 34       2
18    25 to 34       2
19    35 to 44       3
20    25 to 34       2
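
A named lookup vector is another way to express the same mapping. The sketch below is an alternative to the nested ifelse, not part of the answer above; the name age_codes and the "65 or over" value are hypothetical, the latter included only to show that an unmapped label becomes NA instead of silently falling into the last branch (4).

```r
# Map each label to its code; the names are the labels from the question
age_codes <- c("24 or under" = 1, "25 to 34" = 2, "35 to 44" = 3, "45 to 54" = 4)

# "65 or over" is a hypothetical label that has no entry in age_codes
df <- data.frame(Age = c("25 to 34", "45 to 54", "24 or under", "65 or over"))

# as.character() guards against Age being a factor, in which case
# indexing would use the integer codes rather than the labels
df$Age_new <- unname(age_codes[as.character(df$Age)])
```

Adding a new category then only requires one more entry in age_codes rather than another level of nesting.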

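If your Age column is a factor, this effectively already happens behind the scenes (each value is stored as an integer with a corresponding text label). To get those integers explicitly, you can use as.numeric().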

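# Since R 4.0.0, data.frame() keeps strings as character by default,
# so ask for a factor explicitly
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54"), stringsAsFactors = TRUE)

df$Age_cat <- as.numeric(df$Age)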

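If the order of the levels differs from the order you intended, you may run into ordering problems. In that case, you can set the levels of the factor explicitly.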
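
To illustrate the ordering problem, here is a minimal sketch. The label "Under 25" is hypothetical (it does not appear in the question) and is used only because it sorts after the labels that start with a digit.

```r
# Hypothetical labels: "Under 25" stands in for the lowest age group
age <- c("Under 25", "25 to 34", "35 to 44", "Under 25")

# Default levels are sorted, so "Under 25" ends up as the last level
default_codes <- as.numeric(factor(age))

# Explicit levels keep the intended order
ordered_codes <- as.numeric(
  factor(age, levels = c("Under 25", "25 to 34", "35 to 44"))
)
```

With the default levels the youngest group gets the highest code, while the explicit levels give it code 1.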

As pieterbons said, your Age field is effectively already a factor. If you convert Age to a numeric type, you will get the data as numeric categories.

# stringsAsFactors = TRUE is needed in R >= 4.0.0 for Age to be a factor
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54"), stringsAsFactors = TRUE)
df$Age <- as.numeric(df$Age)

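You can also create a new field holding the numeric codes for the Age field, as you described (this option is especially useful if you have a string variable that you want to convert to a factor but it has a very different order). There are several ways to do this: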

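# Start from the original character labels
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54"))

# 1) Base R: factor levels default to sorted order, which matches here
df$age_new <- as.numeric(factor(df$Age))


# 2) dplyr: write to age_new so the original Age labels are kept
library(dplyr)
df <- df %>%
  mutate(age_new = case_when(Age == "24 or under" ~ 1,
                             Age == "25 to 34"    ~ 2,
                             Age == "35 to 44"    ~ 3,
                             TRUE                 ~ 4))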

#> df
#          Age age_new
#1 24 or under       1
#2    25 to 34       2
#3    35 to 44       3
#4    45 to 54       4

If you want dummy variables (i.e. 0 or 1), you can use dplyr::if_else to create a new variable for each category:

library(dplyr)

Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54")
data.frame(age = Age) %>%
    mutate("24 or under" = if_else(age == Age[1], 1, 0),
           "25 to 34"    = if_else(age == Age[2], 1, 0),
           "35 to 44"    = if_else(age == Age[3], 1, 0),
           "45 to 54"    = if_else(age == Age[4], 1, 0))

If you want numeric values, encode your variable as a factor, set the levels in the order you want, and then use as.numeric:

Age <- factor(c("24 or under", "25 to 34", "35 to 44", "45 to 54"),
              levels = c("24 or under", "25 to 34", "35 to 44", "45 to 54"))

as.numeric(Age)
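
For the 0/1 dummy variables mentioned above, base R's model.matrix() can build all the indicator columns in one call, which may be convenient when there are many categories. This is a sketch of an alternative to the if_else columns, not the approach from the answer; the five-row data frame is made up for illustration.

```r
# Illustrative data: five rows using the four labels from the question
df <- data.frame(Age = c("24 or under", "25 to 34", "35 to 44", "45 to 54", "25 to 34"))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Age - 1, data = df)

# Attach the indicator columns to the original data
df_dummies <- cbind(df, dummies)
```

Each row of dummies contains exactly one 1, in the column for that row's age group. The column names are the variable name followed by the level, for example "Age25 to 34".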