Generating a dummy variable using grepl()

I wrote the following, and it runs without errors.

df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE))
df2$qualifications

This is the output: it shows 1 if any of the words above is mentioned and 0 otherwise.

  [1] 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0
 [51] 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1
[101] 0 1 0 0

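The dataset contains job postings along with the educational qualifications they ask for. I would like to assign a value for each level of education mentioned in the job description.

Specifically, I am looking for something like the output below, where 0 means no qualification is mentioned, 1 = high school, 2 = bachelor's, 3 = master's, and 4 = PhD: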

[1] 0 2 4 1 3 1 0 1 0 1 1 1 2 1 0 1

Using a for loop:

# Possible levels, ordered from lowest to highest
educ = c("high school", "Bachelor", "master", "phd")

# Example data: each description is one of the levels
df2 = data.frame(description = sample(educ, 100, TRUE))
df2$qualifications = 0 # new column; 0 = no qualification mentioned

# For each level, give matching rows that level's position in educ (1 to 4).
# If a description mentions several levels, the last (highest) match wins.
for(i in educ){
  value = grepl(i, df2$description, ignore.case=TRUE)
  df2$qualifications[value] = match(i, educ)
}
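
The same idea can also be written without an explicit loop. This is only a minimal sketch: the three descriptions are made up for illustration, and rows with no match get 0, as the question asks.

```r
# Levels ordered from lowest to highest, as in the loop above
educ <- c("high school", "Bachelor", "master", "phd")

# Hypothetical descriptions, made up for illustration
description <- c("High school diploma required",
                 "Bachelor or Master degree preferred",
                 "Two years of sales experience")

# One column per level: TRUE where the description mentions that level
hits <- sapply(educ, grepl, x = description, ignore.case = TRUE)

# Replace each TRUE with its column number (the level code), then keep
# the largest code in each row; a row with no match stays at 0
qualifications <- apply(hits * col(hits), 1, max)
print(qualifications)
```

Taking the row maximum reproduces what the loop does when a description mentions more than one level: the highest level is kept.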

Since you are already creating a categorical variable, I would suggest using the mapvalues function from plyr:

library(dplyr)  # provides %>% and mutate()

quals <- c('no qual', 'high school', 'Bachelor', 'master', 'phd')

tibble::tibble(
  dummy_data = sample(quals, 20, replace = TRUE)
) %>% 
  mutate(
    # map each label to its code 0 to 4 (the result is still character)
    dummy_variable = plyr::mapvalues(dummy_data, quals, 0:4),
    dummy_variable = as.integer(dummy_variable)
  )

Output:

# A tibble: 20 x 2
   dummy_data  dummy_variable
   <chr>                <int>
 1 no qual                  0
 2 phd                      4
 3 phd                      4
 4 high school              1
 5 no qual                  0
 6 phd                      4
 7 no qual                  0
 8 no qual                  0
 9 no qual                  0
10 no qual                  0
11 master                   3
12 phd                      4
13 high school              1
14 no qual                  0
15 Bachelor                 2
16 high school              1
17 high school              1
18 phd                      4
19 phd                      4
20 phd                      4
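
If you would rather not depend on plyr, base R's match() should give the same codes for data like this. A minimal sketch, with a made-up input vector:

```r
# Labels ordered so that position minus one is the desired code
levels_in_order <- c("no qual", "high school", "Bachelor", "master", "phd")

# Hypothetical input, made up for illustration
dummy_data <- c("phd", "no qual", "Bachelor", "high school", "master")

# match() returns each value's position in levels_in_order (1 to 5);
# subtracting 1 shifts that to the 0 to 4 coding
dummy_variable <- match(dummy_data, levels_in_order) - 1L
print(dummy_variable)
```

Like mapvalues, this only works when each cell equals one of the labels exactly. It does not search inside longer text, so free-text job descriptions still need grepl.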

You can also do this with case_when from dplyr:

library(dplyr)

# case_when uses the first condition that is TRUE for each row
df2 %>% 
  dplyr::mutate(qualifications = case_when(
    grepl("high school", description, ignore.case = TRUE) ~ 1,
    grepl("Bachelor", description, ignore.case = TRUE) ~ 2,
    grepl("master", description, ignore.case = TRUE) ~ 3,
    grepl("phd", description, ignore.case = TRUE) ~ 4,
    TRUE ~ 0  # no qualification mentioned
  ))
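
Note that case_when stops at the first condition that is TRUE, so with the conditions in the order above, a description mentioning both "high school" and "phd" is coded 1. If the highest level should win instead, one option is to list the conditions from highest to lowest. A sketch with made-up descriptions (the jobs data frame is hypothetical):

```r
library(dplyr)

# Hypothetical descriptions, made up for illustration
jobs <- data.frame(description = c("High school diploma, PhD a plus",
                                   "Master's degree in statistics",
                                   "No formal requirements"))

# Conditions listed from highest to lowest level, so the highest mention wins
jobs <- jobs %>%
  mutate(qualifications = case_when(
    grepl("phd", description, ignore.case = TRUE) ~ 4,
    grepl("master", description, ignore.case = TRUE) ~ 3,
    grepl("Bachelor", description, ignore.case = TRUE) ~ 2,
    grepl("high school", description, ignore.case = TRUE) ~ 1,
    TRUE ~ 0
  ))
print(jobs)
```

Which order is right depends on whether you want the minimum or the maximum qualification named in a posting.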