Generating a dummy variable using grepl()
I wrote the following and it runs without errors.
df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE))
df2$qualifications
This is the output: 1 if any of the words above is mentioned, 0 otherwise.
[1] 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0
[51] 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1
[101] 0 1 0 0
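This is a dataset of job postings and the educational qualifications they ask for. I would like to assign a dummy variable for each education level mentioned in the job description.
Specifically, I am looking for something like the following, where
0 no qualification mentioned
1 high school
2 bachelor
3 master
4 PhD
[1] 0 2 4 1 3 1 0 1 0 1 1 1 2 1 0 1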
Using a for loop:
#placing the possible levels into a vector (defined first, because sample() below needs it)
educ = c("high school", "Bachelor", "master", "phd")
df2 = data.frame(description = sample(educ, 100, TRUE))
df2$qualifications = 0 #new column; 0 means no qualification mentioned
#for each value in educ, if description has that value assign the new column one of the 4 numbers
for(i in educ){
  value = grepl(i, df2$description, ignore.case=TRUE)
  df2$qualifications[which(value)] = (1:4)[educ==i]
}
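In the loop above, later levels overwrite earlier ones, so a description that mentions several levels ends up with the highest code. A minimal vectorized sketch of the same idea follows. The sample descriptions are invented for illustration, and the names hits and qualifications are placeholders, not part of the original answer.

```r
# Hypothetical job descriptions, made up for this example
description <- c("High school diploma required",
                 "Bachelor or Master degree",
                 "PhD in statistics",
                 "No formal requirements")

educ <- c("high school", "bachelor", "master", "phd")

# Logical matrix: one row per description, one column per level
hits <- sapply(educ, function(p) grepl(p, description, ignore.case = TRUE))

# Position of the highest matching level in each row, or 0 when nothing matches
qualifications <- apply(hits, 1, function(row) max(0, which(row)))
qualifications
```

Note that sapply() only returns a matrix when there is more than one description, so a one-row data frame would need separate handling.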
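Since you are already creating a categorical variable, I would suggest using the mapvalues function from plyr:
library(dplyr) #provides %>% and mutate()
tibble::tibble(
  dummy_data = sample(c('no qual', 'high school', 'Bachelor', 'master', 'phd'), 20, replace = TRUE)
) %>%
  mutate(
    #mapvalues returns character codes here, so convert them to integers afterwards
    dummy_variable = plyr::mapvalues(dummy_data, c('no qual', 'high school', 'Bachelor', 'master', 'phd'), 0:4),
    dummy_variable = as.integer(dummy_variable)
  )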
Output (the rows vary between runs because of sample()):
# A tibble: 20 x 2
dummy_data dummy_variable
<chr> <int>
1 no qual 0
2 phd 4
3 phd 4
4 high school 1
5 no qual 0
6 phd 4
7 no qual 0
8 no qual 0
9 no qual 0
10 no qual 0
11 master 3
12 phd 4
13 high school 1
14 no qual 0
15 Bachelor 2
16 high school 1
17 high school 1
18 phd 4
19 phd 4
20 phd 4
You can also do this with case_when from dplyr:
library(dplyr)
#df2 is the data frame from the question
#conditions are checked in order, so the first matching level wins
df2 %>%
  dplyr::mutate(qualifications = case_when(
    grepl("high school", description, ignore.case = TRUE) ~ 1,
    grepl("Bachelor", description, ignore.case = TRUE) ~ 2,
    grepl("master", description, ignore.case = TRUE) ~ 3,
    grepl("phd", description, ignore.case = TRUE) ~ 4,
    TRUE ~ 0
  ))
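If a posting can mention more than one level and each mention should be kept, a single 0 to 4 code loses information. One possible alternative, sketched here in base R, is a separate 0/1 indicator column per level. The data frame df_demo, its contents, and the column names are hypothetical and only serve the example.

```r
# Hypothetical data, made up for this example
df_demo <- data.frame(
  description = c("High school diploma required",
                  "Bachelor or Master degree",
                  "PhD in statistics",
                  "No formal requirements"),
  stringsAsFactors = FALSE
)

# Column name -> search pattern (names chosen for illustration)
educ <- c(high_school = "high school", bachelor = "bachelor",
          master = "master", phd = "phd")

# Add one 0/1 indicator column per education level
for (nm in names(educ)) {
  df_demo[[nm]] <- as.integer(grepl(educ[[nm]], df_demo$description,
                                    ignore.case = TRUE))
}
df_demo
```

With this layout the second row gets a 1 in both bachelor and master, where the single-code approaches keep only one of the two.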