Create presence/absence variables from character string for long data
Suppose I have a data frame like this:
dat<- data.frame(ID= rep(c("A","B","C","D"),4),
test= rep(c("pre","post"),8),
item= c(rep("item1",8),rep("item2",8)),
answer= c("undergraduateeducation_graduateorprofessionalschool_employment",
"graduateorprofessionalschool",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"volunteeractivityoroutreach",
"undergraduateeducation_employment_volunteeractivityoroutreach",
"employment",
"volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment",
"graduateorprofessionalschool",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach",
"volunteeractivityoroutreach",
"undergraduateeducation_employment_volunteeractivityoroutreach",
"employment",
"volunteeractivityoroutreach",
"undergraduateeducation_graduateorprofessionalschool_employment_volunteeractivityoroutreach"))
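The answer column represents a "select all that apply" answer type, where underscores separate the selected answer options. For each ID, test, and item, I want to turn this single variable into multiple presence/absence variables that indicate whether each answer component is present in the string. A 1 means the answer option is present in the respondent's answer, and a 0 means it is absent. The variables undergraduate, graduate, employment, and volunteer in res correspond, respectively, to the following strings in answer: undergraduateeducation, graduateorprofessionalschool, employment, volunteeractivityoroutreach. Spaces have been removed.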
The resulting data frame would look like this:
res<- data.frame(ID= rep(c("A","B","C","D"),4),
test= rep(c("pre","post"),8),
item= c(rep("item1",8),rep("item2",8)),
undergraduate= c(1,0,1,0,1,0,0,1,1,0,1,0,1,0,0,1),
graduate= c(1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1),
employment=c(1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,1),
volunteer=c(0,0,1,1,1,0,1,1,0,0,1,1,1,0,1,1))
In base R you could do:
new_cols <- c('undergraduate', 'graduate', 'employment', 'volunteer')
cbind(dat[1:3],
as.data.frame(do.call(rbind, lapply(strsplit(dat$answer, "_"),
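function(x) {
# "\\b" anchors the match at a word boundary, so "graduate" is not found inside "undergraduateeducation"
z <- sapply(new_cols, function(y) as.numeric(grepl(paste0("\\b", y), x)))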
if(is.vector(z)) z else colSums(z)
}))))
#> ID test item undergraduate graduate employment volunteer
#> 1 A pre item1 1 1 1 0
#> 2 B post item1 0 1 0 0
#> 3 C pre item1 1 1 1 1
#> 4 D post item1 0 0 0 1
#> 5 A pre item1 1 0 1 1
#> 6 B post item1 0 0 1 0
#> 7 C pre item1 0 0 0 1
#> 8 D post item1 1 1 1 1
#> 9 A pre item2 1 1 1 0
#> 10 B post item2 0 1 0 0
#> 11 C pre item2 1 1 1 1
#> 12 D post item2 0 0 0 1
#> 13 A pre item2 1 0 1 1
#> 14 B post item2 0 0 1 0
#> 15 C pre item2 0 0 0 1
#> 16 D post item2 1 1 1 1
Created on 2022-05-05 by the reprex package (v2.0.1)
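The word-boundary anchor in the pattern matters because graduate is also a substring of undergraduateeducation. A minimal sketch of the difference, using one answer string from the question (the variable name parts is only for illustration):

```r
# Split one answer from the question into its components
parts <- strsplit("undergraduateeducation_employment_volunteeractivityoroutreach", "_")[[1]]

# Without a word boundary, "graduate" also matches inside "undergraduateeducation"
grepl("graduate", parts)
#> [1]  TRUE FALSE FALSE

# With \\b the match must begin at a word boundary, so nothing matches here
grepl("\\bgraduate", parts)
#> [1] FALSE FALSE FALSE
```

Without the anchor, this respondent would be wrongly flagged as having selected the graduate option.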
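One option is to use tidyverse: split the data into rows on _, then keep only the keyword (which will become the column name). Then we create a value column to record presence, pivot to wide format, and fill the remaining values with 0.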
library(tidyverse)
result <- dat %>%
mutate(rn = row_number()) %>%
separate_rows(answer, sep = "_") %>%
mutate(answer = str_extract(answer, "undergraduate|graduate|employment|volunteer"),
value = 1) %>%
pivot_wider(names_from = "answer", values_from = "value", values_fill = 0) %>%
select(-rn)
Output
ID test item undergraduate graduate employment volunteer
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A pre item1 1 1 1 0
2 B post item1 0 1 0 0
3 C pre item1 1 1 1 1
4 D post item1 0 0 0 1
5 A pre item1 1 0 1 1
6 B post item1 0 0 1 0
7 C pre item1 0 0 0 1
8 D post item1 1 1 1 1
9 A pre item2 1 1 1 0
10 B post item2 0 1 0 0
11 C pre item2 1 1 1 1
12 D post item2 0 0 0 1
13 A pre item2 1 0 1 1
14 B post item2 0 0 1 0
15 C pre item2 0 0 0 1
16 D post item2 1 1 1 1
Test
identical(result, as_tibble(res))
#[1] TRUE
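Because the question lists the four full option strings and the column each one maps to, an exact-match lookup is another way to build the same indicators without regular expressions. The sketch below is an illustrative variation, not part of either answer. The names lookup and to_flags are made up for the example, and it runs on two answer strings from the question rather than the full dat.

```r
# Full option strings from the question (values) mapped to column names (names)
lookup <- c(undergraduate = "undergraduateeducation",
            graduate      = "graduateorprofessionalschool",
            employment    = "employment",
            volunteer     = "volunteeractivityoroutreach")

# Hypothetical helper: returns a 0/1 integer matrix, one row per answer string
to_flags <- function(answer) {
  # Split each answer on the literal underscore
  parts <- strsplit(answer, "_", fixed = TRUE)
  # For each answer, mark which of the four options appear among its parts
  flags <- t(vapply(parts, function(p) as.integer(lookup %in% p),
                    integer(length(lookup))))
  colnames(flags) <- names(lookup)
  flags
}

to_flags(c("undergraduateeducation_employment_volunteeractivityoroutreach",
           "graduateorprofessionalschool"))
```

Something like cbind(dat[1:3], to_flags(dat$answer)) would then attach the indicator columns to the identifier columns. The flags are integers here, so the result would match res in value but would need converting to double before identical() could return TRUE.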