Melting columns by pattern matching
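I have a very wide data frame that contains standard demographic characteristics (age, sex, race, education, income, etc.). It also contains respondents' answers to questions, each of which is prefixed in one of four ways ("cb", "lb", "lw", or "cw").
The data frame is currently in wide format, with each row holding a single respondent's answers. I want to convert it to long format, but I can't find a straightforward solution using the reshape2 library.
I want to keep all the demographic characteristics as their own columns, but collapse the question, answer, confidence, and score columns into their own melted columns. Here is a small example of what I'm working with: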
string <- "
response_id,age,sex,race_1,race_2,race_3,cb_1,cb_1_conf,cb_1_ans,cb_1_score,lb_1,lb_1_conf,lb_1_ans,lb_1_score
11,25,M,white,NA,NA,Astrophysicist,9,Dog,0,Jackson,8,Jackson,1
22,27,F,NA,black,asian,Monkey,8,Dog,0,Jackson,7,Jackson,1"
x <- read.csv(con <- textConnection(string), header=TRUE)
It looks like this:
> x
  response_id age sex race_1 race_2 race_3           cb_1 cb_1_conf cb_1_ans cb_1_score    lb_1 lb_1_conf lb_1_ans lb_1_score
1          11  25   M  white   <NA>   <NA> Astrophysicist         9      Dog          0 Jackson         8  Jackson          1
2          22  27   F   <NA>  black  asian         Monkey         8      Dog          0 Jackson         7  Jackson          1
I'd like to convert it to this form:
string_2 <- "
response_id,age,sex,race,question,response,confidence,correct_answer,score
11,25,M,white,cb_1,Astrophysicist,9,Dog,0
11,25,M,white,lb_1,Jackson,8,Jackson,1
22,27,F,black/asian,cb_1,Monkey,8,Dog,0
22,27,F,black/asian,lb_1,Jackson,7,Jackson,1
"
x_2 <- read.csv(con <- textConnection(string_2), header=TRUE)
> x_2
  response_id age sex        race question       response confidence correct_answer score
1          11  25   M       white     cb_1 Astrophysicist          9            Dog     0
2          11  25   M       white     lb_1        Jackson          8        Jackson     1
3          22  27   F black/asian     cb_1         Monkey          8            Dog     0
4          22  27   F black/asian     lb_1        Jackson          7        Jackson     1
I tried subsetting the df to keep only the columns prefixed with cb, lb, cw, or lw, and then ran:
melt(subset, id=c("ResponseID"),
     measure.vars=grep("^(CB|LB|LW|CW)", colnames(subset)))
But this doesn't let me flexibly melt the _conf, _ans, and _score columns into columns of their own.
I had to modify Maurits' answer a bit to work better for my case. Here is my solution:
library(tidyverse)
df_test <- df_ans %>%
  unite(race, contains("race"), sep = "/") %>%                            # combine race_1,2,3
  mutate(race = str_replace_all(race, "(/NA|NA/)", "")) %>%               # remove NA from race
  select_all( ~ gsub("(^[A-Z][A-Z]_\\d+$)", "\\1_response", .)) %>%       # add "_response" to Q
  gather(key, val, -(1:24)) %>%                                           # wide to long
  separate(key, c("q1", "q2", "item")) %>%                                # split into Q + item
  unite(question, q1, q2, sep = "_") %>%                                  # [continued]
  mutate(item = gsub("_", "", item)) %>%                                  # [continued]
  spread(item, val) %>%                                                   # long to wide
  rename(answer = ans, confidence = con)                                  # rename columns
Here is a tidyverse solution:
library(tidyverse)
x %>%
  unite(race, contains("race"), sep = "/") %>%                 # combine race_1,2,3
  mutate(race = str_replace_all(race, "(/NA|NA/)", "")) %>%    # remove NA from race
  select_all( ~ gsub("^(\\w+_\\d)$", "\\1_response", .)) %>%   # add "_response" to Q
  gather(key, val, -(1:4)) %>%                                 # wide to long
  separate(key, c("q1", "q2", "item")) %>%                     # split into Q + item
  unite(question, q1, q2, sep = "_") %>%                       # [continued]
  mutate(item = gsub("_", "", item)) %>%                       # [continued]
  spread(item, val) %>%                                        # long to wide
  rename(answer = ans, confidence = conf)                      # rename columns
#   response_id age sex        race question  answer confidence       response score
# 1          11  25   M       white     cb_1     Dog          9 Astrophysicist     0
# 2          11  25   M       white     lb_1 Jackson          8        Jackson     1
# 3          22  27   F black/asian     cb_1     Dog          8         Monkey     0
# 4          22  27   F black/asian     lb_1 Jackson          7        Jackson     1
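To make the renaming and splitting steps easier to follow, here is a minimal sketch that applies them to the column names alone. The keys vector is typed by hand from the example data and is only an illustration; it is not part of the original pipeline.

```r
library(tidyr)

# Question-related column names for one question, copied from the example
keys <- c("cb_1", "cb_1_conf", "cb_1_ans", "cb_1_score")

# Append "_response" only to bare question names such as "cb_1";
# names that already carry a suffix do not match the anchored pattern
keys <- gsub("^(\\w+_\\d)$", "\\1_response", keys)

# By default separate() splits on runs of non-alphanumeric characters,
# so each name becomes prefix, question number, and item
parts <- separate(data.frame(key = keys), key, c("q1", "q2", "item"))
print(parts)
```

After this step every question column has exactly three parts, which is what lets unite() rebuild the question name and spread() turn the item values into columns.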
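Explanation:
- We create a united race column from the entries in race_1, race_2, race_3, whilst removing NAs.
- The rest is just gathering, spreading, and separating entries to split out question, answer, confidence, and response.
- I assume here that all questions are of the form \w+_\d (e.g. cb_1, lb_1); adjust as necessary.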
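As an alternative sketch (not the method used in the answer above), tidyr 1.0.0 and later provide pivot_longer(), whose ".value" sentinel can replace the gather/separate/spread round trip with a single call. The code below assumes the example x from the question, question names made of lowercase letters, an underscore, and digits, and it renames ans to correct_answer to match the desired output; the patterns would need adjusting for real data.

```r
library(dplyr)
library(tidyr)

# Example data from the question
x <- read.csv(text = "response_id,age,sex,race_1,race_2,race_3,cb_1,cb_1_conf,cb_1_ans,cb_1_score,lb_1,lb_1_conf,lb_1_ans,lb_1_score
11,25,M,white,NA,NA,Astrophysicist,9,Dog,0,Jackson,8,Jackson,1
22,27,F,NA,black,asian,Monkey,8,Dog,0,Jackson,7,Jackson,1")

long <- x %>%
  # Combine race_1..race_3 into one column, dropping missing entries
  unite(race, starts_with("race_"), sep = "/", na.rm = TRUE) %>%
  # Give bare question columns (cb_1, lb_1) an explicit "_response" suffix
  rename_with(~ paste0(.x, "_response"), matches("^[a-z]+_[0-9]+$")) %>%
  # Split each name into question + item; ".value" turns items into columns
  pivot_longer(
    cols = matches("^(cb|lb|lw|cw)_[0-9]+_"),
    names_to = c("question", ".value"),
    names_pattern = "^([a-z]+_[0-9]+)_(.+)$"
  ) %>%
  rename(confidence = conf, correct_answer = ans)

print(long)
```

One difference from the gather()-based pipeline: confidence and score stay integers here, because the values are never stacked into a single character column.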