Recoding Similar Factor Levels Across Multiple Data Frames Using Purrr and Dplyr
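Below are two simple data frames. I would like to recode (collapse) the Sat1 and Sat2 columns so that all degrees of satisfaction are coded simply as Satisfied and all degrees of dissatisfaction are coded as Dissatisfied. Neutral stays Neutral. The factors will therefore have three levels: Satisfied, Dissatisfied, and Neutral.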
I would normally do this by binding the data frames and using lapply together with recode from the car package, for example:
DF1[2:3] <- lapply(DF1[2:3], recode, c('"Somewhat Satisfied"= "Satisfied","Satisfied"="Satisfied","Extremely Dissatisfied"="Dissatisfied"........etc, etc
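I would like to do this with the map functions from purrr, specifically at_map (to keep the data frame; I am new to purrr, so feel free to suggest other versions of map), together with dplyr, tidyr, stringr and ggplot2, so that everything can be easily pipelined.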
The example at the link below does what I am trying to accomplish, except that I cannot get it to work for recoding.
http://www.r-bloggers.com/using-purrr-with-dplyr/
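I would like to use at_map or a similar map function so that I can keep the original Sat1 and Sat2 columns, with the recoded columns added to the data frame and renamed. It would be great if this step could also be wrapped in a function.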
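In reality I will have many data frames, so I only want to define the recoding of the factor levels once and then use a function from purrr to apply the change to all of the data frames with the least amount of code.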
Names<-c("James","Chris","Jessica","Tomoki","Anna","Gerald")
Sat1<-c("Satisfied","Very Satisfied","Dissatisfied","Somewhat Satisfied","Dissatisfied","Neutral")
Sat2<-c("Very Dissatisfied","Somewhat Satisfied","Neutral","Neutral","Satisfied","Satisfied")
Program<-c("A","B","A","C","B","D")
Pets<-c("Snake","Dog","Dog","Dog","Cat","None")
DF1<-data.frame(Names,Sat1,Sat2,Program,Pets)
Names<-c("Tim","John","Amy","Alberto","Desrahi","Francesca")
Sat1<-c("Extremely Satisfied","Satisfied","Satisfed","Somewhat Dissatisfied","Dissatisfied","Satisfied")
Sat2<-c("Dissatisfied","Somewhat Dissatisfied","Neutral","Extremely Dissatisfied","Somewhat Satisfied","Somewhat Dissatisfied")
Program<-c("A","B","A","C","B","D")
DF2<-data.frame(Names,Sat1,Sat2,Program)
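I have done large recodings like this with a join. In this case, I think converting to a long data frame makes the problem easier to think about.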
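library(tidyr)
library(dplyr)
# Reshape to long format: one row per person per Sat column
mdf <- DF1 %>%
  gather(var, value, starts_with("Sat"))
# Lookup table mapping an original level to its recoded value
recode_df <- data_frame(value = c("Extremely Satisfied", "Satisfied", "Somewhat Dissatisfied", "Dissatisfied"),
                        recode = 1:4)
# Attach the recoded values, then reshape back to wide format
mdf <- left_join(mdf, recode_df)
mdf %>% spread(var, recode)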
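To show the join idea end to end with the three target labels, here is a minimal sketch. It assumes tidyr 1.0 or later (for pivot_longer and pivot_wider, used here in place of gather and spread); the lookup table and the trimmed-down DF1 are hypothetical and only cover the levels that appear in this small example.

```r
library(dplyr)
library(tidyr)

# Trimmed-down version of DF1 so the example is self-contained
DF1 <- data.frame(
  Names = c("James", "Chris", "Jessica"),
  Sat1  = c("Satisfied", "Very Satisfied", "Dissatisfied"),
  Sat2  = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per original level
lookup <- data.frame(
  value  = c("Very Satisfied", "Satisfied", "Somewhat Satisfied",
             "Neutral",
             "Somewhat Dissatisfied", "Dissatisfied", "Very Dissatisfied"),
  recode = c("Satisfied", "Satisfied", "Satisfied",
             "Neutral",
             "Dissatisfied", "Dissatisfied", "Dissatisfied"),
  stringsAsFactors = FALSE
)

# Long format, join on the original level, then back to wide
wide <- DF1 %>%
  pivot_longer(starts_with("Sat"), names_to = "var", values_to = "value") %>%
  left_join(lookup, by = "value") %>%
  select(-value) %>%
  pivot_wider(names_from = var, values_from = recode)
```

Dropping the original value column before widening keeps one row per person. Any level missing from the lookup table would come through as NA, which makes gaps in the table easy to spot.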
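One way to do this is to let mutate_each do the work, combined with one of the map functions to loop through a list of data.frames. Using mutate_each, or its equivalent in dplyr_0.4.3.9001, lets you rename the new columns.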
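In this case you can use string manipulation instead of recoding. I believe you want to extract Satisfied, Dissatisfied, or Neutral from the existing strings. You can do this with sub and a regular expression. For example,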
sub(".*(Satisfied|Dissatisfied|Neutral).*$", "\\1", DF2$Sat2)
"Dissatisfied" "Dissatisfied" "Neutral" "Dissatisfied" "Satisfied" "Dissatisfied"
The stringr package has a nice function for extracting specific strings, str_extract.
library(stringr)
str_extract(DF2$Sat2, "Satisfied|Neutral|Dissatisfied")
"Dissatisfied" "Dissatisfied" "Neutral" "Dissatisfied" "Satisfied" "Dissatisfied"
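One thing to watch for with the extraction approach: a value that contains none of the three target words comes back as NA. DF2$Sat1 includes the misspelling "Satisfed", so it is worth checking for unmatched values. A small sketch, using a shortened copy of that column (the variable names here are only for illustration):

```r
library(stringr)

# Shortened copy of DF2$Sat1, including the misspelled "Satisfed"
sat1 <- c("Extremely Satisfied", "Satisfied", "Satisfed", "Somewhat Dissatisfied")

extracted <- str_extract(sat1, "Satisfied|Neutral|Dissatisfied")

# Original values that matched none of the three target words
unmatched <- sat1[is.na(extracted)]
unmatched
```

Listing the unmatched values before recoding gives you a chance to fix typos in the source data or to extend the pattern.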
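You can use either of these functions on multiple columns within mutate_each. The name you give the function in funs is appended to the new column names. I used recode. For one of your datasets: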
DF1 %>%
  mutate_each(funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")),
              starts_with("Sat"))
    Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
3 Jessica       Dissatisfied            Neutral       A   Dog Dissatisfied      Neutral
4  Tomoki Somewhat Satisfied            Neutral       C   Dog    Satisfied      Neutral
5    Anna       Dissatisfied          Satisfied       B   Cat Dissatisfied    Satisfied
6  Gerald            Neutral          Satisfied       D  None      Neutral    Satisfied
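mutate_each and funs were deprecated in later dplyr releases. If you are on dplyr 1.0.0 or newer, the same result can be sketched with mutate and across, where the .names argument plays the role of the name given inside funs. This is an assumed modern equivalent rather than part of the approach above, shown on a trimmed-down DF1:

```r
library(dplyr)
library(stringr)

# Trimmed-down version of DF1 so the example is self-contained
DF1 <- data.frame(
  Names   = c("James", "Chris", "Jessica"),
  Sat1    = c("Satisfied", "Very Satisfied", "Dissatisfied"),
  Sat2    = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral"),
  Program = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Keep Sat1/Sat2 and add Sat1_recode/Sat2_recode next to them
DF1_recoded <- DF1 %>%
  mutate(across(
    starts_with("Sat"),
    ~ str_extract(.x, "Satisfied|Neutral|Dissatisfied"),
    .names = "{.col}_recode"
  ))
```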
To loop through multiple datasets stored in a list, you can use the map function from purrr to apply a function to each element of the list.
list(DF1, DF2) %>%
  map(~mutate_each(.x,
                   funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")),
                   starts_with("Sat")))
[[1]]
    Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
...
[[2]]
      Names                  Sat1                   Sat2 Program  Sat1_recode  Sat2_recode
1       Tim   Extremely Satisfied           Dissatisfied       A    Satisfied Dissatisfied
2      John             Satisfied  Somewhat Dissatisfied       B    Satisfied Dissatisfied
...
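Since you asked for the recoding step to live in a function, here is a minimal sketch that wraps it once and maps it over a named list. The function name recode_sat, the list names, and the level order are hypothetical; it assumes dplyr 1.0.0 or newer for across, and it converts the result to a factor with the three levels you described.

```r
library(dplyr)
library(purrr)
library(stringr)

# Trimmed-down versions of DF1 and DF2 so the example is self-contained
DF1 <- data.frame(
  Names = c("James", "Chris"),
  Sat1  = c("Satisfied", "Very Satisfied"),
  Sat2  = c("Very Dissatisfied", "Somewhat Satisfied"),
  Pets  = c("Snake", "Dog"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Names = c("Tim", "John"),
  Sat1  = c("Extremely Satisfied", "Satisfied"),
  Sat2  = c("Dissatisfied", "Somewhat Dissatisfied"),
  stringsAsFactors = FALSE
)

# Assumed level order for the collapsed factor
sat_levels <- c("Dissatisfied", "Neutral", "Satisfied")

# Hypothetical helper: adds a *_recode factor column for every Sat column
recode_sat <- function(df) {
  df %>%
    mutate(across(
      starts_with("Sat"),
      ~ factor(str_extract(.x, "Satisfied|Neutral|Dissatisfied"),
               levels = sat_levels),
      .names = "{.col}_recode"
    ))
}

# Define the recoding once, apply it to every data frame in the list
recoded <- map(list(first = DF1, second = DF2), recode_sat)
```

Naming the list elements means the result keeps those names, so recoded$second is the recoded DF2.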
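Using map_df instead binds all the elements of the list into a single data.frame, which may or may not be what you want. Use the .id argument to add a column identifying the original dataset each row came from.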
list(DF1, DF2) %>%
  map_df(~mutate_each(.x,
                      funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")),
                      starts_with("Sat")), .id = "Group")
   Group     Names                  Sat1                   Sat2 Program  Pets  Sat1_recode
1      1     James             Satisfied      Very Dissatisfied       A Snake    Satisfied
2      1     Chris        Very Satisfied     Somewhat Satisfied       B   Dog    Satisfied
3      1   Jessica          Dissatisfied                Neutral       A   Dog Dissatisfied
4      1    Tomoki    Somewhat Satisfied                Neutral       C   Dog    Satisfied
5      1      Anna          Dissatisfied              Satisfied       B   Cat Dissatisfied
6      1    Gerald               Neutral              Satisfied       D  None      Neutral
7      2       Tim   Extremely Satisfied           Dissatisfied       A  <NA>    Satisfied
8      2      John             Satisfied  Somewhat Dissatisfied       B  <NA>    Satisfied
...
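If the list is named, the identifier column holds those names rather than positions 1, 2, and so on. The sketch below illustrates this with map followed by dplyr's bind_rows, which also takes an .id argument. The list names and the small data frames are hypothetical.

```r
library(dplyr)
library(purrr)
library(stringr)

# Two small data frames; the second has no Pets column
DF1 <- data.frame(
  Names = c("James", "Chris"),
  Sat1  = c("Satisfied", "Very Satisfied"),
  Pets  = c("Snake", "Dog"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Names = c("Tim", "John"),
  Sat1  = c("Extremely Satisfied", "Satisfied"),
  stringsAsFactors = FALSE
)

# Recode each data frame, then stack them with a Group column
combined <- list(first = DF1, second = DF2) %>%
  map(~ mutate(.x, Sat1_recode = str_extract(Sat1, "Satisfied|Neutral|Dissatisfied"))) %>%
  bind_rows(.id = "Group")
```

Columns that exist in only some of the data frames, such as Pets here, are filled with NA for the others, which matches the <NA> entries in the output above.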