R tidyr: spread columns across all categories of a given variable
I am working with a dataset that looks like this.
#Dataframe
df=data.frame(Type=c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
Q1=c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
Q2=c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))
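Context
The data frame holds the results of a survey questionnaire.
The first column, Type, is the type of employee who answered the questionnaire, where 1 = 'Worker', 2 = 'Factory Lead', 3 = 'Administrative Staff', 4 = 'Middle Management' and 5 = 'Executive'.
The second and third columns (Q1 and Q2) are the questions, scored on a scale from 1 = 'Strongly Agree' to 10 = 'Strongly Disagree'.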
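What I am trying to achieve
I want to count the total number of responses for each Type, broken down by score.
I created bins for the scores:
1) Low agreement - scores from 0 to 4
2) Medium agreement - scores of 5 or 6
3) High agreement - scores of 7 or 8
4) Very High agreement - scores of 9 or 10
So I want to count the number of responses in each score bin for each type of worker.
My attempt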
library(dplyr)
library(tidyr)
result=df %>%
gather(Item,response,-1) %>%
filter(!is.na(response)) %>%
group_by(Type,Item) %>%
filter(!is.na(Type)) %>%
summarise(Low=sum(response %in% c(0,1,2,3,4)),
Medium=sum(response %in% c(5,6)),
High=sum(response %in% c(7,8)),
VHigh=sum(response %in% c(9,10)) %>%
spread(Type,-Item)
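My logic is to use the tidyr library: first gather the scores and count the total responses, then spread the columns so that I get subtotals broken down by worker and score category.
For example, for Q1 there would be total-response columns for Low-Worker, Medium-Worker, High-Worker, Very High-Worker, Low-Factory Lead, then Medium-Factory Lead, and so on for every combination of employee type and score category.
Clearly something is wrong with my code.
Desired output
A data frame with two rows (Q1 and Q2) and 20 columns (one for each employee-score combination).
Any help would be greatly appreciated.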
Like this?
df%>%
mutate(Type_real=case_when(
Type==1~"Worker",
Type==2~"Factory Lead",
Type==3~"Administrative Staff",
Type==4~"Middle Management",
Type==5~"Executive"),
Score=case_when(
Q1<5~"Low",
Q1>=5 & Q1<=6~"Medium",
Q1>=7 & Q1<=8~"High",
Q1>8~"Very High"))%>%
na.omit()%>%
group_by(Type_real,Score)%>%
summarise(count=n())
# A tibble: 11 x 3
# Groups: Type_real [?]
Type_real Score count
<chr> <chr> <int>
1 Administrative Staff Low 3
2 Administrative Staff Very High 2
3 Executive High 1
4 Factory Lead High 1
5 Factory Lead Low 2
6 Factory Lead Very High 1
7 Middle Management Medium 2
8 Middle Management Very High 1
9 Worker High 1
10 Worker Low 3
11 Worker Medium 1
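The binning step in the answer above can also be written with base R's cut(), which maps numeric scores to labelled intervals in one call. This is only a sketch under the bin boundaries stated in the question; the names bin_labels, q1_bin and bin_counts are made up for illustration.

```r
# Q1 scores copied from the question's data frame
q1 <- c(1, 2, 6, 8, 9, 10, 2, 6, 7, 4, 9, 9, 1, 2, NA, 4, 3, 8, 7, 6, 4)

# Hypothetical helper vector: bin labels in their intended order
bin_labels <- c("Low", "Medium", "High", "Very High")

# Intervals are closed on the right by default:
# (-Inf, 4] = Low, (4, 6] = Medium, (6, 8] = High, (8, 10] = Very High
q1_bin <- cut(q1, breaks = c(-Inf, 4, 6, 8, 10), labels = bin_labels)

# table() ignores NA by default and keeps the factor level order
bin_counts <- table(q1_bin)
print(bin_counts)
```

Because cut() returns a factor whose levels follow bin_labels, later tabulation or spreading keeps the Low, Medium, High, Very High order. Scores above 10 would fall outside the breaks and become NA.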
Create the scores data frame
library(tidyr)
library(dplyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(type = c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
             q1 = c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
             q2 = c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))
# Lookup table mapping each score (0 to 10) to its bin
scores <- tibble(score = 0:10,
                 scorebin = c(rep("Low", 5),
                              rep("Medium", 2),
                              rep("High", 2),
                              rep("Very High", 2)))
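Gather the data into long format. Join the scores data frame to add the scorebin column. Group by item, type and scorebin, and count the number of answers in each group.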
df2 <- df %>%
gather(item, score, -type) %>%
left_join(scores, by = "score") %>%
group_by(item, type, scorebin) %>%
summarise(n = n()) %>%
unite(employeescore, type, scorebin)
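Change employeescore to a factor with ordered levels, so that the columns do not appear in alphabetical order (High, Low, Medium) but in the correct order (Low, Medium, High).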
employeescoreorder <- scores %>%
distinct(scorebin) %>%
merge(distinct(df, type)) %>%
unite(employeescore, type, scorebin)
df2$employeescore <- factor(df2$employeescore,
levels = employeescoreorder$employeescore)
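To see why the factor levels matter here, the following small sketch compares the default (alphabetical) level order with an explicit one. The vectors x and bins are made-up illustration data, not part of the answer.

```r
# Hypothetical bin labels in the order we want the columns to appear
bins <- c("Low", "Medium", "High", "Very High")

# A few example bin values in arbitrary order
x <- c("High", "Low", "Very High", "Medium", "Low")

# Default: factor() sorts the unique values, so levels come out alphabetically
alphabetical <- levels(factor(x))

# Explicit levels: the order given in `levels` is preserved
f <- factor(x, levels = bins)
ordered_levels <- levels(f)

print(alphabetical)
print(ordered_levels)

# Anything that tabulates by level follows the explicit order
print(table(f))
```

spread() lays out the new columns in the level order of the key column when it is a factor, which is why the answer converts employeescore before spreading.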
Spread the data frame into wide format to get the 20 columns.
df2 %>%
spread(employeescore, n)
# A tibble: 2 x 20
# Groups: item [2]
item `1_Low` `1_Medium` `1_High` `2_Low` `2_Medium` `2_High` `2_Very High` `4_Low`
* <chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 q1 3 1 1 2 NA 1 1 NA
2 q2 1 1 3 2 2 1 NA 2
# ... with 11 more variables: `4_Medium` <int>, `4_High` <int>, `4_Very High` <int>,
# `5_High` <int>, `5_Very High` <int>, `3_Low` <int>, `3_Medium` <int>, `3_High` <int>,
# `3_Very High` <int>, NA_Low <int>, `<NA>` <int>
Another solution, similar to Paul Rougieux's but without a join:
df %>%
mutate(Type = case_when(Type == 1 ~ "Worker",
Type == 2 ~ "Factory Lead",
Type == 3 ~ "Administrative Staff",
Type == 4 ~ "Middle Management",
Type == 5 ~ "Executive")) %>%
mutate_at(c("Q1", "Q2"),
funs(case_when(. %in% 1:4 ~ "Low",
. %in% 5:6 ~ "Medium",
. %in% 7:8 ~ "High",
. %in% 9:10 ~ "Very High"))) %>%
gather(Questions, Score, Q1:Q2) %>%
unite(Type_Score, Type, Score, sep = "_") %>%
count(Questions, Type_Score) %>%
spread(Type_Score, n)
# A tibble: 2 x 21
# Questions `Administrative~ `Administrative~ `Administrative~ `Administrative~ Executive_High Executive_NA `Executive_Very~ `Factory Lead_H~
# <chr> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 Q1 NA 3 NA 2 2 NA NA 1
# 2 Q2 1 3 1 NA NA 1 1 1
# ... with 12 more variables: `Factory Lead_Low` <int>, `Factory Lead_Medium` <int>, `Factory Lead_NA` <int>, `Factory Lead_Very High` <int>,
# `Middle Management_High` <int>, `Middle Management_Low` <int>, `Middle Management_Medium` <int>, `Middle Management_Very High` <int>,
# NA_Low <int>, Worker_High <int>, Worker_Low <int>, Worker_Medium <int>
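For readers on a recent tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(). The sketch below is one possible way to reach the requested 2 x 20 layout with those functions. It is not taken from any of the answers above: it assumes tidyr 1.0.0 or later, drops rows with a missing Type or score, and uses complete() to fill in employee-score combinations that never occur with 0. The names type_labels, bin_labels and wide are invented for this example.

```r
library(dplyr)
library(tidyr)

# Data from the question
df <- data.frame(
  Type = c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
  Q1   = c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
  Q2   = c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1)
)

# Hypothetical label vectors, in the order the columns should appear
type_labels <- c("Worker", "Factory Lead", "Administrative Staff",
                 "Middle Management", "Executive")
bin_labels  <- c("Low", "Medium", "High", "Very High")

wide <- df %>%
  filter(!is.na(Type)) %>%                       # drop respondents with no Type
  pivot_longer(c(Q1, Q2), names_to = "Item", values_to = "Score") %>%
  filter(!is.na(Score)) %>%                      # drop unanswered questions
  mutate(
    Type = factor(Type, levels = 1:5, labels = type_labels),
    Bin  = cut(Score, breaks = c(-Inf, 4, 6, 8, 10), labels = bin_labels)
  ) %>%
  count(Item, Type, Bin) %>%                     # one row per observed combination
  complete(Item, Type, Bin, fill = list(n = 0L)) %>%  # add the unobserved ones as 0
  pivot_wider(names_from = c(Type, Bin), values_from = n, names_sep = "_")

print(dim(wide))
```

Because Type and Bin are factors, complete() expands over all of their levels, so all 20 employee-score columns exist even where no response fell into a bin. Those cells hold 0 rather than the NA shown in the spread() outputs above.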