Tidying a dataset by gathering multiple columns?
I would like to tidy a dataset by reshaping data like this:
age gender education previous_comp_exp tutorial_time qID.1 time_taken.1 qID.2 time_taken.2
18 Male Undergraduate casual gamer 62.17926 sor9 39.61206 sor8 19.4892
24 Male Undergraduate casual gamer 85.01288 sor9 50.92343 sor8 16.15616
into this:
age gender education previous_comp_exp tutorial_time qID time_taken
18 Male Undergraduate casual gamer 62.17926 sor9 39.61206
18 Male Undergraduate casual gamer 62.17926 sor8 19.4892
24 Male Undergraduate casual gamer 85.01288 sor9 50.92343
24 Male Undergraduate casual gamer 85.01288 sor8 16.15616
I have tried gather(), but I could only get it to work with one column at a time, and I keep getting this warning:
Warning message:
attributes are not identical across measure variables; they will be dropped
Any ideas?
With melt from data.table (see ?patterns):
library(data.table)
melt(setDT(df), measure = patterns("^qID", "^time_taken"),
value.name = c("qID", "time_taken"))
Result:
age gender education previous_comp_exp tutorial_time variable qID time_taken
1: 18 Male Undergraduate casual_gamer 62.17926 1 sor9 39.61206
2: 24 Male Undergraduate casual_gamer 85.01288 1 sor9 50.92343
3: 18 Male Undergraduate casual_gamer 62.17926 2 sor8 19.48920
4: 24 Male Undergraduate casual_gamer 85.01288 2 sor8 16.15616
Or with tidyr:
library(dplyr)
library(tidyr)
df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  # strip the ".1" / ".2" suffix (backslashes must be doubled in R strings)
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)
Result:
# A tibble: 4 x 7
age gender education previous_comp_exp tutorial_time qID time_taken
<int> <fctr> <fctr> <fctr> <dbl> <chr> <dbl>
1 18 Male Undergraduate casual_gamer 62.17926 sor9 39.61206
2 18 Male Undergraduate casual_gamer 62.17926 sor8 19.48920
3 24 Male Undergraduate casual_gamer 85.01288 sor9 50.92343
4 24 Male Undergraduate casual_gamer 85.01288 sor8 16.15616
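Note:
For the tidyr approach, convert = TRUE is used to convert time_taken back to numeric, because it was coerced to character when it was gathered into the same column as the qID values.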
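The character coercion described in the note comes from stacking qID and time_taken into a single value column. As a hedged alternative that is not part of the answer above, tidyr 1.0.0 and later provide pivot_longer(), whose ".value" sentinel keeps each column family in its own output column, so no conversion step is needed. The sketch below assumes a reduced copy of the example data under the hypothetical name wide; the name set for the suffix column is also illustrative.

```r
library(tidyr)

# Reduced copy of the example data; `wide` is an illustrative name
wide <- data.frame(
  age           = c(18L, 24L),
  tutorial_time = c(62.17926, 85.01288),
  qID.1         = c("sor9", "sor9"),
  time_taken.1  = c(39.61206, 50.92343),
  qID.2         = c("sor8", "sor8"),
  time_taken.2  = c(19.4892, 16.15616),
  stringsAsFactors = FALSE
)

# Split each column name at the dot: the part before it names the output
# column (".value"), the part after it goes into the helper column `set`
long <- pivot_longer(
  wide,
  cols      = matches("^(qID|time_taken)\\."),
  names_to  = c(".value", "set"),
  names_sep = "\\."
)

long$set <- NULL   # drop the suffix column if it is not needed
print(long)
```

Because qID and time_taken never share a column here, time_taken stays numeric throughout.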
Data:
df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L,
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L,
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926,
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"),
time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L,
1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892,
16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp",
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))
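In base R, you can use the powerful reshape to convert the data from wide to long format in a single statement (dx here stands for the wide data frame shown in the question):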
reshape(dx,direction="long",
varying=list(grep("qID",colnames(dx)),
grep("time_taken",colnames(dx))),
v.names=c("qID","time_taken"))
age gender education previous_comp_exp tutorial_time time qID time_taken id
1.1 18 Male Undergraduate casual_gamer 62.17926 1 sor9 39.61206 1
2.1 24 Male Undergraduate casual_gamer 85.01288 1 sor9 50.92343 2
1.2 18 Male Undergraduate casual_gamer 62.17926 2 sor8 19.48920 1
2.2 24 Male Undergraduate casual_gamer 85.01288 2 sor8 16.15616 2