Convert Wide to Long with frequency column
I am trying to convert my data.frame from a wide table to a long table, using its frequency column.
data("UCBAdmissions")
ucb_admit <- as.data.frame(UCBAdmissions)
ucb_admit
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
...
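I would like to gather this data (tidyr package, similar to melt from reshape), but use Freq to specify how many times each row should be repeated.
My target data would therefore look like: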
Admit Gender Dept
1 Admitted Male A
2 Admitted Male A
3 Admitted Male A
4 Admitted Male A
5 Admitted Male A
6 Admitted Male A
...
4523 Rejected Female F
4524 Rejected Female F
4525 Rejected Female F
4526 Rejected Female F
I would like to use tidyr::gather() to do this, but my results are incorrect because I am not sure whether, or how, to include the Freq column.
Thanks
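This doesn't look like a job for gather, because the data are aggregated rather than wide. You can "disaggregate" the data by indexing: repeat each row index Freq times and use the result to subset the data frame. Here are approaches using base R and dplyr.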
library(dplyr)
# Base R: repeat each row index Freq times, then drop the Freq column
ucb_admit_disagg = ucb_admit[rep(1:nrow(ucb_admit), ucb_admit$Freq),
                             -grep("Freq", names(ucb_admit))]
# dplyr: same idea, using slice() and select()
ucb_admit_disagg = ucb_admit %>%
  slice(rep(1:n(), Freq)) %>%
  select(-Freq)
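To see why repeating the row index works, here is a minimal sketch on a made-up three-row data frame (`toy` and its columns are hypothetical, not part of the question's data). It uses `seq_len()` instead of `1:nrow()`, which is only a defensive choice for the case of an empty data frame.

```r
# Hypothetical toy data: a label plus a count column
toy <- data.frame(grp = c("a", "b", "c"), Freq = c(2, 0, 3))

# rep() with a vector of counts repeats each index that many times;
# a count of 0 drops the row entirely
idx <- rep(seq_len(nrow(toy)), toy$Freq)
idx
# 1 1 3 3 3

# Subsetting by the repeated indices expands the table;
# drop = FALSE keeps a data frame even though one column remains
toy_long <- toy[idx, "grp", drop = FALSE]
nrow(toy_long)  # 5, which is sum(toy$Freq)
```

The expanded table always has `sum(Freq)` rows, which is a convenient check on the real data too.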
Here is a portion of the data frame. I've added ellipses to the output to mark breaks in the row sequence.
ucb_admit_disagg[c(1:6, 510:514, 4523:4526), ]
Admit Gender Dept
1 Admitted Male A
1.1 Admitted Male A
1.2 Admitted Male A
1.3 Admitted Male A
1.4 Admitted Male A
1.5 Admitted Male A
...
1.509 Admitted Male A
1.510 Admitted Male A
1.511 Admitted Male A
2 Rejected Male A
2.1 Rejected Male A
...
24.313 Rejected Female F
24.314 Rejected Female F
24.315 Rejected Female F
24.316 Rejected Female F
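The row names in this output (1.1, 1.2, ...) come from base R indexing, which makes repeated row names unique by adding a suffix. If sequential numbering like the target in the question is wanted, one possible sketch is to reset the row names afterwards. The column selection by name below is a stylistic variation on the `grep()` call, not a change in behavior.

```r
ucb_admit <- as.data.frame(UCBAdmissions)

# Disaggregate with base R indexing, dropping the Freq column by name
ucb_admit_disagg <- ucb_admit[rep(seq_len(nrow(ucb_admit)), ucb_admit$Freq),
                              names(ucb_admit) != "Freq"]

# Resetting the row names replaces 1, 1.1, 1.2, ... with 1, 2, 3, ...
rownames(ucb_admit_disagg) <- NULL

head(ucb_admit_disagg)
nrow(ucb_admit_disagg)  # same as sum(ucb_admit$Freq)
```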
Here is a solution using dplyr, tidyr, and purrr.
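library(dplyr)
library(tidyr)
library(purrr)
ucb_admit2 <- ucb_admit %>%
  # Replace each count with the sequence 1:Freq, stored as a list column
  mutate(Freq = map2(1, Freq, `:`)) %>%
  # Name the column to unnest; newer tidyr versions warn on a bare unnest()
  unnest(Freq) %>%
  select(-Freq)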
Or use this similar approach, which needs only functions from dplyr and tidyr.
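ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  # Build the sequence 1..Freq for each row and store it as a list column
  mutate(Freq = list(seq(1, Freq))) %>%
  ungroup() %>%
  # Name the column to unnest; newer tidyr versions warn on a bare unnest()
  unnest(Freq) %>%
  select(-Freq)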
Both use the same strategy: create a list column, then unnest it.
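The two steps of that strategy can be made visible on a small made-up data frame (`toy` is hypothetical). This sketch assumes `lapply()` with `seq_len()` as a stand-in for the `map2()` and `seq()` calls; one difference is that `seq_len(0)` is empty, whereas `seq(1, 0)` and `1:0` count down and would yield two rows for a zero count.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a label plus a count column
toy <- data.frame(grp = c("a", "b"), Freq = c(2, 3))

# Step 1: turn each count into a list element holding 1..Freq
toy_nested <- toy %>%
  mutate(Freq = lapply(Freq, seq_len))

# Step 2: unnest the list column, giving one row per list element,
# then drop the helper column
toy_long <- toy_nested %>%
  unnest(Freq) %>%
  select(-Freq)

toy_long
```

After step 1 the data still has two rows; it is the unnest step that expands it to five.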
We can also consider using the separate_rows function from tidyr for this task.
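ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  # Encode 1..Freq as a comma-separated string, e.g. "1,2,3"
  mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
  ungroup() %>%
  # Split the string back into one row per value
  separate_rows(Freq) %>%
  select(-Freq)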
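A further option, not among the methods proposed or timed in this thread, is tidyr's `uncount()`, which exists for exactly this kind of count-to-rows expansion. The sketch below assumes a tidyr version recent enough to include it.

```r
library(tidyr)

ucb_admit <- as.data.frame(UCBAdmissions)

# uncount() repeats each row according to the weights column and,
# by default (.remove = TRUE), drops that column afterwards
ucb_admit_long <- uncount(ucb_admit, Freq)

nrow(ucb_admit_long)  # same as sum(ucb_admit$Freq)
```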
Benchmarking
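I compared the two methods proposed by eipi10 and the three methods I proposed, using microbenchmark as follows. The results show that the base R method is the fastest (median about 3.8 ms), followed by the dplyr rep-and-slice method (median about 8.0 ms). So I think there is no need to use tidyr or purrr for this problem unless there are other considerations, such as code readability.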
library(microbenchmark)
microbenchmark(m1 = (ucb_admit[rep(1:nrow(ucb_admit),
                                   ucb_admit$Freq),
                               -grep("Freq", names(ucb_admit))]),
               m2 = (ucb_admit %>%
                       slice(rep(1:n(), Freq)) %>%
                       select(-Freq)),
               m3 = (ucb_admit %>%
                       mutate(Freq = map2(1, Freq, `:`)) %>%
                       unnest() %>%
                       select(-Freq)),
               m4 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = list(seq(1, Freq))) %>%
                       ungroup() %>%
                       unnest() %>%
                       select(-Freq)),
               m5 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
                       ungroup() %>%
                       separate_rows(Freq) %>%
                       select(-Freq)))
Unit: milliseconds
expr min lq mean median uq max neval
m1 3.455026 3.585888 4.295322 3.845367 4.147506 8.60228 100
m2 6.888881 7.541269 8.849527 8.031040 9.428189 15.53991 100
m3 23.252458 24.959122 29.706875 27.414396 32.506805 61.00691 100
m4 20.033499 21.914645 25.888155 23.611688 27.310155 101.15088 100
m5 28.972557 31.127297 35.976468 32.652422 37.669135 64.43884 100
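Speed aside, it is worth confirming that a disaggregated result is correct. One way, sketched here in base R only, is a round trip: tabulating the long data should reproduce the original Freq column. The names `long` and `back` are illustrative.

```r
ucb_admit <- as.data.frame(UCBAdmissions)

# Disaggregate with the base R indexing approach
long <- ucb_admit[rep(seq_len(nrow(ucb_admit)), ucb_admit$Freq),
                  names(ucb_admit) != "Freq"]

# Re-aggregate: table() counts every Admit/Gender/Dept combination and
# as.data.frame() returns them in the same order as the original table
back <- as.data.frame(table(long))

# Every count should match the original Freq column
all(back$Freq == ucb_admit$Freq)
```

The same check can be applied to the output of any of the other methods by substituting it for `long`.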