Automatically order a wide data.table: dcast columns in a specific order, or setcolorder based on a pattern with numbers
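I have a data.table like this (the seq column in the printout below is the per-id counter that I add in the next step):
library(data.table)
id = c(rep(1,10), rep(2, 5), rep(3,12))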
th = c(rep(c(0,1),5), c(0, 1, 0, 1, 0), rep(c(1,0,1),4 ))
drugs = c(rep(c("A","B","C","D","E"),2), c("A", "B", "B", "B", "A"), rep(c("C","D","c"),4 ))
DT = data.table(id, th, drugs)
DT
id th drugs seq
1: 1 0 A 1
2: 1 1 B 2
3: 1 0 C 3
4: 1 1 D 4
5: 1 0 E 5
6: 1 1 A 6
7: 1 0 B 7
8: 1 1 C 8
9: 1 0 D 9
10: 1 1 E 10
11: 2 0 A 1
12: 2 1 B 2
13: 2 0 B 3
14: 2 1 B 4
15: 2 0 A 5
16: 3 1 C 1
17: 3 0 D 2
18: 3 1 c 3
19: 3 1 C 4
20: 3 0 D 5
21: 3 1 c 6
22: 3 1 C 7
23: 3 0 D 8
24: 3 1 c 9
25: 3 1 C 10
26: 3 0 D 11
27: 3 1 c 12
After creating a counter ("seq") by id, I cast all drugs into a single observation (one row) per id:
DT_wide = DT[, seq := seq(.N), by = .(id)][, dcast.data.table(.SD, id ~ paste0("rx", seq), value.var = c("th", "drugs"))]
This gives:
DT_wide
id th_rx1 th_rx10 th_rx11 th_rx12 th_rx2 th_rx3 th_rx4 th_rx5 th_rx6 th_rx7 th_rx8 th_rx9 drugs_rx1 drugs_rx10 drugs_rx11 drugs_rx12 drugs_rx2 drugs_rx3 drugs_rx4 drugs_rx5 drugs_rx6 drugs_rx7 drugs_rx8 drugs_rx9
1: 1 0 1 NA NA 1 0 1 0 1 0 1 0 A E <NA> <NA> B C D E A B C D
2: 2 0 NA NA NA 1 0 1 0 NA NA NA NA A <NA> <NA> <NA> B B B A <NA> <NA> <NA> <NA>
3: 3 1 1 0 1 0 1 1 0 1 1 0 1 C C D c D c C D c C D c
The desired output is DT_wide with its columns ordered this way:
"id", "th_rx1","drugs_rx1", "th_rx2", "drugs_rx2",...,"th_rx12", "drugs_rx12"
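Is there a better way to do the dcast, or does it need a post-dcast setcolorder() with a specific regular expression?
I tried setcolorder without success because of the _rx1 vs. _rx10 problem (the pattern "_rx1" also matches _rx10, _rx11 and _rx12):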
setcolorder(DT_wide, c("id", grep("_rx1", colnames(DT_wide), value = TRUE)))
Thanks in advance for your help! :D
Try the following workaround:
cols <- c("th", "drugs")
# a shorter way of achieving your dcast
# numbering comes from rowid()
DT.wide <- dcast(DT, id ~ paste0("rx", rowid(id)), value.var = cols)
# new order of colnames
new_colorder <- CJ(unique(rowid(DT$id)), cols, sorted = FALSE)[, paste(cols, V1, sep = "_rx")]
# reorder the relevant columns
setcolorder(DT.wide, c(setdiff(names(DT.wide), new_colorder), new_colorder))
# id th_rx1 drugs_rx1 th_rx2 drugs_rx2 th_rx3 drugs_rx3 th_rx4 drugs_rx4 th_rx5 drugs_rx5 th_rx6
# 1: 1 0 A 1 B 0 C 1 D 0 E 1
# 2: 2 0 A 1 B 0 B 1 B 0 A NA
# 3: 3 1 C 0 D 1 c 1 C 0 D 1
# drugs_rx6 th_rx7 drugs_rx7 th_rx8 drugs_rx8 th_rx9 drugs_rx9 th_rx10 drugs_rx10 th_rx11 drugs_rx11
# 1: A 0 B 1 C 0 D 1 E NA <NA>
# 2: <NA> NA <NA> NA <NA> NA <NA> NA <NA> NA <NA>
# 3: c 1 C 0 D 1 c 1 C 0 D
# th_rx12 drugs_rx12
# 1: NA <NA>
# 2: NA <NA>
# 3: 1 c
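Since the question asks about a regex-based setcolorder(), here is a minimal sketch of that alternative: pull the number after "_rx" out of each column name and sort on it numerically. The names wide_cols, rx_num, prefix and ord are only illustrative, and the sketch assumes every non-id column follows the <variable>_rx<number> pattern produced by the dcast above.

```r
library(data.table)

# same sample data as in the question
id    <- c(rep(1, 10), rep(2, 5), rep(3, 12))
th    <- c(rep(c(0, 1), 5), c(0, 1, 0, 1, 0), rep(c(1, 0, 1), 4))
drugs <- c(rep(c("A", "B", "C", "D", "E"), 2), c("A", "B", "B", "B", "A"),
           rep(c("C", "D", "c"), 4))
DT <- data.table(id, th, drugs)

cols <- c("th", "drugs")
DT_wide <- dcast(DT, id ~ paste0("rx", rowid(id)), value.var = cols)

# all columns created by dcast, i.e. everything except id
wide_cols <- setdiff(names(DT_wide), "id")
# numeric suffix after "_rx", e.g. "th_rx10" -> 10
rx_num <- as.integer(sub(".*_rx", "", wide_cols))
# variable part before "_rx", e.g. "th_rx10" -> "th"
prefix <- sub("_rx[0-9]+$", "", wide_cols)
# sort by the number first, then by the position of the variable in cols
ord <- order(rx_num, match(prefix, cols))
setcolorder(DT_wide, c("id", wide_cols[ord]))
names(DT_wide)
```

Because the suffix is compared as an integer, rx2 sorts before rx10 without any zero padding, and the order of cols decides whether th or drugs comes first within each number.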
Almost the same approach as the previous answer, but differing in details such as the use of sprintf() and rowid(id) inside dcast():
library(data.table)
library(magrittr)
DTw <- dcast(DT, id ~ sprintf("rx%02i", rowid(id)), value.var = c("th", "drugs"))
newcols <- DT[, CJ(max(rowid(id)) %>% seq() %>% sprintf("rx%02i", .),
setdiff(names(.SD), "id"))][
, c("id",paste(V2, V1, sep = "_"))]
setcolorder(DTw, newcols)
DTw
id drugs_rx01 th_rx01 drugs_rx02 th_rx02 drugs_rx03 th_rx03 drugs_rx04 th_rx04 drugs_rx05 th_rx05 drugs_rx06 th_rx06
1: 1 A 0 B 1 C 0 D 1 E 0 A 1
2: 2 A 0 B 1 B 0 B 1 A 0 <NA> NA
3: 3 C 1 D 0 c 1 C 1 D 0 c 1
drugs_rx07 th_rx07 drugs_rx08 th_rx08 drugs_rx09 th_rx09 drugs_rx10 th_rx10 drugs_rx11 th_rx11 drugs_rx12 th_rx12
1: B 0 C 1 D 0 E 1 <NA> NA <NA> NA
2: <NA> NA <NA> NA <NA> NA <NA> NA <NA> NA <NA> NA
3: C 1 D 0 c 1 C 1 D 0 c 1
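The reason the zero padding helps is that the generated column labels are character strings, so they sort lexicographically rather than numerically. A small sketch of the difference (plain and padded are illustrative names; method = "radix" is used only so the result does not depend on the session locale):

```r
# column suffixes without and with zero padding
plain  <- paste0("rx", 1:12)
padded <- sprintf("rx%02i", 1:12)

# character sorting puts rx10, rx11, rx12 right after rx1
sort(plain, method = "radix")

# with a fixed width, character order and numeric order coincide
sort(padded, method = "radix")
```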
By the way, there is a feature request on GitHub: "Optionally order columns of multiple value.var in dcast() by RHS of formula".
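Edit: adapting the sprintf() format automatically
The OP commented that the format in sprintf() has to be changed in the code if an id has more than 99 rows.
If the maximum number of rows per id is not known in advance, the sprintf() format can be built programmatically: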
# create another sample dataset
id <- c(rep(1,200), rep(2, 5), rep(3,12))
th <- c(rep(c(0,1),100), c(0, 1, 0, 1, 0), rep(c(1,0,1),4 ))
drugs <- c(rep(c("A","B","C","D","E"), 40), c("A", "B", "B", "B", "A"), rep(c("C","D","c"),4 ))
DT2 <- data.table(id, th, drugs)
# compute fmt programmatically
max_id_count <- DT2[, max(rowid(id))]
fmt <- max_id_count %>% log10() %>% ceiling() %>% paste0("rx%0", ., "i")
DTw <- dcast(DT2, id ~ sprintf(fmt, rowid(id)), value.var = c("th", "drugs"))
newcols <- DT2[, CJ(max_id_count %>% seq() %>% sprintf(fmt, .),
setdiff(names(.SD), "id"))][
, c("id",paste(V2, V1, sep = "_"))]
setcolorder(DTw, newcols)
DTw
id drugs_rx001 th_rx001 drugs_rx002 th_rx002 drugs_rx003 th_rx003 drugs_rx004 th_rx004 drugs_rx005 th_rx005
1: 1 A 0 B 1 C 0 D 1 E 0
2: 2 A 0 B 1 B 0 B 1 A 0
drugs_rx006 th_rx006 drugs_rx007 th_rx007 drugs_rx008 th_rx008 drugs_rx009 th_rx009 drugs_rx010 th_rx010 drugs_rx011
1: A 1 B 0 C 1 D 0 E 1 A
2: <NA> NA <NA> NA <NA> NA <NA> NA <NA> NA <NA>
...
In this sample dataset max_id_count is 200. By taking the base-10 logarithm and rounding up, we can programmatically create a fitting fmt argument, "rx%03i".
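One caveat with the logarithm: for an exact power of ten the rounded-up log10 is one digit short, e.g. ceiling(log10(100)) is 2 although 100 has three digits, so "rx100" would sort between "rx10" and "rx11". A minimal sketch of a width computation that counts digits instead, using a hypothetical helper make_fmt() (not part of data.table):

```r
# Hypothetical helper, not part of data.table: build a zero-padded sprintf()
# format that is wide enough for the largest counter value.
make_fmt <- function(max_count, prefix = "rx") {
  # as.integer() avoids scientific notation such as "1e+05" in as.character()
  width <- nchar(as.character(as.integer(max_count)))
  paste0(prefix, "%0", width, "i")
}

make_fmt(12)    # two digits
make_fmt(100)   # three digits, where ceiling(log10(100)) would give two
make_fmt(200)   # three digits, same as the log10 version

# usage with a counter value
sprintf(make_fmt(100), 7L)
```

In the code above, fmt <- make_fmt(max_id_count) could then be used in place of the log10 pipeline; everything else stays the same.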