Efficiently reshaping a non-standard dummy-coded matrix or table in R
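I have a data frame with a few hundred thousand rows and 6 columns. Each column contains IDs (about 500 unique IDs in total). I want to convert this data frame into one large table/matrix in which each unique ID has its own column and each entry is -1, 0, or 1, according to the following logic: 0 if the ID is not present in the row, -1 if the ID is in the first 3 columns, and 1 if the ID is in the last 3 columns.
I could use a brute-force approach and loop over every row, but I'm looking for a faster, more polished way to do this. My preference is a dplyr solution, if one exists. I suspect there is also a slick way to do this with data.table or even just plain base R. Any help is appreciated!
Thanks in advance. Here is a sample of my data: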
df <- data.frame(matrix(c("XX001","XX002","XX003","XX007","XX008","XX009",
"XX001","XX004","XX005","XX006","XX010","XX008",
"XX003","XX002","XX005","XX008","XX009","XX010",
"XX002","XX005","XX003","XX009","XX007","XX010",
"XX001","XX002","XX004","XX007","XX009","XX006"),
nrow=5, ncol=6, byrow=TRUE))
names(df) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")
df
> df
ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX002 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006
Here is what I would like the output to look like:
> yay
XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010 ... XX500
1 -1 -1 -1 0 0 0 1 1 1 0 ... 0
2 -1 0 0 -1 -1 1 0 1 0 1 ... 0
3 0 -1 -1 0 -1 0 0 1 1 1 ... 0
4 0 -1 -1 0 -1 0 1 0 1 1 ... 0
5 -1 -1 0 -1 0 1 1 0 1 0 ... 0
Try this:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>%
  mutate(id = row_number()) %>%   # remember the original row
  pivot_longer(-id) %>%           # one row per (id, column) pair
  group_by(id) %>%
  # within each original row: first 3 entries get -1, last 3 entries get 1
  mutate(Val = ifelse(row_number() %in% 1:3, -1,
                      ifelse(row_number() %in% ((n() - 2):n()), 1, 0))) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = Val,
              names_sort = TRUE, values_fill = 0) %>%
  ungroup() %>%
  select(-id)
Output:
# A tibble: 5 x 10
XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1 -1 -1 0 0 0 1 1 1 0
2 -1 0 0 -1 -1 1 0 1 0 1
3 0 -1 -1 0 -1 0 0 1 1 1
4 0 -1 -1 0 -1 0 1 0 1 1
5 -1 -1 0 -1 0 1 1 0 1 0
Update: Since the OP has an issue with duplicated values, here is a possible sketch for handling that case. First, some dummy data:
df2
ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX001 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006
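We can see that the first row contains a duplicate. We can therefore create an index that tells the duplicated values apart. Here is the code: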
#Code 2
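newdf <- df2 %>%
  mutate(id = row_number()) %>%   # remember the original row
  pivot_longer(-id) %>%           # one row per (id, column) pair
  group_by(id) %>%
  # within each original row: first 3 entries get -1, last 3 entries get 1
  mutate(Val = ifelse(row_number() %in% 1:3, -1,
                      ifelse(row_number() %in% ((n() - 2):n()), 1, 0))) %>%
  ungroup() %>%
  group_by(id, value) %>%
  # number repeated IDs within a row: XX001.1, XX001.2, ...
  mutate(value = paste0(value, '.', row_number())) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = Val,
              names_sort = TRUE, values_fill = 0) %>%
  ungroup() %>%
  select(-id)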
Output:
# A tibble: 5 x 11
XX001.1 XX001.2 XX002.1 XX003.1 XX004.1 XX005.1 XX006.1 XX007.1 XX008.1 XX009.1 XX010.1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1 -1 0 -1 0 0 0 1 1 1 0
2 -1 0 0 0 -1 -1 1 0 1 0 1
3 0 0 -1 -1 0 -1 0 0 1 1 1
4 0 0 -1 -1 0 -1 0 1 0 1 1
5 -1 0 -1 0 -1 0 1 1 0 1 0
This way the duplicated values are preserved.
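With a few hundred thousand rows, duplicates cannot be spotted by eye as in the dummy data. A minimal base R sketch for locating the affected rows before deciding how to handle them is given here. It assumes `df2` has the same layout as the dummy data above, and `first_dup` and `dup_rows` are illustrative names only:

```r
# df2 rebuilt from the dummy data shown above (row 1 repeats XX001)
df2 <- data.frame(matrix(c("XX001","XX001","XX003","XX007","XX008","XX009",
                           "XX001","XX004","XX005","XX006","XX010","XX008",
                           "XX003","XX002","XX005","XX008","XX009","XX010",
                           "XX002","XX005","XX003","XX009","XX007","XX010",
                           "XX001","XX002","XX004","XX007","XX009","XX006"),
                         nrow = 5, ncol = 6, byrow = TRUE))
names(df2) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")

# anyDuplicated() gives 0 for a row without repeats,
# otherwise the position of the first repeated entry
first_dup <- apply(as.matrix(df2), 1, anyDuplicated)

# indices of the rows that contain at least one repeated ID
dup_rows <- unname(which(first_dup > 0))
dup_rows
```

For the dummy data this flags only the first row. If `dup_rows` comes back empty on the real data, the simpler first version of the code should be sufficient.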
Here is a vectorized solution:
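# sorted vector of all unique IDs (these become the column names)
id <- sort(unique(as.character(as.matrix(df))))
# for one row: position (1-6) at which each ID occurs, NA if absent
match_id <- function(x) match(id, x)
yay <- as.data.frame(t(apply(df, 1, match_id)))
names(yay) <- id
yay[yay <= 3] <- -1    # found in the first 3 columns
yay[yay > 3] <- 1      # found in the last 3 columns
yay[is.na(yay)] <- 0   # not present in the row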
Output:
yay
# XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
# 1 -1 -1 -1 0 0 0 1 1 1 0
# 2 -1 0 0 -1 -1 1 0 1 0 1
# 3 0 -1 -1 0 -1 0 0 1 1 1
# 4 0 -1 -1 0 -1 0 1 0 1 1
# 5 -1 -1 0 -1 0 1 1 0 1 0
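Note that `apply(df, 1, ...)` still calls a function once per row, which may be noticeable with a few hundred thousand rows. A possible alternative, sketched here in base R and not taken from the answers above, fills a zero matrix in one step by indexing it with (row, column) pairs. It assumes every ID occurs at most once per row, and `m`, `ids` and `out` are illustrative names:

```r
# Example data copied from the question
df <- data.frame(matrix(c("XX001","XX002","XX003","XX007","XX008","XX009",
                          "XX001","XX004","XX005","XX006","XX010","XX008",
                          "XX003","XX002","XX005","XX008","XX009","XX010",
                          "XX002","XX005","XX003","XX009","XX007","XX010",
                          "XX001","XX002","XX004","XX007","XX009","XX006"),
                        nrow = 5, ncol = 6, byrow = TRUE))
names(df) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")

m   <- as.matrix(df)
ids <- sort(unique(as.vector(m)))   # one output column per unique ID

# start from all zeros ("ID not present in this row")
out <- matrix(0L, nrow = nrow(m), ncol = length(ids),
              dimnames = list(NULL, ids))

# (row, column) pairs: the source row of each cell and the output column of its ID;
# row(m), match(m, ids) and col(m) are all traversed in column-major order
idx <- cbind(as.vector(row(m)), match(m, ids))

# -1 for cells from the first 3 columns, 1 for cells from the last 3
out[idx] <- ifelse(col(m) <= 3, -1L, 1L)
out
```

The result is an integer matrix rather than a data frame; `as.data.frame(out)` converts it if needed. If a row does contain the same ID twice, the later assignment overwrites the earlier one, so that case would need separate handling.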