How to reshape wide continuous data into long categorical data?
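My data is in the following wide format: one row per SUBJECT_ID, holding the total number of observations of variables X and Y, followed by various metadata columns such as SUBJECT_BIRTHYEAR and SUBJECT_HOMETOWN: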
variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown
I would like to convert it to the following long format, with one row for each observation of variables X and Y for each SUBJECT_ID:
VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
X A 1950 Townsville
X A 1950 Townsville
Y A 1950 Townsville
X B 1951 Villestown
Y B 1951 Villestown
Y B 1951 Villestown
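Specifically, my question is how to turn a continuous variable that records n observations into n rows of categorical data.
Try the following approach.
Data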
df <- read.table(text="variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown", header=TRUE)
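Solution
library(tidyverse)
result <- df %>%
  # tidyr >= 1.0 syntax; older versions used nest(variableX, variableY, .key = "VARIABLE")
  nest(VARIABLE = c(variableX, variableY)) %>%
  mutate(VARIABLE = map(VARIABLE, function(i) {
    # named vector of counts, e.g. c(variableX = 2, variableY = 1)
    vec <- unlist(i)
    # repeat each label ("X", "Y") as many times as its count
    rep(gsub("variable", "", names(vec)), times = vec)
  })) %>%
  unnest(VARIABLE)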
# A tibble: 6 x 4
# SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN VARIABLE
# <fctr> <int> <fctr> <chr>
# 1 A 1950 Townsville X
# 2 A 1950 Townsville X
# 3 A 1950 Townsville Y
# 4 B 1951 Villestown X
# 5 B 1951 Villestown Y
# 6 B 1951 Villestown Y
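The same idea (reshape to long, then repeat each row by its count) can also be sketched with `tidyr::pivot_longer()` and `tidyr::uncount()`. This is an alternative, not the approach from the answer above, and it assumes tidyr 1.0 or later. The names `long`, `n` and `result2` are only illustrative.

```r
library(tidyr)

# Same example data as above
df <- read.table(text = "variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown", header = TRUE)

# Step 1: one row per (subject, variable) pair, with the count in column n
long <- pivot_longer(
  df,
  cols = c(variableX, variableY),
  names_to = "VARIABLE",
  names_prefix = "variable",  # strip the prefix so the labels become "X" / "Y"
  values_to = "n"
)

# Step 2: repeat each row n times; uncount() drops the n column by default
result2 <- uncount(long, n)
result2
```

Rows with a count of zero simply disappear, which is usually what is wanted for this kind of expansion.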
Here is an option using base R:
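df1 <- df  # the original answer calls the wide data frame df1
# 2 x n matrix of counts, one column per subject; row() of the transposed
# matrix (rather than of df1[1:2]) keeps the labels right for any number of subjects
cnt <- t(df1[1:2])
res <- cbind(VARIABLE = rep(substr(names(df1)[1:2], 9, 9)[row(cnt)], cnt),
             df1[rep(seq_len(nrow(df1)), rowSums(df1[1:2])), -(1:2)])
row.names(res) <- NULL
res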
# VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
#1 X A 1950 Townsville
#2 X A 1950 Townsville
#3 Y A 1950 Townsville
#4 X B 1951 Villestown
#5 Y B 1951 Villestown
#6 Y B 1951 Villestown
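The question asks to reverse a call to dcast() that used length() as the aggregation function to reshape data from long to wide format.
This can be achieved by calling melt() plus some additional transformations: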
library(data.table)
# reshape wide back to long format
long <- melt(setDT(wide), measure.vars = c("variableX", "variableY"))[
# undo munging of variable names
, variable := stringr::str_replace(variable, "^variable", "")][]
# undo effect of aggregation by length()
result <- long[long[, rep(.I, value)]][
# beautify result
order(SUBJECT_ID), !"value"]
result
SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variable
1: A 1950 Townsville X
2: A 1950 Townsville X
3: A 1950 Townsville Y
4: B 1951 Villestown X
5: B 1951 Villestown Y
6: B 1951 Villestown Y
.I is a special symbol that holds the row locations, i.e., the row indices.
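To see what `rep(.I, value)` contributes on its own, here is a minimal sketch on a toy table. The table `dt` and its columns are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy table: each row carries a count of how often it should be repeated
dt <- data.table(id = c("a", "b", "c"), value = c(2L, 0L, 1L))

# .I is 1:3 here, so each row index is repeated value times
idx <- dt[, rep(.I, value)]
idx

# Subsetting by the repeated indices expands the table;
# row "b" has a count of 0 and therefore drops out
expanded <- dt[idx]
expanded
```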
To demonstrate that this is indeed the inverse operation, result can be reshaped again to reproduce wide:
dcast(result, ... ~ paste0("variable", variable), length, value.var = "variable")
SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variableX variableY
1: A 1950 Townsville 2 1
2: B 1951 Villestown 1 2
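As an independent cross-check that does not rely on dcast(), the counts can also be recovered from the long data with base R's `table()`. This is only a sketch: the data frame `long_check` below retypes the expected long result by hand, and its name is illustrative.

```r
# Long result as shown above, reduced to the two columns that matter
long_check <- data.frame(
  SUBJECT_ID = c("A", "A", "A", "B", "B", "B"),
  variable   = c("X", "X", "Y", "X", "Y", "Y")
)

# Cross-tabulating subject against variable counts the rows per combination,
# which is what aggregating with length() does in dcast()
counts <- table(long_check$SUBJECT_ID, long_check$variable)
counts
```

The cell for subject A and variable X should equal the original variableX value for A, and likewise for the other cells.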
Data
library(data.table)
wide <- fread("variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown")