How to reshape wide continuous data into long categorical data?

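My data are in the following wide format, with one row per SUBJECT_ID. Each row holds the total number of observations of variables X and Y, followed by various metadata columns such as SUBJECT_BIRTHYEAR and SUBJECT_HOMETOWN: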

variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown

I would like to convert them into the following long format, with one row for every observation of variable X or Y for each SUBJECT_ID:

VARIABLE     SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
X            A              1950                  Townsville
X            A              1950                  Townsville
Y            A              1950                  Townsville
X            B              1951                  Villestown
Y            B              1951                  Villestown
Y            B              1951                  Villestown

Specifically, my question is how to turn n observations of a continuous variable into n rows of categorical data.

Try the following approach.

Data

df <- read.table(text="variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown", header=TRUE)

Solution

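library(tidyverse)
result <- df %>%
        # collect the two count columns into one list-column per subject
        # (named-argument form required by tidyr >= 1.0)
        nest(VARIABLE = c(variableX, variableY)) %>%
        mutate(VARIABLE = map(VARIABLE, function(i) {
                                    vec <- unlist(i)
                                    # repeat "X" / "Y" as many times as its count
                                    rep(gsub("variable", "", names(vec)), times=vec)
                                })) %>%
        unnest(VARIABLE)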

# A tibble: 6 x 4
#   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN VARIABLE
#       <fctr>             <int>           <fctr>    <chr>
# 1          A              1950       Townsville        X
# 2          A              1950       Townsville        X
# 3          A              1950       Townsville        Y
# 4          B              1951       Villestown        X
# 5          B              1951       Villestown        Y
# 6          B              1951       Villestown        Y
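
As an alternative sketch (not the method used above), the same expansion can probably be written more directly with tidyr::pivot_longer() and tidyr::uncount(), assuming tidyr 1.0 or later is installed. The data frame is rebuilt inline so the block runs on its own; the names `long_counts`, `result2`, and the helper column `n` are chosen here for illustration only.

```r
library(tidyr)

# Example data from the question, rebuilt so this block is self-contained
df <- data.frame(
  variableX = c(2L, 1L),
  variableY = c(1L, 2L),
  SUBJECT_ID = c("A", "B"),
  SUBJECT_BIRTHYEAR = c(1950L, 1951L),
  SUBJECT_HOMETOWN = c("Townsville", "Villestown")
)

# Step 1: one row per subject and variable, with the count in column `n`;
# names_prefix strips the leading "variable" so only "X" / "Y" remain
long_counts <- pivot_longer(
  df,
  cols = c(variableX, variableY),
  names_to = "VARIABLE",
  names_prefix = "variable",
  values_to = "n"
)

# Step 2: duplicate every row `n` times; uncount() drops `n` afterwards
result2 <- uncount(long_counts, n)
result2
```

uncount() is essentially the inverse of dplyr::count(), which is why it fits the "n observations into n rows" requirement without any manual rep() call.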

Here is an option using base R:

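df1 <- df  # df1 refers to the example data frame defined above
# t() puts each subject in its own column, so labels and counts are both
# walked subject by subject; row(t(...)) keeps this correct for any number of rows
res <- cbind(VARIABLE = rep(substr(names(df1)[1:2], 9, 9)[row(t(df1[1:2]))], t(df1[1:2])), 
        # repeat each subject's metadata once per observation
        df1[rep(seq_len(nrow(df1)), rowSums(df1[1:2])), -(1:2)])
row.names(res) <- NULL
res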
#   VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
#1        X          A              1950       Townsville
#2        X          A              1950       Townsville
#3        Y          A              1950       Townsville
#4        X          B              1951       Villestown
#5        Y          B              1951       Villestown
#6        Y          B              1951       Villestown
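
The one-liner above leans entirely on rep() with a vector-valued `times` argument. A minimal sketch of that mechanism on a toy count matrix may make it easier to follow; the object names (`counts`, `subject`, `variable`) are made up for this illustration.

```r
# Toy counts: subject A has 2 X and 1 Y, subject B has 1 X and 2 Y
counts <- matrix(c(2, 1,
                   1, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("X", "Y")))

# Repeat each subject label once per observation (its row total)
subject <- rep(rownames(counts), times = rowSums(counts))

# t() makes each subject a column, so as.vector() walks the counts
# subject by subject: A-X, A-Y, B-X, B-Y. The X/Y labels are recycled
# in the same order and each is repeated as often as its count says.
variable <- rep(rep(colnames(counts), times = nrow(counts)),
                times = as.vector(t(counts)))

data.frame(VARIABLE = variable, SUBJECT_ID = subject)
```

Both vectors come out with length sum(counts), so they line up row by row in the final data frame.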

The question asks to reverse a call to dcast() that uses length() as the aggregation function to reshape data from long to wide format.

This can be achieved by calling melt() plus a few additional transformations:

library(data.table)
# reshape wide back to long format
long <- melt(setDT(wide), measure.vars = c("variableX", "variableY"))[
  # undo munging of variable names
  , variable := stringr::str_replace(variable, "^variable", "")][]
# undo effect of aggregation by length()
result <- long[long[, rep(.I, value)]][
  # beautify result
  order(SUBJECT_ID), !"value"]
result
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variable
1:          A              1950       Townsville        X
2:          A              1950       Townsville        X
3:          A              1950       Townsville        Y
4:          B              1951       Villestown        X
5:          B              1951       Villestown        Y
6:          B              1951       Villestown        Y

.I is a special symbol that holds the row locations, i.e., the row indices.
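
To see what rep(.I, value) does in isolation, here is a small sketch on a hypothetical toy table (the names `toy`, `idx`, and `expanded` are illustrative and not part of the answer above):

```r
library(data.table)

# Hypothetical toy table: `value` says how often each row should appear
toy <- data.table(id = c("a", "b", "c"), value = c(2L, 0L, 3L))

# .I is 1, 2, 3 here, so rep(.I, value) yields 1 1 3 3 3
idx <- toy[, rep(.I, value)]

# Subsetting by the repeated indices duplicates rows with value > 1
# and drops rows whose value is 0
expanded <- toy[idx]
expanded
```

Rows with a count of zero disappear, which matches what one would expect when undoing an aggregation by length().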


To demonstrate that this is indeed the inverse operation, result can be reshaped again to reproduce wide:

dcast(result, ... ~ paste0("variable", variable), length, value.var = "variable")
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variableX variableY
1:          A              1950       Townsville         2         1
2:          B              1951       Villestown         1         2

Data

library(data.table)
wide <- fread("variableX    variableY    SUBJECT_ID     SUBJECT_BIRTHYEAR     SUBJECT_HOMETOWN
2            1            A              1950                  Townsville
1            2            B              1951                  Villestown")