Reshaping wide to tall data over multiple variables

My data currently looks like this:

wide.df <- read.table(header = TRUE, sep = ",", text = "
ID, left.mid.brain, right.mid.brain, left.lat.brain, right.lat.brain, score, group
100, 18 , 4, 29, 30, 40, 0
101, 19,  7, 33, 40, 29, 0
103, 19, 19, 22, 30, 33, 0
200, 29, 30, 22, 33, 11, 1
233, 100, 33, 22, 44, 55, 1")

I need to convert my data to long format, like this:

ID  group  left.or.right  mid.or.lat    brain     score
100   0          0             0           29        40   # 0 = left, 0=lat 
100   0          1             0           30        40   # 1 = right, 0=lat
100   0          0             1           18        40   # 0 = left, 1 = mid
100   0          1             1            4        40   # 1 = right, 1 = mid
101   0          0             0           33        29   # 0 = left, 0 = lat
.
.
.
.
.
233   1           1            1            33        55   # 1= right, 1= mid

Here left.mid.brain, right.mid.brain, left.lat.brain, and right.lat.brain become factors, but their values are retained, and each participant gets four rows.

The tidyverse (specifically the dplyr and tidyr packages) is very good at operations like this:

library(tidyverse)

long.df <- wide.df %>% 
  gather(variable, brain, left.mid.brain, right.mid.brain, left.lat.brain, right.lat.brain) %>% 
  mutate(
    left.or.right = ifelse(grepl('left', variable), 0, 1),
    mid.or.lat = ifelse(grepl('lat', variable), 0, 1)
  ) %>% 
  select(ID, group, left.or.right, mid.or.lat, brain, score) %>% 
  arrange(ID)

    ID group left.or.right mid.or.lat brain score
1  100     0             0          1    18    40
2  100     0             1          1     4    40
3  100     0             0          0    29    40
4  100     0             1          0    30    40
5  101     0             0          1    19    29
6  101     0             1          1     7    29
7  101     0             0          0    33    29
8  101     0             1          0    40    29
9  103     0             0          1    19    33
10 103     0             1          1    19    33
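The question asks for four rows per participant with the original values retained, so it may be worth checking the reshaped result against those two requirements. Below is a minimal sketch of such a check. It rebuilds the example data with data.frame() so the block runs on its own; the names brain.cols, rows.per.id, all.have.four, values.kept, and combos.unique are illustrative and do not appear in the original post.

```r
library(dplyr)
library(tidyr)

# Same values as the example data in the question, built directly
# so this block does not depend on the read.table() call above.
wide.df <- data.frame(
  ID              = c(100, 101, 103, 200, 233),
  left.mid.brain  = c(18, 19, 19, 29, 100),
  right.mid.brain = c(4, 7, 19, 30, 33),
  left.lat.brain  = c(29, 33, 22, 22, 22),
  right.lat.brain = c(30, 40, 30, 33, 44),
  score           = c(40, 29, 33, 11, 55),
  group           = c(0, 0, 0, 1, 1)
)

# Hypothetical helper vector: the four columns being stacked.
brain.cols <- c("left.mid.brain", "right.mid.brain",
                "left.lat.brain", "right.lat.brain")

# The reshape from the answer above.
long.df <- wide.df %>%
  gather(variable, brain, left.mid.brain, right.mid.brain,
         left.lat.brain, right.lat.brain) %>%
  mutate(
    left.or.right = ifelse(grepl("left", variable), 0, 1),
    mid.or.lat    = ifelse(grepl("lat", variable), 0, 1)
  ) %>%
  select(ID, group, left.or.right, mid.or.lat, brain, score) %>%
  arrange(ID)

# Check 1: every participant has exactly four rows.
rows.per.id   <- count(long.df, ID)
all.have.four <- all(rows.per.id$n == 4)

# Check 2: the total of the brain values is unchanged by the reshape.
values.kept <- sum(long.df$brain) == sum(as.matrix(wide.df[brain.cols]))

# Check 3: each left/right x mid/lat combination occurs once per participant.
combos.unique <- !any(duplicated(long.df[c("ID", "left.or.right", "mid.or.lat")]))
```

If all three flags are TRUE, the long table has the shape the question describes. The sum comparison is only a coarse guard against dropped or duplicated values, not a cell-by-cell comparison.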

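Here is another dplyr/tidyr-based approach that should scale well. After gathering into long form, you have a column with values like "right.mid.brain", which you want to split into "right" and "mid". tidyr::separate does this easily by splitting on the literal dot, which avoids too much hard-coding. It also produces a dummy column, which I drop later.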

At that point, you have:

library(dplyr)
library(tidyr)

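# Stack the four brain columns, then split the old column names on the dot.
# The dot must be escaped as "\\." because sep is a regular expression.
wide.df %>%
  gather(key, value = brain, -ID, -score, -group) %>%
  separate(key, into = c("left.or.right", "mid.or.lat", "dummy"), sep = "\\.") %>%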
  head()
#>    ID score group left.or.right mid.or.lat dummy brain
#> 1 100    40     0          left        mid brain    18
#> 2 101    29     0          left        mid brain    19
#> 3 103    33     0          left        mid brain    19
#> 4 200    11     1          left        mid brain    29
#> 5 233    55     1          left        mid brain   100
#> 6 100    40     0         right        mid brain     4

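If you need more complex recoding, you could use some forcats functions to recode the factor levels. In this case, it is simple enough to convert the columns based on conditions such as left.or.right == "right": TRUE becomes 1 and FALSE (that is, left) becomes 0. select() then puts the columns in the order you want.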

long <- wide.df %>%
  gather(key, value = brain, -ID, -score, -group) %>%
  separate(key, into = c("left.or.right", "mid.or.lat", "dummy"), sep = "\\.") %>%
  mutate(left.or.right = as.numeric(left.or.right == "right"),
         mid.or.lat = as.numeric(mid.or.lat == "mid")) %>%
  select(ID, group, left.or.right, mid.or.lat, brain, score) %>%
  arrange(ID)

head(long)
#>    ID group left.or.right mid.or.lat brain score
#> 1 100     0             0          1    18    40
#> 2 100     0             1          1     4    40
#> 3 100     0             0          0    29    40
#> 4 100     0             1          0    30    40
#> 5 101     0             0          1    19    29
#> 6 101     0             1          1     7    29
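For readers on tidyr 1.0.0 or later, where gather() is superseded by pivot_longer(), the same gather-then-separate idea can probably be written in one step. The following is a sketch under that assumption, not part of the original answer. It rebuilds the example data with data.frame() so it runs on its own, and it relies on pivot_longer() accepting NA inside names_to to discard a piece of the split column name (here the trailing "brain").

```r
library(dplyr)
library(tidyr)  # pivot_longer() needs tidyr >= 1.0.0

# Same values as the example data in the question.
wide.df <- data.frame(
  ID              = c(100, 101, 103, 200, 233),
  left.mid.brain  = c(18, 19, 19, 29, 100),
  right.mid.brain = c(4, 7, 19, 30, 33),
  left.lat.brain  = c(29, 33, 22, 22, 22),
  right.lat.brain = c(30, 40, 30, 33, 44),
  score           = c(40, 29, 33, 11, 55),
  group           = c(0, 0, 0, 1, 1)
)

long.df <- wide.df %>%
  pivot_longer(
    cols      = ends_with(".brain"),
    # Split each column name on the dot; NA drops the third piece ("brain"),
    # so no dummy column is created.
    names_to  = c("left.or.right", "mid.or.lat", NA),
    names_sep = "\\.",
    values_to = "brain"
  ) %>%
  # Recode to the 0/1 scheme from the question: right = 1, mid = 1.
  mutate(
    left.or.right = as.integer(left.or.right == "right"),
    mid.or.lat    = as.integer(mid.or.lat == "mid")
  ) %>%
  select(ID, group, left.or.right, mid.or.lat, brain, score) %>%
  arrange(ID)
```

The result is a tibble rather than a plain data frame. If the indicator columns should be true factors, as the question mentions, wrapping them in factor() after the mutate() step would be one way to do it.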