How can I reshape a long dataset into a short dataset with multiple variables?

**Update

My dataset contains 314,090 observations in the following format:

UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
16 04999120040277 2.879744e-03 0.15326902 supervises collective followers very closely 1

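Note: BRIEF_ID values repeat (3,205 unique IDs), and so do LDR_Topic values (15 unique topics, each corresponding to an LDR_7Code), which is why the data is long.

I want to reshape this data so that each row is a unique BRIEF_ID (3,205 rows) and each LDR_Topic (15) is its own column (20 columns in total), with the corresponding LDR_SUM as the value in that column. For example: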

UPDATEDID BRIEF_ID supervises collective followers very closely
16 04999120040277 0.15326902

So far I have tried:

BriefingGammas4<-reshape(data = BriefingGammas3, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic", 
                         direction = "wide")

But R aborts and starts a new session.

Any suggestions? Thanks!

***** Update

I tried the following approaches, but none of them produced the correct table.

install.packages("data.table")
library (data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

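This gave the correct 3,205 rows, but the values for each LDR_Topic are wrong (they should not all be the same, and they should be decimals; the numbers seem to reflect LDR_7Code rather than the data). See the example below: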

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 2 4 15
2 01999120041284 2 4 15
3 01999300213 2 4 15

Then I tried this:

install.packages("tidyverse")
library (tidyverse)
BriefingGammas6 <- BriefingGammas5 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

This gave the correct values for each LDR_Topic, but the wrong number of rows. It stayed at 314,090 rows instead of 3,205. See the example:

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 0.02843241 NA NA
2 01999110036250 NA 0.010892233 NA
3 01999110036250 NA 0.010892233 0.006081761
4 01999110036250 0.02843241 NA 0.006081761

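Essentially, it filled in the values for the 3,205 rows for one topic (repeated many times), then started filling in values for the next topic. But I want the 3,205 rows to look like this: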

UPDATEDID BRIEF_ID acquired resources distributed resources enhanced
1 01999110036250 0.02843241 0.010892233 0.006081761
2 01999120041284 0.1594207 0.005315201 0.004850703
3 01999300213 0.4374699 0.01607505 0.003971634

The last thing I tried was this:

BriefingGammas7<-reshape(data = BriefingGammas6, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic",
                         v.names = "LDR_SUM",
                         direction = "wide")

The result was:

UPDATEDID BRIEF_ID "acquired resources", "distributed"...
1 01999110036250 NA
2 01999120041284 NA

No other rows came out.

Solution update*

Step 1. Reduce the number of variables.
Step 2. Remove duplicate observations.

BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select=c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))

Step 3. Use the tidyverse approach from the comments below.

BriefingGammas8 <- BriefingGammas7 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)

To make the case clearer, I created a second row of dummy data that follows the pattern of the first row:

dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277", 
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902, 
0.25326902), LDR_Topic = c("supervises collective followers very closely", 
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA, 
-2L))

dat
  UPDATEDID       BRIEF_ID   gamma  LDR_SUM                                    LDR_Topic LDR_7Code
1        16 04999120040277 879.744 0.153269 supervises collective followers very closely         1
2        17 14999120040277 779.744 0.253269                          does something else         2

The base R way

dat |> 
  reshape(direction = "wide", 
          idvar  = "UPDATEDID",
          timevar ="LDR_Topic",
          v.names = "LDR_SUM")|>
  subset(select = -c(gamma, LDR_7Code))

# The result

#  UPDATEDID       BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1        16 04999120040277                                             0.153269                          NA
#2        17 14999120040277                                                   NA                    0.253269

A tidyverse way

library(tidyverse)

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

#The result

# A tibble: 2 × 4
#  UPDATEDID BRIEF_ID       `supervises collective followers very closely` `does something else`
#      <dbl> <chr>                                                   <dbl>                 <dbl>
#1        16 04999120040277                                          0.153                NA    
#2        17 14999120040277                                         NA                     0.253
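
One caveat, sketched on made-up data (the `toy` table and its values are invented for illustration): `pivot_wider()` treats every column that is not named in `names_from` or `values_from` as part of the row key. If a column such as `gamma` differs from row to row, the rows cannot collapse, and dropping it with `select()` afterwards is too late. Naming the key columns through `id_cols` is one way around this, assuming each ID and topic pair carries a single LDR_SUM value.

```r
library(dplyr)
library(tidyr)

# Made-up long data: gamma differs on every row.
toy <- tibble(
  UPDATEDID = c(1, 1, 2, 2),
  BRIEF_ID  = c("A", "A", "B", "B"),
  gamma     = c(0.01, 0.02, 0.03, 0.04),
  LDR_SUM   = c(0.10, 0.20, 0.30, 0.40),
  LDR_Topic = c("acquired", "enhanced", "acquired", "enhanced")
)

# gamma is still present, so it becomes part of the row key:
# the result keeps 4 rows, padded with NA.
too_long <- toy |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)

# Naming the key columns explicitly leaves gamma out before widening:
# the result has one row per UPDATEDID/BRIEF_ID pair.
wide <- toy |>
  pivot_wider(
    id_cols     = c(UPDATEDID, BRIEF_ID),
    names_from  = LDR_Topic,
    values_from = LDR_SUM
  )
```

Selecting only the needed columns before calling `pivot_wider()` should have the same effect as `id_cols` here.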

A data.table way (recommended for memory efficiency)

library(data.table)

dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

# The result

#   UPDATEDID       BRIEF_ID does something else supervises collective followers very closely
#1:        16 04999120040277                  NA                                     0.153269
#2:        17 14999120040277            0.253269                                           NA
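
If the real data has more than one row per ID and topic, `dcast()` behaves differently from the two-row example above: when no `fun.aggregate` is supplied and a cell would hold several values, it falls back to `length()` and returns counts. That may be why whole numbers showed up instead of decimals, though this is a guess without the full data. A minimal sketch on made-up data (the `toy` table is invented for illustration):

```r
library(data.table)

# Made-up long data: every (BRIEF_ID, LDR_Topic) pair appears twice.
toy <- data.table(
  BRIEF_ID  = rep(c("A", "B"), each = 4),
  LDR_Topic = rep(c("acquired", "enhanced"), each = 2, times = 2),
  LDR_SUM   = rep(c(0.10, 0.20, 0.30, 0.40), each = 2)
)

# No fun.aggregate and non-unique cells: dcast() counts rows per cell.
counts <- dcast(toy, BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")

# Dropping exact duplicate rows first leaves one value per cell,
# so the original decimals come through.
wide <- dcast(unique(toy), BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")
```

When the repeated rows are not exact duplicates, passing an explicit `fun.aggregate` (for example `mean`) is another option, provided averaging makes sense for the data.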

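Update

Based on your explanation, the tidyverse approach is basically right. The only issue is that there are duplicated rows with NA in some columns, and you want them collapsed into a single row. This is easy to do with the fill() and distinct() functions. The only problem in your example is that UPDATEDID changes from 1, 2, 3, 4 to 1 without any explanation. So for now, I assume we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.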

yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250, 
1999110036250, 1999110036250), acquired_resources = c(0.02843241, 
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233, 
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))

yourdf   # I change the space to '_' to make it easier to control

  UPDATEDID    BRIEF_ID acquired_resources distributed_resources    enhanced
1         1 1.99911e+12         0.02843241                    NA          NA
2         2 1.99911e+12                 NA            0.01089223          NA
3         3 1.99911e+12                 NA            0.01089223 0.006081761
4         4 1.99911e+12         0.02843241                    NA 0.006081761

yourdf[,-1] |>
     fill(acquired_resources,distributed_resources,enhanced, 
     .direction = 'downup') |> 
     distinct()
    

# The result
     BRIEF_ID acquired_resources distributed_resources    enhanced
1 1.99911e+12         0.02843241            0.01089223 0.006081761 

So the complete steps are:

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code)) |>
 fill(acquired_resources,distributed_resources,enhanced, 
     .direction = 'downup') |> 
     distinct()
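
With more than one BRIEF_ID in the table, an ungrouped `fill()` can carry a value from one brief into the rows of the next. A hedged variant, sketched on made-up data (the `toy` table is invented, and the underscore column names follow the renamed example above), groups by BRIEF_ID first so that filling stays within each brief:

```r
library(dplyr)
library(tidyr)

# Made-up half-filled wide table covering two briefs.
toy <- tibble(
  BRIEF_ID           = c("A", "A", "B", "B"),
  acquired_resources = c(0.10, NA, NA, 0.30),
  enhanced           = c(NA, 0.20, 0.40, NA)
)

collapsed <- toy |>
  group_by(BRIEF_ID) |>                  # fill within each brief only
  fill(acquired_resources, enhanced, .direction = "downup") |>
  ungroup() |>
  distinct()                             # identical rows collapse to one
```

Without the `group_by()` step, the first row of brief B in this toy table would inherit 0.10 from brief A.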