How can I reshape a long dataset into a short (wide) dataset with multiple variables?
**Update**
My dataset contains 314,090 observations in the following format:
UPDATEDID | BRIEF_ID | gamma | LDR_SUM | LDR_Topic | LDR_7Code |
---|---|---|---|---|---|
16 | 04999120040277 | 2.879744e-03 | 0.15326902 | supervises collective followers very closely | 1 |
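Note: BRIEF_ID values repeat (3,205 unique IDs), and so do LDR_Topic values (15 unique topics, each corresponding to an LDR_7Code), which is why the data is so long.
I want to reshape the data so that each row is a unique BRIEF_ID (3,205 rows) and each of the 15 LDR_Topic values is its own column (20 columns in total), with the corresponding LDR_SUM as the cell value. For example: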
UPDATEDID | BRIEF_ID | supervises collective followers very closely |
---|---|---|
16 | 04999120040277 | 0.15326902 |
So far I have tried:
BriefingGammas4 <- reshape(data = BriefingGammas3,
  idvar = c("UPDATEDID", "BRIEF_ID"),
  timevar = "LDR_Topic",
  direction = "wide")
But it aborts the R session and starts a new one.
Any suggestions? Thanks!
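**Update**
I tried the following approaches, but none of them produced the correct table.
install.packages("data.table")
library(data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
This produced the correct 3,205 rows, but the values for each LDR_Topic are wrong: they should be decimals and should not be identical across rows. The numbers seem to reflect LDR_7Code rather than the data. See the example below: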
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 2 | 4 | 15 |
2 | 01999120041284 | 2 | 4 | 15 |
3 | 01999300213 | 2 | 4 | 15 |
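A likely explanation for the repeated whole numbers, assuming the long data holds several rows per UPDATEDID/BRIEF_ID/LDR_Topic combination: when dcast() finds more than one value for a cell and no fun.aggregate is given, it falls back to length() and returns row counts (data.table normally reports this in a message or warning). The numbers would then be counts rather than LDR_7Code values. A minimal sketch on made-up data; the object names long, counts and wide are hypothetical:

```r
library(data.table)

# Made-up long data: two identical rows per BRIEF_ID/topic combination.
long <- data.table(
  UPDATEDID = c(1, 1, 1, 1),
  BRIEF_ID  = c("01999110036250", "01999110036250",
                "01999110036250", "01999110036250"),
  LDR_Topic = c("enhanced", "enhanced",
                "acquired resources", "acquired resources"),
  LDR_SUM   = c(0.006081761, 0.006081761, 0.02843241, 0.02843241)
)

# No fun.aggregate and duplicated cells: dcast() counts the rows per cell.
counts <- suppressWarnings(suppressMessages(
  dcast(long, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")
))

# Dropping exact duplicate rows first leaves one value per cell, so the
# LDR_SUM values come through unchanged.
wide <- dcast(unique(long), UPDATEDID + BRIEF_ID ~ LDR_Topic,
              value.var = "LDR_SUM")
```

If the duplicated rows can carry different LDR_SUM values, passing an explicit fun.aggregate (for example mean) is the other option, but which summary is appropriate depends on the data.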
Then I tried this:
install.packages("tidyverse")
library(tidyverse)
BriefingGammas6 <- BriefingGammas5 |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))
This gave the correct values for each LDR_Topic, but the wrong number of rows: the result stayed at 314,090 rows instead of 3,205. See the example:
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | NA | NA |
2 | 01999110036250 | NA | 0.010892233 | NA |
3 | 01999110036250 | NA | 0.010892233 | 0.006081761 |
4 | 01999110036250 | 0.02843241 | NA | 0.006081761 |
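Basically, it fills in the values for one topic across the 3,205 rows (repeated many times) and then starts filling in the values for the next topic. But I want the 3,205 rows to look like this: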
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | 0.010892233 | 0.006081761 |
2 | 01999120041284 | 0.1594207 | 0.005315201 | 0.004850703 |
3 | 01999300213 | 0.4374699 | 0.01607505 | 0.003971634 |
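One reading of why the row count stays at 314,090: pivot_wider() treats every column not named in names_from or values_from as an identifier, so gamma and LDR_7Code, which vary from row to row, keep the rows apart, and dropping them after the pivot is too late. The sketch below uses made-up data and assumes one LDR_SUM per BRIEF_ID and topic; id_cols is an existing pivot_wider() argument, while the object names and the gamma values are invented:

```r
library(tidyr)

long <- data.frame(
  UPDATEDID = c(1, 1, 2, 2),
  BRIEF_ID  = c("01999110036250", "01999110036250",
                "01999120041284", "01999120041284"),
  gamma     = c(0.11, 0.22, 0.33, 0.44),   # invented; differs on every row
  LDR_SUM   = c(0.02843241, 0.010892233, 0.1594207, 0.005315201),
  LDR_Topic = c("acquired resources", "distributed resources",
                "acquired resources", "distributed resources"),
  LDR_7Code = c(2, 4, 2, 4)
)

# gamma and LDR_7Code act as identifiers here, so nothing collapses.
too_long <- pivot_wider(long, names_from = LDR_Topic, values_from = LDR_SUM)

# Naming the identifier columns explicitly gives one row per briefing.
wide <- pivot_wider(long,
                    id_cols = c(UPDATEDID, BRIEF_ID),
                    names_from = LDR_Topic, values_from = LDR_SUM)
```

Selecting away gamma and LDR_7Code before the pivot should have the same effect as id_cols.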
The last thing I tried was this:
BriefingGammas7 <- reshape(data = BriefingGammas6,
  idvar = c("UPDATEDID", "BRIEF_ID"),
  timevar = "LDR_Topic",
  v.names = "LDR_SUM",
  direction = "wide")
The result was:
UPDATEDID | BRIEF_ID | "acquired resources", "distributed"... |
---|---|---|
1 | 01999110036250 | NA |
2 | 01999120041284 | NA |
No other rows came out.
**Solution update**
Step 1. Reduce the number of variables.
Step 2. Remove duplicate observations.
BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select = c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))
Step 3. Use the tidyverse approach from the answer below.
BriefingGammas8 <- BriefingGammas7 |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)
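Before pivoting, it can help to confirm that the deduplicated data really has a single row per UPDATEDID, BRIEF_ID and LDR_Topic; if it does not, pivot_wider() returns list-columns with a warning instead of plain numbers. A base R sketch on made-up data; long, key_cols, deduped and n_conflict are hypothetical names:

```r
long <- data.frame(
  UPDATEDID = c(1, 1, 1, 2),
  BRIEF_ID  = c("01999110036250", "01999110036250",
                "01999110036250", "01999120041284"),
  LDR_Topic = c("enhanced", "enhanced", "acquired resources", "enhanced"),
  LDR_SUM   = c(0.006081761, 0.006081761, 0.02843241, 0.004850703)
)
key_cols <- c("UPDATEDID", "BRIEF_ID", "LDR_Topic")

# Drop rows that repeat an earlier row exactly.
deduped <- long[!duplicated(long), ]

# Index of the first repeated key, or 0 when every key is unique.
n_conflict <- anyDuplicated(deduped[key_cols])
```

A non-zero n_conflict would mean some briefing still has two different LDR_SUM values for the same topic, which needs a decision (for example averaging) before the reshape.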
To make the case clearer, I created a second row of dummy data that follows the pattern of the first row:
# dput(dat) output, assigned back to dat so the examples below can be run
dat <- structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))
dat
  UPDATEDID       BRIEF_ID   gamma  LDR_SUM                                    LDR_Topic LDR_7Code
1        16 04999120040277 879.744 0.153269 supervises collective followers very closely         1
2        17 14999120040277 779.744 0.253269                          does something else         2
Base R way
dat |>
  reshape(direction = "wide",
          idvar = "UPDATEDID",
          timevar = "LDR_Topic",
          v.names = "LDR_SUM") |>
  subset(select = -c(gamma, LDR_7Code))
# The result
#  UPDATEDID       BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1        16 04999120040277                                             0.153269                          NA
#2        17 14999120040277                                                   NA                    0.253269
A tidyverse way
library(tidyverse)
dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))
# The result
# A tibble: 2 × 4
#  UPDATEDID BRIEF_ID       `supervises collective followers very closely` `does something else`
#      <dbl> <chr>                                                   <dbl>                 <dbl>
#1        16 04999120040277                                          0.153                NA
#2        17 14999120040277                                         NA                     0.253
A data.table way (recommended for memory efficiency)
library(data.table)
dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
# The result
#   UPDATEDID       BRIEF_ID does something else supervises collective followers very closely
#1:        16 04999120040277                  NA                                     0.153269
#2:        17 14999120040277            0.253269                                           NA
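Update
Based on your explanation, the tidyverse approach is basically right. The only problem is that there are repeated rows with NA in some columns, and you want them collapsed into a single row. That is easy to do with fill() and distinct(). The only issue in your example is that UPDATEDID changes from 1,2,3,4 to 1 without any explanation. So, for now, I assume we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.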
yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250,
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))
yourdf # I changed the spaces to '_' to make the column names easier to work with
  UPDATEDID    BRIEF_ID acquired_resources distributed_resources    enhanced
1         1 1.99911e+12         0.02843241                    NA          NA
2         2 1.99911e+12                 NA            0.01089223          NA
3         3 1.99911e+12                 NA            0.01089223 0.006081761
4         4 1.99911e+12         0.02843241                    NA 0.006081761
# drop UPDATEDID (column 1), fill the NAs in both directions, then drop the now-identical rows
yourdf[,-1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
# The result
     BRIEF_ID acquired_resources distributed_resources    enhanced
1 1.99911e+12         0.02843241            0.01089223 0.006081761
So the complete pipeline is:
dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code)) |>
  # fill within each briefing only, so values never leak between BRIEF_IDs
  group_by(BRIEF_ID) |>
  fill(everything(), .direction = 'downup') |>
  ungroup() |>
  distinct()
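An alternative to fill() followed by distinct(), offered as a sketch rather than as part of the original answer: group by BRIEF_ID and keep the first non-missing value in each topic column. It assumes that all non-missing values within one briefing and topic are identical; first_non_na and wide_na are hypothetical names, and the data is made up.

```r
library(dplyr)

# Made-up pivoted data with the NA pattern described in the question.
wide_na <- data.frame(
  BRIEF_ID = c("01999110036250", "01999110036250",
               "01999120041284", "01999120041284"),
  acquired_resources    = c(0.02843241, NA, NA, 0.1594207),
  distributed_resources = c(NA, 0.010892233, 0.005315201, NA)
)

# Hypothetical helper: first value that is not NA (NA if there is none).
first_non_na <- function(x) x[!is.na(x)][1]

# One row per BRIEF_ID, each topic column reduced to its first real value.
collapsed <- wide_na |>
  group_by(BRIEF_ID) |>
  summarise(across(everything(), first_non_na), .groups = "drop")
```

Because the collapse happens per group, a briefing that has no value at all for a topic keeps NA instead of borrowing a neighbour's value.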