Going from panel format to wide format and back in R data.table: How to preserve variable names?
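I have a panel data set with a large number of time periods and units, and a large number of variables that I observe for each time period and unit.
Since I want to apply univariate time-series operations to each unit and variable, I have to convert the panel data to wide format (using data.table::dcast), so that each column holds one variable for one unit over time.
After applying my time-series operations, I want to go back to the "long" panel format (using data.table::melt). Here, however, I lose the information about the unit names and variables. Since the data.table is quite large, I am afraid of mixing up data at this step, which is why I am looking for a melt operation that preserves the variable and value names.
Consider the following example panel data set: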
require(data.table)
dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2012-01-01"), by = "month")
id <- paste0(c("A", "B", "C"), 1:10)
DT <- data.table(
time = rep(dates, 10),
idx = rep(id, each = 60),
String1 = runif(600),
String2 = runif(600),
String3 = runif(600)
)
time idx String1 String2 String3
1: 2007-02-01 A1 0.5412122 0.23502234 0.3858354
2: 2007-03-01 A1 0.3248168 0.32884580 0.7183147
3: 2007-04-01 A1 0.4183034 0.40781723 0.7438458
4: 2007-05-01 A1 0.3597997 0.51745402 0.1660566
5: 2007-06-01 A1 0.6405351 0.96121729 0.7786483
---
596: 2011-09-01 A10 0.7896711 0.64740298 0.8285408
597: 2011-10-01 A10 0.6582652 0.83986453 0.1292342
598: 2011-11-01 A10 0.1110465 0.41741672 0.7076345
599: 2011-12-01 A10 0.5108850 0.02940229 0.9038370
600: 2012-01-01 A10 0.2605052 0.10136480 0.3881788
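I convert this panel data set to wide format. After applying some time-series operations to it (not shown here), I have to drop some columns that do not have enough data. Then I bring the data back to long format: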
variable_names <- names(DT[,-c("time", "idx")])
DT_long <- dcast(DT, time ~ idx, value.var = variable_names)
DT_long <- DT_long[,-(5:10)]
DT_wide <- melt(DT_long, measure = patterns("^String1", "^String2", "^String3"), value.name = variable_names, variable.name = "idx")
time idx String1 String2 String3
1: 2007-02-01 1 0.9794707 0.5290352 0.68009050
2: 2007-03-01 1 0.4016173 0.9229200 0.38652407
3: 2007-04-01 1 0.9475505 0.5956701 0.24686007
4: 2007-05-01 1 0.6465847 0.8233340 0.08008369
5: 2007-06-01 1 0.5704834 0.8232598 0.85790038
---
596: 2011-09-01 10 NA 0.5525413 0.79994190
597: 2011-10-01 10 NA 0.3895864 0.41347910
598: 2011-11-01 10 NA 0.3123646 0.44461146
599: 2011-12-01 10 NA 0.2148686 0.37609448
600: 2012-01-01 10 NA 0.7314114 0.47138012
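DT_wide now looks like this, which means I have lost the information about the names of the variable (here: idx); they have been replaced by integers. One fix I can imagine is to number all idx values beforehand and then do this operation. However, if possible, I would like to keep the string names, since they help me a lot in distinguishing and understanding the values.
Can someone help me rewrite the melt call so that it keeps this information?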
Reading through ?melt and the Efficient reshaping vignette, I can't see how to do this directly with melt.data.table. However, you could use pivot_longer() from the development version of tidyr:
library(data.table)
dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2007-04-01"), by = "month")
id <- c("A1", "B2")
DT <- data.table(
time = rep(dates, 2),
idx = rep(id, each = 3),
String1 = runif(6),
String2 = runif(6)
)
DT
#> time idx String1 String2
#> 1: 2007-02-01 A1 0.6453802 0.4641508
#> 2: 2007-03-01 A1 0.1106000 0.3750282
#> 3: 2007-04-01 A1 0.6356700 0.9601759
#> 4: 2007-02-01 B2 0.9821609 0.1782534
#> 5: 2007-03-01 B2 0.4786173 0.1557481
#> 6: 2007-04-01 B2 0.7720111 0.7982246
variable_names <- names(DT[, -c("time", "idx")])
DT_long <- dcast(DT, time ~ idx, value.var = variable_names)
DT_long
#> time String1_A1 String1_B2 String2_A1 String2_B2
#> 1: 2007-02-01 0.6453802 0.9821609 0.4641508 0.1782534
#> 2: 2007-03-01 0.1106000 0.4786173 0.3750282 0.1557481
#> 3: 2007-04-01 0.6356700 0.7720111 0.9601759 0.7982246
library(tidyr) # devtools::install_github("tidyverse/tidyr")
pivot_longer(
data = DT_long,
cols = starts_with("String"),
names_sep = "_",
names_to = c(".value", "idx")
)
#> # A tibble: 6 x 4
#> time idx String1 String2
#> <date> <chr> <dbl> <dbl>
#> 1 2007-02-01 A1 0.645 0.464
#> 2 2007-02-01 B2 0.982 0.178
#> 3 2007-03-01 A1 0.111 0.375
#> 4 2007-03-01 B2 0.479 0.156
#> 5 2007-04-01 A1 0.636 0.960
#> 6 2007-04-01 B2 0.772 0.798
Created on 2019-09-09 by the reprex package (v0.3.0)
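If staying inside data.table is preferred, one possible workaround (a sketch, not something the answer above proposes) is to melt every measure column into a single value column, split the old column name on the "_" that dcast inserted, and dcast once more. This assumes that neither the variable names nor the idx values contain "_" themselves; the object names wide, long and panel are only illustrative.

```r
library(data.table)

set.seed(1)
dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2007-04-01"), by = "month")
DT <- data.table(
  time = rep(dates, 2),
  idx = rep(c("A1", "B2"), each = 3),
  String1 = runif(6),
  String2 = runif(6)
)

# Panel -> wide: one column per variable/unit pair, named like "String1_A1"
wide <- dcast(DT, time ~ idx, value.var = c("String1", "String2"))

# Simulate dropping a column that lacks data after the time-series step
wide[, String1_B2 := NULL]

# Wide -> fully long: the old column names end up in the "variable" column
long <- melt(wide, id.vars = "time", variable.factor = FALSE)

# Split "String1_A1" into the variable name and the unit name
# (assumes "_" occurs only as the separator added by dcast)
long[, c("variable", "idx") := tstrsplit(variable, "_", fixed = TRUE)]

# Back to panel format: one column per variable, idx kept as a string
panel <- dcast(long, time + idx ~ variable, value.var = "value")
panel
```

Because the unit name is recovered from the column name rather than from the column position, dropped columns simply show up as NA for that unit and variable instead of shifting the numbering. The intermediate fully long table has one row per cell of the wide table, so for a very large data set the extra memory use may matter.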