Reformat a data frame with months spread into columns and ordered by calendar order in R

Question

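I have the data.frame given below. I am trying to reshape it from long to wide format, using the dates as the spread columns. Using the spread function from the tidyr package gives me two problems:

The data is filled with NA
The months are sorted alphabetically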

So how do I go from

30-Apr-2015 632.95
28-May-2015 532.95
25-Jun-2015 232.95

to

30-Apr-2015 28-May-2015 25-Jun-2015
632.95      532.95      232.95

Instead, I end up with

30-Apr-2015 25-Jun-2015 28-May-2015 
632.95      NA      232.95
NA          232.95  NA
NA          NA      532.95

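The actual dates do not matter, but their relative order does, i.e. the nearest month's data should come in the first column, followed by the data for the other two months in order. This is necessary because I use rbind on the results.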

The code I tried

data = tidyr::spread(data, key = EXPIRY_DT, value = CHG_IN_OI)
colnames(data)[3:5] = c('Month1', 'Month2', 'Month3')

The data.frame is as follows:

data = structure(list(SYMBOL = c("A", "A", "A", "B", "B", "B", "C", 
"C", "C", "D", "D", "D"), EXPIRY_DT = c("30-Apr-2015", "28-May-2015", 
"25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015", "30-Apr-2015", 
"28-May-2015", "25-Jun-2015", "30-Apr-2015", "28-May-2015", "25-Jun-2015"
), OPEN = c(1750, 1789, 0, 1627.5, 1653.3, 0, 632.95, 644.1, 
0, 317.8, 319.5, 0), HIGH = c(1788.05, 1795, 0, 1656.5, 1653.3, 
0, 646.4, 650.5, 0, 324.6, 326.65, 0), LOW = c(1746, 1760, 0, 
1627.5, 1645.45, 0, 629.65, 635, 0, 315.85, 318.4, 0), CLOSE = c(1782.3, 
1791.85, 1695.1, 1642.95, 1646.75, 1613.9, 640.85, 644.35, 614.6, 
320.55, 322.35, 310.85), SETTLE_PR = c(1782.3, 1791.85, 1804.8, 
1642.95, 1653.85, 1664.35, 640.85, 644.35, 649.1, 320.55, 322.35, 
325.35), CONTRACTS = c(1469L, 78L, 0L, 2638L, 14L, 0L, 4964L, 
181L, 0L, 3416L, 82L, 0L), VALUE = c(6496.96, 347.91, 0, 10830.05, 
57.68, 0, 15869.41, 583.38, 0, 10969.31, 264.93, 0), OPEN_INT = c(1353750L, 
8500L, 0L, 1377250L, 17000L, 0L, 6264000L, 98000L, 0L, 8228000L, 
216000L, 0L), CHG_IN_OI = c(15250L, 1250L, 0L, -21000L, 1500L, 
0L, 73500L, 6000L, 0L, -192000L, 13000L, 0L), TIMESTAMP = c("10-APR-2015", 
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", 
"10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", "10-APR-2015", 
"10-APR-2015")), .Names = c("SYMBOL", "EXPIRY_DT", "OPEN", "HIGH", 
"LOW", "CLOSE", "SETTLE_PR", "CONTRACTS", "VALUE", "OPEN_INT", 
"CHG_IN_OI", "TIMESTAMP"), row.names = 40:51, class = "data.frame")

Thanks for reading.

Edit:

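Following @akrun's comment, I am adding the expected output. Since the values differ for each date, each month's data needs to be placed side by side, and the column names get the string 'Month1/2/3' appended instead of the actual date. Hope this helps.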

output = structure(list(SYMBOL = c("A", "B", "C", "D"), TIMESTAMP = c("10-Apr-15", 
"10-Apr-15", "10-Apr-15", "10-Apr-15"), OPEN.Month1 = c(1750, 
1627.5, 632.95, 317.8), HIGH.Month1 = c(1788.05, 1656.5, 646.4, 
324.6), LOW.Month1 = c(1746, 1627.5, 629.65, 315.85), CLOSE.Month1 = c(1782.3, 
1642.95, 640.85, 320.55), SETTLE_PR.Month1 = c(1782.3, 1642.95, 
640.85, 320.55), CONTRACTS.Month1 = c(1469L, 2638L, 4964L, 3416L
), VALUE.Month1 = c(6496.96, 10830.05, 15869.41, 10969.31), OPEN_INT.Month1 = c(1353750L, 
1377250L, 6264000L, 8228000L), CHG_IN_OI.Month1 = c(15250L, -21000L, 
73500L, -192000L), OPEN.Month2 = c(1789, 1653.3, 644.1, 319.5
), HIGH.Month2 = c(1795, 1653.3, 650.5, 326.65), LOW.Month2 = c(1760, 
1645.45, 635, 318.4), CLOSE.Month2 = c(1791.85, 1646.75, 644.35, 
322.35), SETTLE_PR.Month2 = c(1791.85, 1653.85, 644.35, 322.35
), CONTRACTS.Month2 = c(78L, 14L, 181L, 82L), VALUE.Month2 = c(347.91, 
57.68, 583.38, 264.93), OPEN_INT.Month2 = c(8500L, 17000L, 98000L, 
216000L), CHG_IN_OI.Month2 = c(1250L, 1500L, 6000L, 13000L), 
    OPEN.Month3 = c(0L, 0L, 0L, 0L), HIGH.Month3 = c(0L, 0L, 
    0L, 0L), LOW.Month3 = c(0L, 0L, 0L, 0L), CLOSE.Month3 = c(1695.1, 
    1613.9, 614.6, 310.85), SETTLE_PR.Month3 = c(1804.8, 1664.35, 
    649.1, 325.35), CONTRACTS.Month3 = c(0L, 0L, 0L, 0L), VALUE.Month3 = c(0L, 
    0L, 0L, 0L), OPEN_INT.Month3 = c(0L, 0L, 0L, 0L), CHG_IN_OI.Month3 = c(0L, 
    0L, 0L, 0L)), .Names = c("SYMBOL", "TIMESTAMP", "OPEN.Month1", 
"HIGH.Month1", "LOW.Month1", "CLOSE.Month1", "SETTLE_PR.Month1", 
"CONTRACTS.Month1", "VALUE.Month1", "OPEN_INT.Month1", "CHG_IN_OI.Month1", 
"OPEN.Month2", "HIGH.Month2", "LOW.Month2", "CLOSE.Month2", "SETTLE_PR.Month2", 
"CONTRACTS.Month2", "VALUE.Month2", "OPEN_INT.Month2", "CHG_IN_OI.Month2", 
"OPEN.Month3", "HIGH.Month3", "LOW.Month3", "CLOSE.Month3", "SETTLE_PR.Month3", 
"CONTRACTS.Month3", "VALUE.Month3", "OPEN_INT.Month3", "CHG_IN_OI.Month3"
), class = "data.frame", row.names = c(NA, -4L))

Answer 1

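We can use the devel version of data.table, i.e. 'v1.9.5', in which dcast can take multiple "value.vars". Instructions for installing the devel version are here.

Convert the 'data.frame' to a 'data.table' (setDT(data)). Create a "Month" column by pasting 'Month' together with the row number within each "SYMBOL". Then we can use dcast, specifying value.var as columns '3:11'.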

library(data.table)
res <- dcast(setDT(data)[, Month:=paste0('Month', 1:.N), by=SYMBOL],
                 SYMBOL+TIMESTAMP~Month, value.var=names(data)[3:11])
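
To make the mechanics easier to follow, here is a minimal sketch of the same idea on a small made-up table. The object toy and its values are hypothetical and not part of the question's data; it assumes a data.table version whose dcast accepts several value.var columns. The exact names of the resulting columns depend on the data.table version, so the sketch prints them rather than assuming a pattern.

```r
library(data.table)

# Hypothetical toy data: two symbols, two expiries each, two value columns.
toy <- data.table(
  SYMBOL = c("A", "A", "B", "B"),
  OPEN   = c(10, 11, 20, 21),
  CLOSE  = c(12, 13, 22, 23)
)

# Number the rows within each SYMBOL; .N is the number of rows in the group.
toy[, Month := paste0("Month", seq_len(.N)), by = SYMBOL]

# Cast both value columns at once: one row per SYMBOL,
# one column per (value column, Month) combination.
wide <- dcast(toy, SYMBOL ~ Month, value.var = c("OPEN", "CLOSE"))

print(names(wide))
print(wide)
```

Because the Month label is derived from row position rather than from the date string, the columns come out in the order of the input rows, which avoids the alphabetical sorting of month names.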

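If we need to change the column names to the specific format in 'output', use setnames. I rearranged the column order to match the expected result ('output') and converted the data.table back to a data.frame (setDF).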

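# Backreferences need a doubled backslash inside an R string.
# The pattern assumes dcast names of the form 'Month1_OPEN', which is what
# this answer's regex implies for v1.9.5. Released versions produce
# 'OPEN_Month1'; there, sub('_(Month[0-9]+)$', '.\\1', colnames(res)) applies.
setnames(res, sub('([^_]+)_(.*)', '\\2.\\1', colnames(res)))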
res1 <- setDF(res[,names(output), with=FALSE])
res1
#  SYMBOL   TIMESTAMP OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1
#1      A 10-APR-2015     1750.00     1788.05    1746.00      1782.30
#2      B 10-APR-2015     1627.50     1656.50    1627.50      1642.95
#3      C 10-APR-2015      632.95      646.40     629.65       640.85
#4      D 10-APR-2015      317.80      324.60     315.85       320.55
#  SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1
#1          1782.30             1469      6496.96         1353750
#2          1642.95             2638     10830.05         1377250
#3           640.85             4964     15869.41         6264000
#4           320.55             3416     10969.31         8228000
#  CHG_IN_OI.Month1 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2
#1            15250      1789.0     1795.00    1760.00      1791.85
#2           -21000      1653.3     1653.30    1645.45      1646.75
#3            73500       644.1      650.50     635.00       644.35
#4          -192000       319.5      326.65     318.40       322.35
#  SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2
#1          1791.85               78       347.91            8500
#2          1653.85               14        57.68           17000
#3           644.35              181       583.38           98000
#4           322.35               82       264.93          216000
#  CHG_IN_OI.Month2 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3  
#1             1250           0           0          0      1695.10
#2             1500           0           0          0      1613.90
#3             6000           0           0          0       614.60
#4            13000           0           0          0       310.85
#  SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3
#1          1804.80                0            0               0
#2          1664.35                0            0               0
#3           649.10                0            0               0
#4           325.35                0            0               0
#  CHG_IN_OI.Month3
#1                0
#2                0
#3                0
#4                0

The TIMESTAMP column in 'output' has a different format. After changing the format in 'res1', it is identical to the expected output.

res1$TIMESTAMP <- format(as.Date(res1$TIMESTAMP, '%d-%b-%Y'), '%d-%b-%y')
all.equal(output, res1)
#[1] TRUE

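Or we can use reshape from base R, which does accept multiple value columns. Just as we created a sequence before, here we can use ave to create the 'MONTH' column and use it as the timevar in reshape.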

 data$MONTH <- with(data, paste0('MONTH', ave(seq_along(SYMBOL), 
                    SYMBOL, FUN=seq_along)))
 res2 <- reshape(data[-2], idvar=c('SYMBOL', 'TIMESTAMP'), 
                          timevar='MONTH', direction='wide')
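
As a rough illustration of what reshape produces, the following sketch runs the same steps on a small made-up data.frame (toy is hypothetical, not the question's data). It uses the prefix 'Month' instead of 'MONTH' so that the generated names follow the 'OPEN.Month1' pattern of the expected output.

```r
# Hypothetical toy data: two symbols, two expiries each, two value columns.
toy <- data.frame(
  SYMBOL = c("A", "A", "B", "B"),
  OPEN   = c(10, 11, 20, 21),
  CLOSE  = c(12, 13, 22, 23),
  stringsAsFactors = FALSE
)

# Row number within each SYMBOL, turned into a label used as timevar.
toy$MONTH <- paste0("Month",
                    ave(seq_along(toy$SYMBOL), toy$SYMBOL, FUN = seq_along))

# Every column that is neither idvar nor timevar is spread,
# with the timevar value appended after a dot.
wide <- reshape(toy, idvar = "SYMBOL", timevar = "MONTH", direction = "wide")

names(wide)
# "SYMBOL" "OPEN.Month1" "CLOSE.Month1" "OPEN.Month2" "CLOSE.Month2"
```

Note that reshape groups the columns by month (all Month1 columns first), which matches the column layout of 'output'.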

Answer 2

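A very tricky problem. I have devised a solution that comes very close to your sample output; you should be able to clean up the small differences afterward (see the end of my answer for a summary of the differences).

Assumptions

First, let me start with my assumptions:

The input data.frame data is already ordered correctly by EXPIRY_DT (independently for each SYMBOL). Your sample input satisfies this assumption. Now, as a general recommendation, you should try to always use ISO 8601 for date formats, which naturally sort lexicographically, and would naturally allow you to coerce to Date format in R. Given your input date formats, if you wanted to guarantee proper order, you would have to call as.Date() and pass the input format, and then make a call to order(). I have not included this in my code and instead assume the data is already ordered.
Because your sample output seems to have unified all values of TIMESTAMP for each SYMBOL, I have assumed that these two columns comprise a multicolumn primary key for the data. If this is incorrect, you can just change the keys variable I define in my code to not include TIMESTAMP. In that case, however, you will get additional TIMESTAMP.Month{mnum} columns in the output (which you can remove afterward if you wish).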
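
If the ordering assumption does not hold, the as.Date() plus order() step mentioned above could look roughly like the following sketch. The toy data.frame is made up for illustration. Parsing '%b' depends on the session locale, so the sketch assumes a locale with English month abbreviations.

```r
# Hypothetical toy data with expiries deliberately out of calendar order.
toy <- data.frame(
  SYMBOL    = c("A", "A", "A"),
  EXPIRY_DT = c("25-Jun-2015", "30-Apr-2015", "28-May-2015"),
  stringsAsFactors = FALSE
)

# Parse the dd-Mon-yyyy strings into Date objects.
# Assumption: the locale understands English month abbreviations.
expiry <- as.Date(toy$EXPIRY_DT, format = "%d-%b-%Y")

# Order by symbol first, then by the parsed expiry date.
sorted <- toy[order(toy$SYMBOL, expiry), ]

sorted$EXPIRY_DT
# "30-Apr-2015" "28-May-2015" "25-Jun-2015"
```

Sorting the character strings directly would put 25-Jun before 28-May and 30-Apr, which is why the conversion to Date is needed.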

Code

keys <- c('SYMBOL','TIMESTAMP');
mnum <- ave(1:nrow(data), data[,keys], FUN=seq_along );
mnum;
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
mdata <- lapply(1:max(mnum), function(x) setNames(data[mnum==x,],ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))) );
mdata;
## [[1]]
##    SYMBOL EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1   TIMESTAMP
## 40      A      30-Apr-2015     1750.00     1788.05    1746.00      1782.30          1782.30             1469      6496.96         1353750            15250 10-APR-2015
## 43      B      30-Apr-2015     1627.50     1656.50    1627.50      1642.95          1642.95             2638     10830.05         1377250           -21000 10-APR-2015
## 46      C      30-Apr-2015      632.95      646.40     629.65       640.85           640.85             4964     15869.41         6264000            73500 10-APR-2015
## 49      D      30-Apr-2015      317.80      324.60     315.85       320.55           320.55             3416     10969.31         8228000          -192000 10-APR-2015
## 
## [[2]]
##    SYMBOL EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2   TIMESTAMP
## 41      A      28-May-2015      1789.0     1795.00    1760.00      1791.85          1791.85               78       347.91            8500             1250 10-APR-2015
## 44      B      28-May-2015      1653.3     1653.30    1645.45      1646.75          1653.85               14        57.68           17000             1500 10-APR-2015
## 47      C      28-May-2015       644.1      650.50     635.00       644.35           644.35              181       583.38           98000             6000 10-APR-2015
## 50      D      28-May-2015       319.5      326.65     318.40       322.35           322.35               82       264.93          216000            13000 10-APR-2015
## 
## [[3]]
##    SYMBOL EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3   TIMESTAMP
## 42      A      25-Jun-2015           0           0          0      1695.10          1804.80                0            0               0                0 10-APR-2015
## 45      B      25-Jun-2015           0           0          0      1613.90          1664.35                0            0               0                0 10-APR-2015
## 48      C      25-Jun-2015           0           0          0       614.60           649.10                0            0               0                0 10-APR-2015
## 51      D      25-Jun-2015           0           0          0       310.85           325.35                0            0               0                0 10-APR-2015
## 
res <- Reduce(function(x,y) merge(x,y,by=keys,all=T), mdata );
res;
##   SYMBOL   TIMESTAMP EXPIRY_DT.Month1 OPEN.Month1 HIGH.Month1 LOW.Month1 CLOSE.Month1 SETTLE_PR.Month1 CONTRACTS.Month1 VALUE.Month1 OPEN_INT.Month1 CHG_IN_OI.Month1 EXPIRY_DT.Month2 OPEN.Month2 HIGH.Month2 LOW.Month2 CLOSE.Month2 SETTLE_PR.Month2 CONTRACTS.Month2 VALUE.Month2 OPEN_INT.Month2 CHG_IN_OI.Month2 EXPIRY_DT.Month3 OPEN.Month3 HIGH.Month3 LOW.Month3 CLOSE.Month3 SETTLE_PR.Month3 CONTRACTS.Month3 VALUE.Month3 OPEN_INT.Month3 CHG_IN_OI.Month3
## 1      A 10-APR-2015      30-Apr-2015     1750.00     1788.05    1746.00      1782.30          1782.30             1469      6496.96         1353750            15250      28-May-2015      1789.0     1795.00    1760.00      1791.85          1791.85               78       347.91            8500             1250      25-Jun-2015           0           0          0      1695.10          1804.80                0            0               0                0
## 2      B 10-APR-2015      30-Apr-2015     1627.50     1656.50    1627.50      1642.95          1642.95             2638     10830.05         1377250           -21000      28-May-2015      1653.3     1653.30    1645.45      1646.75          1653.85               14        57.68           17000             1500      25-Jun-2015           0           0          0      1613.90          1664.35                0            0               0                0
## 3      C 10-APR-2015      30-Apr-2015      632.95      646.40     629.65       640.85           640.85             4964     15869.41         6264000            73500      28-May-2015       644.1      650.50     635.00       644.35           644.35              181       583.38           98000             6000      25-Jun-2015           0           0          0       614.60           649.10                0            0               0                0
## 4      D 10-APR-2015      30-Apr-2015      317.80      324.60     315.85       320.55           320.55             3416     10969.31         8228000          -192000      28-May-2015       319.5      326.65     318.40       322.35           322.35               82       264.93          216000            13000      25-Jun-2015           0           0          0       310.85           325.35                0            0               0                0

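Explanation

As you can see, the heart of my solution involves splitting the input data into separate data.frames by month, which makes it possible to add a suffix to all non-key columns independently for each split, and then repeatedly calling merge() to join them back together.

The mnum vector represents the "month number". You can think of it as a sort of "detached" column of the input data object; it gives the month number, within its primary-key group, of each row in data. I use ave() to call seq_along() once for each group, which generates a vector of consecutive integers whose length equals the group size (i.e. the number of rows in the group), and which ave() maps back to the positions of the group's rows in the original data object.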
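
The behaviour of ave() with seq_along() is easiest to see on a tiny example. The sketch below uses a made-up data.frame (toy is hypothetical) in which the groups have different sizes, and passes the two key columns to ave() as separate grouping vectors.

```r
# Hypothetical toy data: symbol A has three rows, symbol B has two.
toy <- data.frame(
  SYMBOL    = c("A", "A", "A", "B", "B"),
  TIMESTAMP = "10-APR-2015",
  stringsAsFactors = FALSE
)

# For each (SYMBOL, TIMESTAMP) group, seq_along() yields 1..group size;
# ave() puts those numbers back at the original row positions.
mnum <- ave(seq_len(nrow(toy)), toy$SYMBOL, toy$TIMESTAMP, FUN = seq_along)

mnum
# 1 2 3 1 2

# Logical indexing then picks out all rows that are "month 2" in their group.
toy[mnum == 2, ]
```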

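The mdata object is a list of data.frames, where each component represents one month. The actual extraction of the rows with a particular month number is done with a simple logical indexing operation: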

data[mnum==x,]

where x is the month number supplied by lapply() as it iterates over 1:max(mnum). The suffixing of non-key column names is done using setNames(), deriving the replacement column names as follows:

ifelse(names(data)%in%keys,names(data),paste0(names(data),'.Month',x))

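The above leaves the names of the key columns untouched, but appends '.Month{mnum}' to the names of all non-key columns.

Finally, all of the month-number splits must be merged back into a single data.frame. I thought I would be able to use a single call to merge() (possibly requiring do.call()) to do this, but was disappointed to discover that it only takes two arguments to merge, x and y (also see Simultaneously merge multiple data.frames in a list). Thus, I needed a little help from Reduce() to achieve the repeated calls. The all=T argument becomes important if different symbols have different numbers of expiry months; the "short" symbols will then be absent from the RHS of the later merges, and so would be dropped if all=T were not passed.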
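
The effect of all=T can be checked with a small sketch. The three data.frames below are made up for illustration (m1, m2 and m3 stand in for the per-month splits); symbol B has no third expiry.

```r
keys <- "SYMBOL"

# Hypothetical per-month splits; B is missing from the third month.
m1 <- data.frame(SYMBOL = c("A", "B"), OPEN.Month1 = c(10, 20),
                 stringsAsFactors = FALSE)
m2 <- data.frame(SYMBOL = c("A", "B"), OPEN.Month2 = c(11, 21),
                 stringsAsFactors = FALSE)
m3 <- data.frame(SYMBOL = "A", OPEN.Month3 = 12,
                 stringsAsFactors = FALSE)
splits <- list(m1, m2, m3)

# Full outer join at every step: B is kept, with NA for the missing month.
outer_res <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), splits)

# Default inner join: B is dropped as soon as it is absent from one split.
inner_res <- Reduce(function(x, y) merge(x, y, by = keys), splits)

nrow(outer_res)  # 2
nrow(inner_res)  # 1
```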

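Differences

My output does not exactly match your sample output. Here are the differences:

Your sample output appears to have changed the format of the TIMESTAMP column from that in the input, e.g. 10-APR-2015 became 10-Apr-15. My code does not touch TIMESTAMP.
Your sample output omits the EXPIRY_DT columns, which my solution retains under their suffixed names EXPIRY_DT.Month1, EXPIRY_DT.Month2, etc. If you wish, you can remove these columns afterward using grep() on names() and negative indexing.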
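
Dropping the suffixed EXPIRY_DT columns with grep() and negative indexing might look like the sketch below. The res_toy data.frame is a made-up stand-in for the merged result. The guard on length(drop) is an addition of this sketch: indexing with an empty negative vector would select no columns at all.

```r
# Hypothetical stand-in for the merged result.
res_toy <- data.frame(
  SYMBOL           = "A",
  EXPIRY_DT.Month1 = "30-Apr-2015",
  OPEN.Month1      = 10,
  EXPIRY_DT.Month2 = "28-May-2015",
  OPEN.Month2      = 11,
  stringsAsFactors = FALSE
)

# Positions of all columns whose name starts with "EXPIRY_DT.".
drop <- grep("^EXPIRY_DT\\.", names(res_toy))

# Negative indexing removes them; skip the step if nothing matched.
cleaned <- if (length(drop) > 0) res_toy[, -drop, drop = FALSE] else res_toy

names(cleaned)
# "SYMBOL" "OPEN.Month1" "OPEN.Month2"
```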

Answer 3

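I remembered that aggregate() has an overload for data.frames that can be used to achieve this requirement. The column names and order won't be exactly what you asked for, but they are certainly logical and usable (and can be adjusted afterward):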

keys <- c('SYMBOL','TIMESTAMP');
aggregate(data[,!(names(data)%in%keys)],data[,names(data)%in%keys],identity);
##   SYMBOL   TIMESTAMP EXPIRY_DT.1 EXPIRY_DT.2 EXPIRY_DT.3  OPEN.1  OPEN.2  OPEN.3  HIGH.1  HIGH.2  HIGH.3   LOW.1   LOW.2   LOW.3 CLOSE.1 CLOSE.2 CLOSE.3 SETTLE_PR.1 SETTLE_PR.2 SETTLE_PR.3 CONTRACTS.1 CONTRACTS.2 CONTRACTS.3  VALUE.1  VALUE.2  VALUE.3 OPEN_INT.1 OPEN_INT.2 OPEN_INT.3 CHG_IN_OI.1 CHG_IN_OI.2 CHG_IN_OI.3
## 1      A 10-APR-2015 30-Apr-2015 28-May-2015 25-Jun-2015 1750.00 1789.00    0.00 1788.05 1795.00    0.00 1746.00 1760.00    0.00 1782.30 1791.85 1695.10     1782.30     1791.85     1804.80        1469          78           0  6496.96   347.91     0.00    1353750       8500          0       15250        1250           0
## 2      B 10-APR-2015 30-Apr-2015 28-May-2015 25-Jun-2015 1627.50 1653.30    0.00 1656.50 1653.30    0.00 1627.50 1645.45    0.00 1642.95 1646.75 1613.90     1642.95     1653.85     1664.35        2638          14           0 10830.05    57.68     0.00    1377250      17000          0      -21000        1500           0
## 3      C 10-APR-2015 30-Apr-2015 28-May-2015 25-Jun-2015  632.95  644.10    0.00  646.40  650.50    0.00  629.65  635.00    0.00  640.85  644.35  614.60      640.85      644.35      649.10        4964         181           0 15869.41   583.38     0.00    6264000      98000          0       73500        6000           0
## 4      D 10-APR-2015 30-Apr-2015 28-May-2015 25-Jun-2015  317.80  319.50    0.00  324.60  326.65    0.00  315.85  318.40    0.00  320.55  322.35  310.85      320.55      322.35      325.35        3416          82           0 10969.31   264.93     0.00    8228000     216000          0     -192000       13000           0

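A clean, simple solution in base R!

Edit: Thanks to @Frash for pointing out a problem with the above "solution": aggregate() returns each non-key variable as a single matrix column rather than as independent columns. This can be corrected by wrapping aggregate() as follows: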

do.call(data.frame,...);

This works because data.frame() automatically expands matrices to independent columns in the resulting data.frame (except for matrices of class "model.matrix" and those protected by I()).
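
The difference can be seen on a small made-up data.frame (toy is hypothetical and much smaller than the question's data). This is only a sketch of the wrapping idea, not the exact call from the answer.

```r
# Hypothetical toy data: two symbols, two rows each.
toy <- data.frame(
  SYMBOL = c("A", "A", "B", "B"),
  OPEN   = c(10, 11, 20, 21),
  stringsAsFactors = FALSE
)

# identity returns both values of each group, so OPEN becomes a matrix column.
agg <- aggregate(toy["OPEN"], toy["SYMBOL"], identity)
ncol(agg)            # 2: SYMBOL plus one matrix column
is.matrix(agg$OPEN)  # TRUE

# data.frame() expands the matrix into one ordinary column per month.
flat <- do.call(data.frame, agg)
ncol(flat)           # 3
names(flat)
```

Printing agg and flat looks almost the same, which is why the matrix columns are easy to miss until a later step such as rbind or column selection behaves unexpectedly.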
