Check which interval each value lies within for each day, and create a new matrix recording that interval, in R

Question

I am working with a dataset called price.daily, which contains daily price information for more than 700 cryptocurrencies over the period 2014-01-01 to 2019-12-31:

            Bitcoin     Ethereum    XRP      Bitcoin.SV    Stellar    ...
   ...
2018-01-01  13657.20    772.64      2.39     NA            0.480008
2018-01-02  14982.10    884.44      2.48     NA            0.564766
2018-01-03  15201.00    962.72      3.11     NA            0.896227
   ...

I then compute the quantiles for each day using sapply, as suggested in another question, and this works fine:

col.daily <- seq(1,length(price.daily$Bitcoin))
quantile.daily = sapply(col.daily, function(y) {quantile(x = unlist(price.daily[y,] ), seq(0,1, length=6),na.rm = TRUE )})
quantile.daily.t = t(quantile.daily)
rownames(quantile.daily.t) = rownames(price.daily)

This gives me the boundaries of my intervals:

             0%         20%         40%         60%         80%     100%
   ...
2018-01-01   2.60e-05   0.1681120   0.7189722   2.3060000   9.392   13657.20
2018-01-02   3.40e-05   0.1946376   0.7232178   2.4240000   10.092  14982.10
2018-01-03   3.80e-05   0.1982452   0.7771724   2.4820000   10.054  15201.00
   ...

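What I would like to do next is take the price of each cryptocurrency on each day, check which interval it lies within, and create a new matrix containing the numbers 1 to 5, or NA where no data is available. It should look like this: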

            Bitcoin   Ethereum    XRP     Bitcoin.SV    Stellar   ...
   ...
2018-01-01  5         5           4       NA            2
2018-01-02  5         5           4       NA            2
2018-01-03  5         5           4       NA            3
   ...

I suppose I could also use sapply for this?

A sample of my data, using dput(head(price.daily)) for my price.daily data:

structure(list(Bitcoin = c(771.4, 802.39, 818.72, 859.51, 933.53, 
953.29), Ethereum = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), XRP = c(0.026944, 0.028726, 0.027627, 0.028129, 
0.02523, 0.0257), Bitcoin.Cash = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), row.names = c("2014-01-01", 
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), class = "data.frame")

and for the quantiles:

structure(c(0.00044, 0.000353, 0.000303, 0.000301, 0.000271, 
0.00001, 0.0330034, 0.0319948, 0.0327684, 0.0318646, 0.0274614, 
0.0237276, 0.161692, 0.1793948, 0.163744, 0.1610448, 0.1579238, 
0.0728448, 3.014, 3.728, 3.85, 3.87, 3.814, 2.54200000000001, 
6.036, 7.578, 7.14, 7.434, 7.474, 7.188, 771.4, 802.39, 818.72, 
859.51, 933.53, 953.29), .Dim = c(6L, 6L), .Dimnames = list(c("2014-01-01", 
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"
), c("0%", "20%", "40%", "60%", "80%", "100%")))

Answer 1

The function findInterval is exactly what you need. The only difficulty is applying it to the right data.
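To show how findInterval assigns interval numbers, here is a minimal sketch. The break points and prices are made up for illustration and are not taken from the question's data. It also shows one detail worth checking against the desired output: by default the intervals are closed on the left and open on the right, so a value equal to the last break point (such as the daily maximum, which equals the 100% quantile) falls into a sixth interval unless rightmost.closed = TRUE is set.

```r
# Hypothetical break points for a single day (illustration only)
breaks <- c(0, 1, 2, 3, 4, 5)
prices <- c(0.5, 2, 4.9, 5, NA)

# Default: intervals are [b_k, b_{k+1}), so a value equal to the last
# break point lands in an extra 6th interval. NA stays NA.
default_idx <- findInterval(prices, breaks)

# rightmost.closed = TRUE treats the last interval as closed on the right,
# so the maximum is folded into the 5th interval.
closed_idx <- findInterval(prices, breaks, rightmost.closed = TRUE)

print(default_idx)
print(closed_idx)
```

With quantile break points, the second form is probably the one that matches the 1 to 5 numbering asked for in the question.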

A simple solution with a loop:

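# quantile.daily must have one row per day, as in the dput sample above
result_loop = price.daily
for (i in seq_len(nrow(price.daily))) {
  # rightmost.closed = TRUE keeps the day's maximum in interval 5 instead of 6
  result_loop[i, ] = findInterval(price.daily[i, ], quantile.daily[i, ],
                                  rightmost.closed = TRUE)
}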

An alternative solution without a loop:

n = ncol(price.daily)
combined = cbind(price.daily, quantile.daily)
# For each row: the first n entries are prices, the rest are that day's break points
result_alternative = as.data.frame(t(apply(combined, 1, function(x)
  findInterval(x[1:n], x[(n + 1):ncol(combined)], rightmost.closed = TRUE))))
colnames(result_alternative) = colnames(price.daily)

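The second solution (inspired by this answer to a similar question) has some additional drawbacks, such as the memory overhead of the combined variable. Even without those, I would still use the first solution. Using language constructs to avoid a loop can be tempting, but in many cases it makes debugging and maintenance harder.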

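As a side note: the result could be a matrix instead of a data frame, but since price.daily is (unnecessarily) given as a data frame, I chose to use the same class for the result.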
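For completeness, a matrix-based variant might look like the sketch below. It uses the first three days of the sample data from the question. The names price.mat, quantile.mat and result.mat are hypothetical, and rightmost.closed = TRUE is an assumption made so that the daily maximum is numbered 5, as in the desired output.

```r
# First three days of the question's sample data, stored as matrices
days <- c("2014-01-01", "2014-01-02", "2014-01-03")
price.mat <- matrix(
  c(771.4, 802.39, 818.72,           # Bitcoin
    NA, NA, NA,                      # Ethereum
    0.026944, 0.028726, 0.027627),   # XRP
  nrow = 3, dimnames = list(days, c("Bitcoin", "Ethereum", "XRP"))
)
quantile.mat <- matrix(
  c(0.00044, 0.000353, 0.000303,     # 0%
    0.0330034, 0.0319948, 0.0327684, # 20%
    0.161692, 0.1793948, 0.163744,   # 40%
    3.014, 3.728, 3.85,              # 60%
    6.036, 7.578, 7.14,              # 80%
    771.4, 802.39, 818.72),          # 100%
  nrow = 3,
  dimnames = list(days, c("0%", "20%", "40%", "60%", "80%", "100%"))
)

# Pre-allocate an integer matrix with the same shape and names as the prices
result.mat <- matrix(NA_integer_, nrow = nrow(price.mat), ncol = ncol(price.mat),
                     dimnames = dimnames(price.mat))
for (i in seq_len(nrow(price.mat))) {
  # One call per day: the prices of that day against that day's break points
  result.mat[i, ] <- findInterval(price.mat[i, ], quantile.mat[i, ],
                                  rightmost.closed = TRUE)
}
print(result.mat)
```

Indexing a row of a matrix returns a plain numeric vector, so no coercion from a one-row data frame is needed inside the loop.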

r

intervals

sapply