How can I use a vectorised function to multiply all values in a different data frame for a given ID in R?

Question

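I have a huge data set containing 750,000 IDs, for which I want to aggregate monthly values into yearly values by multiplying all values for a given ID. Each ID consists of an identification number and a year.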

The data I want to extract:

ID	monthly value
1 - 1997	Product of Monthly Values in Year 1997
1 - 1998	Product of Monthly Values in Year 1998
1 - 1999	Product of Monthly Values in Year 1999
...	...
2 - 1997	Product of Monthly Values in Year 1997
2 - 1998	Product of Monthly Values in Year 1998
2 - 1999	Product of Monthly Values in Year 1999
...	...

The data set that serves as the source:

ID	monthly value
1 - 1997	Monthly Value 1 in Year 1997
1 - 1997	Monthly Value 2 in Year 1997
1 - 1997	Monthly Value 3 in Year 1997
...	...
2 - 1997	Monthly Value 1 in Year 1997
2 - 1997	Monthly Value 2 in Year 1997
2 - 1997	Monthly Value 3 in Year 1997
...	...

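I have written a for loop that takes about 0.74 seconds for 10 IDs, which is very slow. At that rate, running through the whole data set would take about 15 hours. The loop multiplies all monthly values for a given ID and stores the result in a separate data frame.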

for (i in 1:nrow(yearlyreturns)){
  
  yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
  yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
  
}

I don't know how to get from here to a vectorised function that takes less time.

Is this possible in R?

Answer 1

Something like this:

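library(dplyr)
library(stringr)  # provides str_replace()

df %>% 
  # backslashes must be doubled inside an R string literal
  mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>% 
  group_by(ID, monthly_value) %>% 
  summarise()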

  ID       monthly_value                         
  <chr>    <chr>                                 
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997

Data:

df <- structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", 
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997", 
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997", 
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997", 
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA, 
-6L))
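
The snippet above works on the placeholder strings from the question. If the column instead holds numeric monthly values, as the `change` column in the question's loop appears to, a grouped `summarise()` with `prod()` might look like the sketch below. The data is invented for illustration, and treating `change` as 1 + the monthly return is an assumption based on the `- 1` in the loop.

```r
library(dplyr)

# Toy data (invented): `change` is assumed to be 1 + the monthly return
monthlyreturns <- data.frame(
  ID     = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", "2 - 1997"),
  change = c(1.10, 0.90, 1.05, 1.02, 1.03)
)

# One row per ID: compounded yearly return and number of months found
yearlyreturns <- monthlyreturns %>%
  group_by(ID) %>%
  summarise(
    yret       = prod(change) - 1,
    monthcount = n(),
    .groups    = "drop"
  )

print(yearlyreturns)
```

Because the grouping happens once, the monthly table is not rescanned for every ID the way it is in the loop.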

Answer 2

Based on the for loop code, this could be done with a join:

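library(data.table)
setDT(yearlyreturns)
setDT(monthlyreturns)
# `change` lives in monthlyreturns, so aggregate there once per ID first
agg <- monthlyreturns[, .(yret = prod(change) - 1, monthcount = .N), by = ID]
yearlyreturns[agg, c("yret", "monthcount") := .(i.yret, i.monthcount), on = .(ID)]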
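
To try the join idea on a small scale, here is a minimal sketch that aggregates the monthly table once per ID and then attaches the result to the yearly table with an update join. The column names `ID` and `change` follow the question's loop; the values are invented, and `change` is assumed to be 1 + the monthly return.

```r
library(data.table)

# Toy data (invented): `change` is assumed to be 1 + the monthly return
monthlyreturns <- data.table(
  ID     = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", "2 - 1997"),
  change = c(1.10, 0.90, 1.05, 1.02, 1.03)
)
yearlyreturns <- data.table(ID = c("1 - 1997", "2 - 1997"))

# Aggregate once per ID: product of monthly factors and number of months
agg <- monthlyreturns[, .(yret = prod(change) - 1, monthcount = .N), by = ID]

# Update join: copy the aggregated columns into yearlyreturns by reference
yearlyreturns[agg, c("yret", "monthcount") := .(i.yret, i.monthcount), on = "ID"]

print(yearlyreturns)
```

Aggregating before joining keeps the per-ID work in a single grouped pass, and the update join leaves the row order of `yearlyreturns` unchanged.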

Answer 3

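In addition to the excellent answers above, here's a link to an earlier post that compares 10 common ways of calculating means by group. Data.table-based solutions are definitely the way to go, especially for data sets with millions of rows. Unless you are writing to individual output files, I'm not sure why this would take hours rather than minutes.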
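
As a rough illustration of the gap between a per-ID loop and a grouped computation, the sketch below times both on simulated data. It uses base R's `tapply` rather than data.table so that it runs without extra packages. The data size (`n_ids`) and the simulated values are hypothetical, and timings will vary by machine.

```r
set.seed(1)
n_ids <- 2000  # hypothetical size, far below the 750,000 IDs in the question

# Simulated monthly factors: 12 rows per ID
monthlyreturns <- data.frame(
  ID     = rep(seq_len(n_ids), each = 12),
  change = runif(n_ids * 12, min = 0.9, max = 1.1)
)

# Loop in the style of the question: rescans the whole table for every ID
loop_time <- system.time({
  ids <- unique(monthlyreturns$ID)
  yret_loop <- numeric(length(ids))
  for (i in seq_along(ids)) {
    yret_loop[i] <- prod(monthlyreturns$change[monthlyreturns$ID == ids[i]]) - 1
  }
})

# Grouped approach: split `change` by ID once and take the product per group
grouped_time <- system.time({
  yret_grouped <- tapply(monthlyreturns$change, monthlyreturns$ID, prod) - 1
})

# Both approaches should agree on the result
stopifnot(isTRUE(all.equal(yret_loop, as.vector(yret_grouped))))

print(c(loop = loop_time[["elapsed"]], grouped = grouped_time[["elapsed"]]))
```

The loop's cost grows with the number of IDs times the number of rows, which is why it scales so poorly to the full data set.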

for-loop

r

aggregation