R: Pattern-matching financial time-series data with 2 large data sets
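My question may be a bit involved, so please bear with me.
I am working on the following case: I have two financial time-series data sets from two exchanges (New York and London).
The two data sets look like this:
London data set: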
Date time.second Price
2015-01-05 32417 238.2
2015-01-05 32418 238.2
2015-01-05 32421 238.2
2015-01-05 32422 238.2
2015-01-05 32423 238.2
2015-01-05 32425 238.2
2015-01-05 32427 238.2
2015-01-05 32431 238.2
2015-01-05 32435 238.47
2015-01-05 32436 238.47
New York data set:
NY.Date Time Price
2015-01-05 32416 1189.75
2015-01-05 32417 1189.665
2015-01-05 32418 1189.895
2015-01-05 32419 1190.15
2015-01-05 32420 1190.075
2015-01-05 32421 1190.01
2015-01-05 32422 1190.175
2015-01-05 32423 1190.12
2015-01-05 32424 1190.14
2015-01-05 32425 1190.205
2015-01-05 32426 1190.2
2015-01-05 32427 1190.33
2015-01-05 32428 1190.29
2015-01-05 32429 1190.28
2015-01-05 32430 1190.05
2015-01-05 32432 1190.04
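As you can see, there are 3 columns: date, time (in seconds), and price.
What I want to do is use the London data set as the reference and, for each of its rows, find the nearest-but-earlier item in the New York data set.
What do I mean by nearest but earlier? For example, for the row
"2015-01-01","21610","15.6871" in the London data set, I want the row in the New York data set that is on the same date and has the nearest time that is earlier than or equal to 21610. My current program may help make this clear: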
# I am trying to avoid using a for loop
# london_data and NY_data have 3 columns: date, time (seconds), price.
# NY_data is assumed to be sorted by time within each date.
temp <- matrix(0, nrow = nrow(london_data), ncol = 2) # NY time and price; 0 and 0 if nothing is found
for (i in seq_len(nrow(london_data))) { # for each row in the London data set
  tempRow <- london_data[i, ]
  dateMatch <- which(NY_data[, 1] == tempRow[[1]]) # select the same date
  dataNeeded <- NY_data[dateMatch, , drop = FALSE] # subset the same-date data
  # keep the NY rows whose time is at or before the London time
  Found <- dataNeeded[which(dataNeeded[, 2] <= tempRow[[2]]), , drop = FALSE]
  if (nrow(Found) > 0) {
    # the nearest earlier row is the final row of Found;
    # we only need "time" and "price", the 2nd and 3rd columns
    temp[i, ] <- unlist(Found[nrow(Found), 2:3])
  }
}
res <- cbind(london_data, temp)
colnames(res) <- c("LondonDate","LondonTime","LondonPrice","NYTime","NYPrice")
The correct output for the data sets listed above is **(partial only)**:
"LondonDate","LondonTime","LondonPrice","NYTime","NYPrice"
[1,] "2015-01-05" "32417" "238.2" "32417" "1189.665"
[2,] "2015-01-05" "32418" "238.2" "32418" "1189.895"
[3,] "2015-01-05" "32421" "238.2" "32421" "1190.01"
[4,] "2015-01-05" "32422" "238.2" "32422" "1190.175"
[5,] "2015-01-05" "32423" "238.2" "32423" "1190.12"
[6,] "2015-01-05" "32425" "238.2" "32425" "1190.205"
[7,] "2015-01-05" "32427" "238.2" "32427" "1190.33"
[8,] "2015-01-05" "32431" "238.2" "32430" "1190.05"
[9,] "2015-01-05" "32435" "238.47" "32432" "1190.04"
[10,] "2015-01-05" "32436" "238.47" "32432" "1190.04"
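The "nearest but earlier or equal" rule behind this output can be sketched for a single date with base R's findInterval(). This is only an illustration under the assumption that the New York times are sorted in ascending order; the vector names london_time, ny_time and ny_price are made up for the example, and the values are copied from the sample tables above.

```r
# Sample data for one date (2015-01-05), copied from the tables above.
london_time <- c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436)
ny_time  <- c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423,
              32424, 32425, 32426, 32427, 32428, 32429, 32430, 32432)
ny_price <- c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175, 1190.12,
              1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28, 1190.05, 1190.04)

# findInterval() gives, for each London time, the index of the last NY time
# that is <= it (0 when every NY time is later). ny_time must be sorted ascending.
idx <- findInterval(london_time, ny_time)
idx[idx == 0] <- NA  # no earlier NY row: report NA for that London row

matched <- data.frame(LondonTime = london_time,
                      NYTime     = ny_time[idx],
                      NYPrice    = ny_price[idx])
print(matched)
```

Because the lookup is vectorised, there is no per-row loop; with several dates it would have to be applied once per date.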
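My problem is that the London data set has more than 5,000,000 rows. I tried to avoid the for loop, but I still need at least one. The program above runs successfully, but it takes about 24 hours.
How can I avoid the for loop and speed up the program?
Any help would be greatly appreciated.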
Using data.table, based on @Jan Gorecki's comment.
Here is the solution:
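library(data.table)
# London sample (the reference table)
df1 <- data.table(Date=rep("05/01/2015", 10),
time.second=c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436),
Price=c(238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.47, 238.47))
# New York sample (the lookup table)
df2 <- data.table(NY.Date=rep("05/01/2015", 16),
Time=c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424, 32425, 32426, 32427, 32428, 32429, 32430, 32432),
Price=c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175, 1190.12, 1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28, 1190.05, 1190.04))
# Use the same join column names in both tables; keep the NY price distinct
setnames(df2, c("Date", "time.second", "NYPrice"))
# Key (sort) both tables by date, then time
setkey(df1, Date, time.second)
setkey(df2, Date, time.second)
# Keep a copy of the NY time: after the join, time.second holds the London time
df2[, NYTime:=time.second]
# Rolling join: for each df1 row, take the df2 row with the same Date
# and the last time.second at or before it
df3 <- df2[df1, roll=TRUE]
df3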
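As a further sketch, the same rolling join can be written with data.table's `on=` argument instead of keys. The miniature tables below are hypothetical (the names london and ny, and the 2015-01-06 rows, are invented for illustration); they are meant to show two things: the roll stays within the same Date, and a London row with no earlier New York observation on that date gets NA rather than a value from another day.

```r
library(data.table)

# Hypothetical miniature inputs covering two dates.
# The 2015-01-06 rows are invented to show the no-match case.
london <- data.table(Date = c("2015-01-05", "2015-01-05", "2015-01-06"),
                     time.second = c(32431, 32435, 100),
                     Price = c(238.2, 238.47, 239.0))
ny <- data.table(Date = c("2015-01-05", "2015-01-05", "2015-01-06"),
                 time.second = c(32430, 32432, 200),
                 NYPrice = c(1190.05, 1190.04, 1191.0))

# Keep a copy of the NY time, because the join column in the result
# carries the London time.
ny[, NYTime := time.second]

# For each London row, take the NY row with the same Date and the latest
# time.second at or before it (roll = TRUE carries the last observation forward).
res <- ny[london, on = .(Date, time.second), roll = TRUE]
print(res)
```

The third London row has no New York row at or before time 100 on 2015-01-06, so its NYTime and NYPrice come back as NA. If 0 is preferred, as in the original loop, the NA values can be replaced afterwards.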