How to implement a support vector machine in R
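I am new to machine learning (not a mathematician) and am teaching myself ML from videos and books. I have a basic understanding of algorithms such as naive Bayes, SVM and decision trees, and I use ML every day to model stock market returns. I want to use a non-linear regression algorithm, so I chose support vector regression because it is popular. I use the trading day and the EMA difference as the feature vector (X) and the price change as the label (Y). Below is my code: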
library("quantmod")
#Adding libraries
library("lubridate")
#Makes it easier to work with the dates
library("e1071")
#Gives us access to the svm
stockData <- new.env()
tickers <- 'AAPL'
startDate = as.Date("2015-11-01")
# The beginning of the date range we want to look at
symbol = getSymbols(tickers,from=startDate, auto.assign=F)
# Retrieving Apple’s daily OHLCV from Yahoo Finance
DayofWeek<-wday(symbol, label=TRUE)
#Find the day of the week
Class<- Cl(symbol) - Op(symbol)
#price change
EMA5<-EMA(Cl(symbol),n = 5)
#We are calculating a 5-period EMA off the close price
EMA10<-EMA(Cl(symbol),n = 10)
#Then the 10-period EMA, also off the close price
EMACross <- EMA5 - EMA10
#Positive values correspond to the 5-period EMA being above the 10-period EMA
EMACross<-round(EMACross,2)
DataSet2<-data.frame(DayofWeek,EMACross, Class)
DataSet2<-DataSet2[-c(1:10),]
#We need to remove the instances where the 10-period moving average is still being calculated
m<-nrow(DataSet2)
n<-round((nrow(DataSet2)*2)/3)
TrainingSet<-DataSet2[1:n,]
#We will use ⅔ of the data to train the model
TestSet<-DataSet2[(n+1):m,]
#And ⅓ to test it on unseen data
EMACrossModel<-svm( Cl(symbol) ~ ., data=TrainingSet)
summary(EMACrossModel)
pred<-predict(EMACrossModel,TestSet[,-3])
When I run the code above, I get this error:
> EMACrossModel<-svm( Cl(symbol) ~ ., data=TrainingSet)
Error in model.frame.default(formula = Cl(symbol) ~ ., data = TrainingSet, :
variable lengths differ (found for 'DayofWeek')
So my questions are (please excuse me for asking more than one):
1) How do I solve the problem above?
2) Can I use both qualitative (e.g. Mon, Tue, Wed) and quantitative (e.g. 1.0, 0.1, 100) data together in SVM regression?
3) How can I plot my results with the SVM decision boundaries?
Edited
DataSet2
DayofWeek EMA AAPL.Close
2015-11-16 Mon -2.77 2.800003
2015-11-17 Tues -2.51 -1.229996
2015-11-18 Wed -1.67 1.529999
2015-11-19 Thurs -0.89 1.140000
2015-11-20 Fri -0.32 0.100006
2015-11-23 Mon -0.23 -1.519997
2015-11-24 Tues 0.00 1.549995
2015-11-25 Wed 0.00 -1.180000
2015-11-27 Fri -0.03 -0.480003
2015-11-30 Mon 0.02 0.310005
2015-12-01 Tues -0.09 -1.410004
2015-12-02 Wed -0.31 -1.059997
2015-12-03 Thurs -0.57 -1.350006
2015-12-04 Fri -0.10 3.739998
2015-12-07 Mon 0.05 -0.700004
2015-12-08 Tues 0.12 0.710006
2015-12-09 Wed -0.24 -2.019996
2015-12-10 Thurs -0.35 0.129997
2015-12-11 Fri -0.83 -2.010002
2015-12-14 Mon -1.15 0.300003
2015-12-15 Tues -1.56 -1.450004
2015-12-16 Wed -1.56 0.269996
2015-12-17 Thurs -1.82 -3.039994
2015-12-18 Fri -2.30 -2.880005
2015-12-21 Mon -2.23 0.050003
2015-12-22 Tues -2.07 -0.169999
2015-12-23 Wed -1.64 1.340004
2015-12-24 Thurs -1.40 -0.970001
2015-12-28 Mon -1.37 -0.769996
2015-12-29 Tues -0.98 1.779999
2015-12-30 Wed -0.92 -1.260002
The following modified code runs but gives a different answer.
These are the modifications:
EMACrossModel<-ksvm( Cl(symbol[1:n]) ~ ., data=TrainingSet,kernel="rbfdot",C=10) #kernlab libraries
pred<-predict(EMACrossModel,TestSet)
Results
> EMACrossModel
Support Vector Machine object of class "ksvm"
SV type: eps-svr (regression)
parameter : epsilon = 0.1 cost C = 10
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.294836572886287
Number of Support Vectors : 17
Objective Function Value : -49.1082
Training error : 0.138329
> pred
[,1]
[1,] 119.7267
[2,] 119.9733
[3,] 120.7236
[4,] 121.8324
[5,] 121.5632
[6,] 121.4652
[7,] 119.6438
[8,] 119.6962
[9,] 119.0775
[10,] 116.4956
I expected the predictions to look like this:
[,1]
-1.327996
1.229939
-1.130000
0.100006
-1.519997
-0.480003
1.310005
-1.410004
-1.059997
1.350006
-2.739998
1.700004
My guess is that my current code takes the stock price rather than the price change as Y and uses it to build EMACrossModel. Am I right? If so, how do I fix this?
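Regarding question one
You formed your training set by removing some of the data. However, you did not restrict your symbol set in the same way: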
EMACrossModel<-svm( Cl(symbol[1:n]) ~ ., data=TrainingSet)
I just realized that what you more likely want is:
EMACrossModel<-svm( AAPL.Close ~ ., data=TrainingSet)
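In general, the left-hand side of the formula
Cl(symbol[1:n]) ~ .
defines what is learned. Currently that is the closing price taken from "symbol", which is why the predictions are price levels around 120. However, I assume you want to predict the column AAPL.Close.
Formulas are a general concept in R (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It is worth taking some time to understand them.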
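To see where the original error comes from, here is a minimal sketch in base R only. The data frame `train` and the vector `outside` are made-up stand-ins for TrainingSet and Cl(symbol), not the real data. A name on the left-hand side that is not a column of `data` is looked up in the calling environment, and if its length differs from the number of rows, model.frame stops with a "variable lengths differ" error.

```r
# Hypothetical toy data standing in for TrainingSet (values are illustrative)
train <- data.frame(
  DayofWeek  = factor(c("Mon", "Tues", "Wed", "Thurs", "Fri", "Mon")),
  EMA        = c(-2.77, -2.51, -1.67, -0.89, -0.32, -0.23),
  AAPL.Close = c(2.80, -1.23, 1.53, 1.14, 0.10, -1.52)
)

# A vector that lives outside the data frame and is longer than nrow(train),
# playing the role of Cl(symbol)
outside <- seq_len(10)

# The left-hand side is not a column of `train`, so R finds it in the
# calling environment and then notices that the lengths do not match
err <- tryCatch(
  model.frame(outside ~ ., data = train),
  error = function(e) conditionMessage(e)
)
print(err)

# The left-hand side is a column of `train`: response and predictors
# now come from the same rows
mf <- model.frame(AAPL.Close ~ ., data = train)
print(names(mf))
```

Naming a column of the data frame on the left-hand side also keeps the response out of the predictors, because `.` expands to every column that is not already used in the formula.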
Edit
Judging by your comment above, this seems to be confirmed. The results are:
-0.1926745
0.3578645
0.1830046
0.6362871
-0.3760084
-0.1443156
0.2615674
0.2589130
-0.4779677
-0.5928780
End of edit
Regarding question two: it depends on the implementation (and the kernel), but here it seems to work.
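As a rough illustration of how a formula interface can mix both kinds of data, the sketch below uses base R's model.matrix to expand a factor into 0/1 indicator columns next to a numeric column. As far as I know, the formula method of svm in e1071 builds its input matrix in a similar way, but treat that as an assumption and check the package documentation. The data frame `toy` is made up for the example.

```r
# Hypothetical toy data: one qualitative and one quantitative feature
toy <- data.frame(
  DayofWeek = factor(c("Mon", "Tues", "Wed", "Mon", "Tues", "Wed")),
  EMA       = c(-2.77, -2.51, -1.67, -0.23, 0.00, 0.00)
)

# Expand the factor into indicator (dummy) columns; "- 1" drops the
# intercept so that every weekday level gets its own 0/1 column
X <- model.matrix(~ DayofWeek + EMA - 1, data = toy)
print(colnames(X))
print(X)
```

Note that an ordered factor (which I believe is what lubridate's wday(label = TRUE) returns) is coded with polynomial contrasts by default rather than plain 0/1 columns, so converting it with factor(x, ordered = FALSE) may be preferable if indicator columns are wanted.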
Regarding your third question: the e1071 package contains an example:
data(cats, package = "MASS")
m <- svm(Sex~., data = cats)
plot(m, cats)
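Edit
I just realized that this plot function only works for classifiers, not for regression. However, you can easily build your own plotting function. For simplicity, I first convert the day of the week to a number.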
DataSet2$DayofWeek <- as.numeric(DataSet2$DayofWeek)
and rebuild the model on the converted data. After that you can visualize it with:
### Plot the results of the support vector machine by
# first generating a grid covering the data range.
# Generate a sequence of 100 numbers between the minimum and maximum of DataSet2$EMA
plot.ema.vec <- seq(min(DataSet2$EMA), max(DataSet2$EMA), length.out = 100)
# Generate a "grid" of artificial data points; 1:7 are the weekdays
# (can be replaced by c("Mon",...,"Sun") if DayofWeek is kept as a factor)
datagrid <- expand.grid(1:7, plot.ema.vec)
# Name the grid columns like the dataset so that the model can use the grid as input
names(datagrid) <- names(DataSet2[,1:2])
# Calculate the model's predictions on the grid
grid.pred <- predict(EMACrossModel, datagrid)
# Normalise the predictions to the [0,1] range to use them as colors
cols <- (grid.pred-min(grid.pred))/(max(grid.pred)-min(grid.pred))
# Plot the predictions over the grid: red = lowest predicted value, blue = highest
plot(datagrid$DayofWeek, datagrid$EMA, col=rgb(blue=cols,red=1-cols,green=0))