Error calculating r squared with statsmodels for multiple yfinance data in a DataFrame

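I recently started learning Python, and I am already using it for a complex project that I originally began in Excel. So far I have followed different guides for the code I use and adapted them to my needs.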

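I am using 'yfinance' to collect data from Yahoo! Finance for multiple cryptocurrencies over a specific time period. I then use 'statsmodels' to get alpha, beta and r squared, working from a DataFrame built from all the cryptocurrencies plus an additional column with the mkt. return (the x variable).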

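I am getting the following error: ValueError: endog and exog matrices are different sizes. I saw another question/answer about this error, but it does not seem to relate to my problem.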

The error occurs on line 87 of the following code [model = sm.OLS(Y2,X_)]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime

from pandas_datareader import data as pdr
import yfinance as yf

yf.pdr_override()

df1 = pdr.get_data_yahoo("BTC-USD", start="2015-01-01", end="2020-01-01")
df2 = pdr.get_data_yahoo("ETH-USD", start="2015-01-01", end="2020-01-01")
df3 = pdr.get_data_yahoo("XRP-USD", start="2015-01-01", end="2020-01-01")
df4 = pdr.get_data_yahoo("BCH-USD", start="2015-01-01", end="2020-01-01")
df5 = pdr.get_data_yahoo("USDT-USD", start="2015-01-01", end="2020-01-01")
df6 = pdr.get_data_yahoo("BSV-USD", start="2015-01-01", end="2020-01-01")
df7 = pdr.get_data_yahoo("LTC-USD", start="2015-01-01", end="2020-01-01")
df8 = pdr.get_data_yahoo("BNB-USD", start="2015-01-01", end="2020-01-01")
df9 = pdr.get_data_yahoo("EOS-USD", start="2015-01-01", end="2020-01-01")
df10 = pdr.get_data_yahoo("LINK-USD", start="2015-01-01", end="2020-01-01")
df11 = pdr.get_data_yahoo("XMR-USD", start="2015-01-01", end="2020-01-01")
df12 = pdr.get_data_yahoo("BTG-USD", start="2015-01-01", end="2020-01-01")

return_btc = df1.Close.pct_change()[1:]
return_eth = df2.Close.pct_change()[1:]
return_xrp = df3.Close.pct_change()[1:]
return_bch = df4.Close.pct_change()[1:]
return_usdt = df5.Close.pct_change()[1:]
return_bsv = df6.Close.pct_change()[1:]
return_ltc = df7.Close.pct_change()[1:]
return_bnb = df8.Close.pct_change()[1:]
return_eos = df9.Close.pct_change()[1:]
return_link = df10.Close.pct_change()[1:]
return_xmr = df11.Close.pct_change()[1:]
return_btg = df12.Close.pct_change()[1:]

d = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch, 
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb, 
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg}

df = pd.DataFrame(d) # new data frame with all returns data

df = pd.DataFrame(d, columns=["Date", "BTC Return", "ETH Return", "XRP Return", "BCH Return", "USDT Return", "BSV Return", 
"LTC Return", "BNB Return", "EOS Return", "LINK Return", "XMR Return", "BTG Return"])

avg_row = df.mean(axis=1)
return_mkt = avg_row

d1 = {"BTC Return":return_btc, "ETH Return":return_eth, "XRP Return":return_xrp, "BCH Return":return_bch, 
"USDT Return":return_usdt, "BSV Return":return_bsv, "LTC Return":return_ltc, "BNB Return":return_bnb, 
"EOS Return":return_eos, "LINK Return":return_link, "XMR Return":return_xmr, "BTG Return":return_btg, "MKT Return":return_mkt}
df = pd.DataFrame(d1)
print(df)

import statsmodels.api as sm
from statsmodels import regression

X = return_mkt.values
Y1 = return_btc
Y2 = return_eth
#Y3 = return_xrp

def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()

    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]

X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y1,X_)
results = model.fit()
rsquared = results.rsquared

alpha, beta = linreg(X,Y1)

def linreg(x,y):
    x = sm.add_constant(x)
    model = regression.linear_model.OLS(y,x).fit()

    # we are removing the constant
    x = x[:, 1]
    return model.params[0], model.params[1]

X_ = sm.add_constant(X) # artificially add intercept to x, as advised in the docs
model = sm.OLS(Y2,X_)
results = model.fit()
rsquared = results.rsquared

alpha, beta = linreg(X,Y2)

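The error is in the second def, because I am trying to calculate the statistics mentioned above for each cryptocurrency. So the first def is for BTC (Y1), the second def is for ETH (Y2), and so on (Y3, ...).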

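When I only had the function for BTC at the end, the whole code ran without problems. The error appeared when I tried to add more copies of the same function for the other coins.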

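Fundamentally, the problem is that Ethereum (and all the other cryptocurrencies) started later than Bitcoin, so for the first few years the price is null on every day, and the regression cannot handle that. In your code this shows up as a size mismatch: return_eth has fewer rows than return_mkt, which covers every date in the combined DataFrame. So you have to take only the values that are not null.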
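As a small illustration of that mismatch, here is a sketch with made-up numbers rather than Yahoo! data. The variable names mirror the question, but the values and dates are assumptions chosen only so that "ETH" starts two days after "BTC".

```python
import pandas as pd

# Synthetic daily returns (invented numbers, not real prices).
# "BTC" covers all 6 days; "ETH" only starts on day 3, like a coin listed later.
days = pd.date_range("2015-01-01", periods=6, freq="D")
return_btc = pd.Series([0.01, -0.02, 0.03, 0.00, 0.02, -0.01], index=days)
return_eth = pd.Series([0.05, -0.04, 0.01, 0.02], index=days[2:])

# Building a DataFrame aligns both series on the union of their dates,
# so ETH gets NaN for the days before it existed.
df = pd.DataFrame({"BTC": return_btc, "ETH": return_eth})
return_mkt = df.mean(axis=1)  # row mean; NaN values are skipped by default

print(len(return_mkt), len(return_eth))  # 6 4, so the sizes differ

# Keep only the dates on which ETH actually has a value.
mask = df["ETH"].notnull()
x = return_mkt[mask]
y = df.loc[mask, "ETH"]
print(len(x), len(y))  # 4 4, so x and y now line up
```

Filtering x and y with the same mask is what keeps the two inputs the same length and on the same dates.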

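However, there is a lot in your code that can be factored out so that you do not repeat yourself unnecessarily. You tried to do this with the linreg function, but then you redefined it for the second crypto, which is not necessary.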

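Here is a quick rewrite that fixes the underlying problem and hopefully illustrates what I mean above. The output is a DataFrame containing the statistics you are looking for, broken down by cryptocurrency. The goal is to write as much of the code as possible 'generically', and then just supply the list of cryptocurrencies you are interested in.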

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas_datareader import data as pdr
import datetime
import yfinance as yf
import statsmodels.api as sm
from statsmodels import regression

yf.pdr_override()

cryptos = ["BTC", "ETH", "XRP"]  # Here you can specify the cryptos you want. I just used 3 for demonstration
                                 # The rest of the code is not specific to any one crypto

def get_and_process_data(c):
    raw_data = pdr.get_data_yahoo(c + '-USD', start="2015-01-01", end="2020-01-01")
    return raw_data.Close.pct_change()[1:]

df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})


df['avg_return'] = df.mean(axis=1) # avg market return
print(df)

def model(x, y):
    # Fit y = alpha + beta * x once; the same fit provides all three statistics
    X = sm.add_constant(x) # artificially add intercept to x, as advised in the docs
    fit = sm.OLS(y, X).fit()

    rsquared = fit.rsquared
    alpha = fit.params.iloc[0]  # intercept (the added constant)
    beta = fit.params.iloc[1]   # slope on the market return

    return rsquared, alpha, beta


results = pd.DataFrame({c: model(df[df[c].notnull()]['avg_return'], df[df[c].notnull()][c]) for c in cryptos}).transpose()
results.columns = ['rsquared', 'alpha', 'beta']
print(results)