OLS 中弃用的滚动 window 选项从 Pandas 到 Statsmodels
Deprecated rolling window option in OLS from Pandas to Statsmodels
如标题所示,Pandas中ols命令中的滚动功能选项迁移到statsmodels中的哪里了?我好像找不到。
Pandas 告诉我厄运即将来临:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
事实上,如果你这样做:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
你得到了结果(window 不影响代码的 运行 宁)但是你只得到了整个时期回归 运行 的参数,而不是系列它应该工作的每个滚动周期的参数。
使用 sklearn 进行滚动测试
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
assert len(X)==len(y)
out_dates = []
out_beta = []
model_ols = linear_model.LinearRegression()
for iStart in range(0, len(X)-window):
iEnd = iStart+window
model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
#store output
out_dates.append(idx[iEnd])
out_beta.append(model_ols.coef_[0][0])
return pd.DataFrame({'beta':out_beta}, index=out_dates)
df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
为完整性添加一个更快的 numpy
-only 解决方案,该解决方案仅将计算限制在回归系数和最终估计值
Numpy 滚动回归函数
import numpy as np
def rolling_regression(y, x, window=60):
"""
y and x must be pandas.Series
"""
# === Clean-up ============================================================
x = x.dropna()
y = y.dropna()
# === Trim acc to shortest ================================================
if x.index.size > y.index.size:
x = x[y.index]
else:
y = y[x.index]
# === Verify enough space =================================================
if x.index.size < window:
return None
else:
# === Add a constant if needed ========================================
X = x.to_frame()
X['c'] = 1
# === Loop... this can be improved ====================================
estimate_data = []
for i in range(window, x.index.size+1):
X_slice = X.values[i-window:i,:] # always index in np as opposed to pandas, much faster
y_slice = y.values[i-window:i]
coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
estimate_data.append(coeff[0] * x.values[window-1] + coeff[1])
# === Assemble ========================================================
estimate = pandas.Series(data=estimate_data, index=x.index[window-1:])
return estimate
备注
在某些特定情况下使用,只需要回归的最终估计,x.rolling(window=60).apply(my_ols)
显得有些慢
提醒一下,回归系数可以计算为矩阵乘积,您可以在 wikipedia's least squares page 上阅读。这种通过 numpy
的矩阵乘法的方法与使用 statsmodels
中的 ols 相比可以稍微加快处理速度。此产品在以 coeff = ...
开头的行中表示
我创建了一个 ols
模块,旨在模仿 pandas' 已弃用 MovingOLS
;它是 here.
三核类:
OLS
:静态(单window)普通最小二乘回归。输出是 NumPy 数组
RollingOLS
:滚动(多window)普通最小二乘回归。输出是高维 NumPy 数组。
PandasRollingOLS
:将 RollingOLS
的结果包装在 pandas 系列和数据帧中。旨在模仿已弃用的 pandas 模块的外观。
请注意,该模块是 package 的一部分(我目前正在将其上传到 PyPi),它需要一个包间导入。
上面的前两个 类 完全在 NumPy 中实现,主要使用矩阵代数。 RollingOLS
也广泛利用广播。属性在很大程度上模仿了 statsmodels 的 OLS RegressionResultsWrapper
.
一个例子:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
对于一栏的滚动趋势,可以直接使用:
import numpy as np
def calc_trend(window:int = 30):
df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)
但是,在我的例子中,我浪费了时间来寻找与日期相关的趋势,因为日期在另一列中。我必须手动创建功能,但这很容易。首先,将 TimeDate 转换为 int64,表示从 t_0:
开始的天数
xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
然后:
def calc_trend(window:int=30):
for t in range(len(df)):
if t < window//2:
continue
i0 = t - window//2 # Start window
i1 = i0 + window # End window
xvec = xdays[i0:i1]
yvec = df['column_name'][i0:i1].values
df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]
如标题所示,Pandas中ols命令中的滚动功能选项迁移到statsmodels中的哪里了?我好像找不到。 Pandas 告诉我厄运即将来临:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
事实上,如果你这样做:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
你得到了结果(window 不影响代码的 运行 宁)但是你只得到了整个时期回归 运行 的参数,而不是系列它应该工作的每个滚动周期的参数。
使用 sklearn 进行滚动测试
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
assert len(X)==len(y)
out_dates = []
out_beta = []
model_ols = linear_model.LinearRegression()
for iStart in range(0, len(X)-window):
iEnd = iStart+window
model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
#store output
out_dates.append(idx[iEnd])
out_beta.append(model_ols.coef_[0][0])
return pd.DataFrame({'beta':out_beta}, index=out_dates)
df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
为完整性添加一个更快的 numpy
-only 解决方案,该解决方案仅将计算限制在回归系数和最终估计值
Numpy 滚动回归函数
import numpy as np
def rolling_regression(y, x, window=60):
"""
y and x must be pandas.Series
"""
# === Clean-up ============================================================
x = x.dropna()
y = y.dropna()
# === Trim acc to shortest ================================================
if x.index.size > y.index.size:
x = x[y.index]
else:
y = y[x.index]
# === Verify enough space =================================================
if x.index.size < window:
return None
else:
# === Add a constant if needed ========================================
X = x.to_frame()
X['c'] = 1
# === Loop... this can be improved ====================================
estimate_data = []
for i in range(window, x.index.size+1):
X_slice = X.values[i-window:i,:] # always index in np as opposed to pandas, much faster
y_slice = y.values[i-window:i]
coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
estimate_data.append(coeff[0] * x.values[window-1] + coeff[1])
# === Assemble ========================================================
estimate = pandas.Series(data=estimate_data, index=x.index[window-1:])
return estimate
备注
在某些特定情况下使用,只需要回归的最终估计,x.rolling(window=60).apply(my_ols)
显得有些慢
提醒一下,回归系数可以计算为矩阵乘积,您可以在 wikipedia's least squares page 上阅读。这种通过 numpy
的矩阵乘法的方法与使用 statsmodels
中的 ols 相比可以稍微加快处理速度。此产品在以 coeff = ...
我创建了一个 ols
模块,旨在模仿 pandas' 已弃用 MovingOLS
;它是 here.
三核类:
OLS
:静态(单window)普通最小二乘回归。输出是 NumPy 数组RollingOLS
:滚动(多window)普通最小二乘回归。输出是高维 NumPy 数组。PandasRollingOLS
:将RollingOLS
的结果包装在 pandas 系列和数据帧中。旨在模仿已弃用的 pandas 模块的外观。
请注意,该模块是 package 的一部分(我目前正在将其上传到 PyPi),它需要一个包间导入。
上面的前两个 类 完全在 NumPy 中实现,主要使用矩阵代数。 RollingOLS
也广泛利用广播。属性在很大程度上模仿了 statsmodels 的 OLS RegressionResultsWrapper
.
一个例子:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
对于一栏的滚动趋势,可以直接使用:
import numpy as np
def calc_trend(window:int = 30):
df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)
但是,在我的例子中,我浪费了时间来寻找与日期相关的趋势,因为日期在另一列中。我必须手动创建功能,但这很容易。首先,将 TimeDate 转换为 int64,表示从 t_0:
开始的天数xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
然后:
def calc_trend(window:int=30):
for t in range(len(df)):
if t < window//2:
continue
i0 = t - window//2 # Start window
i1 = i0 + window # End window
xvec = xdays[i0:i1]
yvec = df['column_name'][i0:i1].values
df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]