How to apply the Shapiro-Wilk test to a specific data column in Python
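I want to apply this test to the daily percentage returns of SPY. After fetching the historical data for this symbol from Yahoo, I compute the daily percentage returns (as you can see in the code below). But when I apply the test, the p-value is always "1.00" and the returned statistic is always "nan", no matter whether I change the date range or the symbol (for example, QQQ instead of SPY).
Below you can see the code I am using: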
from datetime import date
import pandas_datareader as dr
from scipy.stats import shapiro
df = dr.data.get_data_yahoo('spy',start='2010-01-01',end='2015-01-01')
df['PCT'] = df['Close'].pct_change()
stat, p = shapiro(df['PCT'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
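Unfortunately, I have tried different approaches but could not find a solution, and I am stuck. Any ideas on how to apply the test correctly to the PCT column data? Any help would be very welcome. Thanks!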
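Step 1: "when I apply the test the P value is always “1.00” and the return of the stats is always “nan”"
No, sir, it need not be. Once the leading NaN is excluded from the input, the same call returns finite values: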
print( 'Statistics\n(W)= %e,\n p = %e' % ( stat, p ) ) # will produce:
...
(W)= 9.438160e-01
p = 1.909053e-21
The core issue is how pct_change() actually works:
>>> df['PCT'] = df['Close'].pct_change() # this computes & stores .pct_change()
>>> df # read print( df['Close'].pct_change.__doc__ )
High ... Close Volume Adj Close PCT
Date
2010-01-04 113.389999 ... 113.330002 118944600.0 93.675278 NaN
2010-01-05 113.680000 ... 113.629997 111579900.0 93.923241 0.002647
2010-01-06 113.989998 ... 113.709999 116074400.0 93.989357 0.000704
2010-01-07 114.330002 ... 114.190002 131091100.0 94.386139 0.004221
2010-01-08 114.620003 ... 114.570000 126402800.0 94.700218 0.003328
...
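To see the leading NaN without downloading anything, here is a minimal sketch on a made-up price series. The prices are illustrative values, not market data.

```python
import pandas as pd

# Hypothetical closing prices, chosen only to keep the arithmetic easy to follow.
close = pd.Series([100.0, 102.0, 101.0, 104.0], name="Close")

# pct_change() uses periods=1 by default: each row is compared with the row before it.
pct = close.pct_change()

print(pct)
print("first value is NaN:", pd.isna(pct.iloc[0]))
print("NaN count:", int(pct.isna().sum()))
```

The first row has no predecessor, so its percentage change is undefined; every later row gets a regular value.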
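Obviously, since periods == 1 (the default), the cell df['PCT'][0] is, and must be, NaN: the first row has no previous row to compare against.
So call W_stat, p_value = shapiro( df['PCT'][1:] ) so that the input contains no values that are meaningless w.r.t. shapiro().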
print( shapiro.__doc__ ) # for more details
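The test compares the sample against a normal distribution, and that comparison is only defined when the sample contains no NaNs. With a NaN in the input, W cannot be computed and comes back as nan; the accompanying p == 1 is an artifact of that failed computation, not a verdict on normality. (Read at face value, p == 1 would mean there is no evidence at all against the null hypothesis, which is the opposite of what the clean data show: with p on the order of 1e-21, normality of the daily returns is firmly rejected.)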
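A minimal sketch of the effect and the fix, using synthetic returns instead of Yahoo data. The seed, sample size, and scale are arbitrary assumptions. Series.dropna() is used here as an alternative to the [1:] slice; it removes the same leading NaN and any other missing values as well. The p-value printed for the NaN case may differ between SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic "returns": normally distributed noise, not real market data.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(loc=0.0, scale=0.01, size=500))

# Mimic what pct_change() produces: a NaN in the first position.
pct = returns.copy()
pct.iloc[0] = np.nan

# With the NaN included, the W statistic cannot be computed.
w_bad, p_bad = shapiro(pct.to_numpy())
print("with NaN:    W =", w_bad, " p =", p_bad)

# With the NaN dropped, the test returns finite values.
w_ok, p_ok = shapiro(pct.dropna().to_numpy())
print("without NaN: W =", w_ok, " p =", p_ok)
```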
The same holds for { SPY | QQQ | AAPL | AMZN | ... }:
>>> shapiro( dr.data.get_data_yahoo( 'SPY',
start = '2010-01-01',
end = '2015-01-01'
)['Close'].pct_change()[1:]
)
(0.943816065788269, 1.9090532861060437e-21)
>>> shapiro( dr.data.get_data_yahoo( 'QQQ',
start = '2010-01-01',
end = '2015-01-01'
)['Close'].pct_change()[1:]
)
(0.9631340503692627, 2.548133516564297e-17)
>>> shapiro( dr.data.get_data_yahoo( 'AAPL',
start = '2010-01-01',
end = '2015-01-01'
)['Close'].pct_change()[1:]
)
(0.9560988545417786, 5.674560331738808e-19)
>>> shapiro( dr.data.get_data_yahoo( 'AMZN',
start = '2010-01-01',
end = '2015-01-01'
)['Close'].pct_change()[1:]
)
(0.9394155740737915, 3.106424182886848e-22)
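If the same check is repeated for several symbols, the NaN handling can be wrapped in a small helper. This is only a sketch: shapiro_on_returns is a hypothetical name, and the price series below are random walks generated for illustration, standing in for the ['Close'] column of a downloaded frame.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro


def shapiro_on_returns(close):
    """Run Shapiro-Wilk on the percentage returns of a price Series.

    Hypothetical helper: drops the NaN that pct_change() puts in the first row.
    """
    returns = close.pct_change().dropna()
    w_stat, p_value = shapiro(returns.to_numpy())
    return float(w_stat), float(p_value)


# Synthetic random-walk prices standing in for downloaded 'Close' columns.
rng = np.random.default_rng(42)
prices = {
    name: pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=250))))
    for name in ("A", "B")
}

results = {name: shapiro_on_returns(close) for name, close in prices.items()}
for name, (w_stat, p_value) in results.items():
    print(f"{name}: W = {w_stat:.4f}, p = {p_value:.3e}")
```

With real data, the Series passed in would be the ['Close'] column returned by the data reader for each symbol.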