如何替换pandas中的离群数据?

How to replace outlier data in pandas?

我有一个从雅虎财经抓取的股票数据,调整后的收盘数据不知何故是错误的。

            adj_close    close     ratio
date                                    
2014-10-16   240.4076  2466.40  0.097473
2014-10-17   245.8173  2521.90  0.097473
2014-10-20   250.4522  2569.45  0.097473
2014-10-21   251.8850  2584.15  0.097473
2014-10-22   251.0175  2575.25  0.097473
2014-10-23   251.3392  2578.55  0.097473
2014-10-27   253.2155  2597.80  0.097473
2014-10-28   258.9616  2656.75  0.097473
2014-10-29   257.6944  2643.75  0.097473
2014-10-30   257.1339  2638.00  0.097473
2014-10-31    26.3450  2702.80  0.009747
2014-11-03    26.5463  2723.45  0.009747
2014-11-05    27.1160  2781.90  0.009747
2014-11-07    26.7320  2742.50  0.009747
2014-11-10    26.7027  2739.50  0.009747

这是调整后的收盘数据图:

如何使用插值等方法替换这样的数据?

试试这个:

In [71]: import pandas_datareader.data as web

In [110]: df = web.DataReader('SBIN.NS', 'yahoo', '2014-10-21', '2014-11-25')

In [111]: df
Out[111]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200    26.3450   # <bad_data>
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600    26.5463
2014-11-04   272.3450   272.3450   272.3450   272.3450         0    26.5463
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100    27.1160
2014-11-06   278.1900   278.1900   278.1900   278.1900         0    27.1160
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000    26.7320
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800    26.7027
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900    26.7110
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200    26.7666
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000    26.5059
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600    27.1740
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100    28.6629
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300    28.7131
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900    28.3695   # </bad_data>
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

注意:Adj Close 列从 2014-11-20 开始已 恢复 ,其他列 - ,所以我将只关注 Adj Close:

让我们找出离群值(我正在检查前一天 50+% 发生变化的异常值 - 您可能想要更改此阈值):

In [112]: bad_idx = df.index[df['Adj Close'].pct_change().abs().ge(0.5)]

In [113]: bad_idx
Out[113]: DatetimeIndex(['2014-10-31', '2014-11-20'], dtype='datetime64[ns]', name='Date', freq=None)

In [114]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] *= 10

In [115]: df
Out[115]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200   263.4500
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600   265.4630
2014-11-04   272.3450   272.3450   272.3450   272.3450         0   265.4630
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100   271.1600
2014-11-06   278.1900   278.1900   278.1900   278.1900         0   271.1600
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000   267.3200
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800   267.0270
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900   267.1100
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200   267.6660
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000   265.0590
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600   271.7400
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100   286.6290
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300   287.1310
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900   283.6950
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

这是另一个使用插值的解决方案:

In [119]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] = np.nan

In [120]: df
Out[120]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200        NaN
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600        NaN
2014-11-04   272.3450   272.3450   272.3450   272.3450         0        NaN
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100        NaN
2014-11-06   278.1900   278.1900   278.1900   278.1900         0        NaN
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000        NaN
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800        NaN
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900        NaN
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200        NaN
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000        NaN
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600        NaN
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100        NaN
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300        NaN
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900        NaN
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

In [122]: df['Adj Close'] = df['Adj Close'].interpolate()

In [123]: df
Out[123]:
                 Open       High        Low      Close    Volume   Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300  251.885000
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100  251.017500
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200  251.339200
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0  251.339200
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500  253.215500
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400  258.961600
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900  257.694400
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200  257.133900
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200  259.297807
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600  261.461713
2014-11-04   272.3450   272.3450   272.3450   272.3450         0  263.625620
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100  265.789527
2014-11-06   278.1900   278.1900   278.1900   278.1900         0  267.953433
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000  270.117340
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800  272.281247
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900  274.445153
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200  276.609060
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000  278.772967
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600  280.936873
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100  283.100780
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300  285.264687
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900  287.428593
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500  289.592500
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200  297.780200
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400  301.045600
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600  296.756800

问题不完全是数据...问题是您还没有了解这里的市场基本面。因此,与其将其视为数学问题 ("replace outliers"),不如将其视为数据 sourcing/cleaning 问题(修复数据)。

您正在查看的代码是 SBIN.NS(印度国家银行)。它在 2014 年 11 月 21 日进行了 1:10 拆分,如此处报道:http://articles.economictimes.indiatimes.com/2014-11-20/news/56304010_1_india-gains-state-bank-stock-split

您显示的图表清楚地表明雅虎数据在该日期前后出现了问题。

所以发生了什么?

第一个中断发生在 2014 年 10 月 31 日,当时雅虎显示价格跌至 1:10。这显然是一个错误。我猜想他们的自动公司行为解析器在该日期收到了待处理 1:10 拆分的通知,并立即应用它,而不是记录日期 2014-11-21。因此,从 2014-10-31 到 2014-11-20(含),您的价格错误了 10 倍。

在这种情况下,除非您可以从其他来源获取数据,否则最好的解决方案是简单地记下此错误并将错误日期的雅虎价格乘以 10。