How to find where two columns differ the most or contain outliers in Python
I have these two arrays:
(two random example arrays I made up)
x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]
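The two columns are correlated, and x[4] and y[4] are the outliers in this data.
How would I go through the DataFrame and return the columns or column numbers where the outliers are?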
Edit:
Apologies. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame({'x':x, 'y':y})
Maybe this is too simple, but it seems to meet the requirement:
x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]
d = [abs(_x - _y) for _x, _y in zip(x, y)]
i = d.index(max(d))
print(x[i], y[i])
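Since the question's edit puts the data in a DataFrame, the same idea can be sketched with pandas. This is only an illustration: the names `diff` and `worst` are mine, not from the question, and like the list version it reports just the single largest gap, which need not be an outlier in other data.

```python
import pandas as pd

x = [5, 12, 24, 44, 22, 32, 22]
y = [8, 14, 26, 47, 44, 35, 23]
df = pd.DataFrame({'x': x, 'y': y})

# absolute difference between the two columns, row by row
diff = (df['x'] - df['y']).abs()

# index label of the row with the largest difference
worst = diff.idxmax()
print(worst, df.loc[worst].tolist())
```

Here `idxmax` returns the index label rather than the values, which is closer to what the question asks for.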
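There is not one outlier removal method, there are dozens.
I would say that, given the linear relationship between x and y, it is best to plot the data first and then make a reasoned decision on how to remove the outliers.
The relationship here is clearly linear. I used robust linear regression with scipy.stats.siegelslopes to get a robust fit line.
I sketched two outlier removal criteria here: within ±10% of the fitted slope, and within ±10 times the median difference. By comparison, the (valid) method proposed by @MichaelSzczesny would be equivalent to the one in the right-hand panel with a threshold of ~15 (I used 6).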
# df is the DataFrame defined in the question
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import siegelslopes
f, (ax1, ax2) = plt.subplots(ncols=2)
xs = np.arange(0, 50)
slope, intercept = siegelslopes(df['y'], df['x'])
ax1.plot(xs, slope*xs+intercept, ls='--')
ax2.plot(xs, slope*xs+intercept, ls='--')
### variation of slope
# keep points where log10(y / fitted y) < 0.1, i.e. y below about 1.26x the fitted value
df1 = df[np.log10(df['y']/(df['x']*slope+intercept)).lt(0.1)]
df1.plot.scatter('x', 'y', c='k', ax=ax1)
# plot ± 10%
ax1.plot(xs, slope*1.1*xs+intercept, c='grey', ls=':')
ax1.plot(xs, slope*0.9*xs+intercept, c='grey', ls=':')
# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax1)
ax1.set_ylim(ymin=0)
ax1.set_xlim(xmin=0)
### keep points with intercept variation ± 10 * median x-y difference
d = abs(df['y']-(df['x']*slope+intercept))
thresh = d.median()*10
df1 = df[d.lt(thresh)]
df1.plot.scatter('x', 'y', c='k', ax=ax2)
# plot ± threshold
ax2.plot(xs, slope*xs+intercept+thresh, c='grey', ls=':')
ax2.plot(xs, slope*xs+intercept-thresh, c='grey', ls=':')
# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax2)
ax2.set_ylim(ymin=0)
ax2.set_xlim(xmin=0)
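To get the row numbers the question asks for rather than a plot, the threshold rule from the right-hand panel can be reduced to a few lines. This is a minimal sketch under the same assumptions as the code above (robust fit from `siegelslopes`, threshold of 10 times the median residual); the names `resid` and `outlier_rows` are mine.

```python
import pandas as pd
from scipy.stats import siegelslopes

df = pd.DataFrame({'x': [5, 12, 24, 44, 22, 32, 22],
                   'y': [8, 14, 26, 47, 44, 35, 23]})

# robust fit of y on x (repeated medians)
slope, intercept = siegelslopes(df['y'], df['x'])

# absolute distance of each point from the fitted line
resid = (df['y'] - (df['x'] * slope + intercept)).abs()

# same rule as the right-hand panel: 10 times the median residual
thresh = 10 * resid.median()

# index labels of the rows at or beyond the threshold
outlier_rows = resid[resid >= thresh].index.tolist()
print(outlier_rows)
```

On the example data this should flag only row 4. The factor of 10 is a judgment call and would need adjusting for noisier data.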