How to find where two columns have a bigger difference / are outliers (Python)

Question

I have these two arrays (two made-up example arrays):

x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]

The two columns are correlated, and the pair x[4], y[4] is the outlier in this data.

How would I go through the dataframe and return the column or column number that contains the outlier?

Edit: apologies. Here is the dataframe:

import pandas as pd

df = pd.DataFrame({'x':x, 'y':y})

Answer 1

Maybe this is too simplistic, but it seems to meet the requirement:

x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]
# absolute difference of each (x, y) pair
d = [abs(_x - _y) for _x, _y in zip(x, y)]
# position of the largest difference
i = d.index(max(d))
print(x[i], y[i])
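
Note that the snippet above reports only the single pair with the largest gap. If more than one position might be suspicious, one possible variation is to flag every position whose absolute difference exceeds some multiple of the median difference. This is only a sketch and not part of the original answer; the helper name `outlier_positions` and the factor `k=3` are assumptions chosen for illustration.

```python
from statistics import median

def outlier_positions(x, y, k=3):
    """Return the indices where |x - y| is more than k times the median gap.

    Hypothetical helper; k=3 is an arbitrary illustrative choice.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    cutoff = k * median(diffs)
    return [i for i, d in enumerate(diffs) if d > cutoff]

x = [5, 12, 24, 44, 22, 32, 22]
y = [8, 14, 26, 47, 44, 35, 23]
print(outlier_positions(x, y))  # [4]
```

Using the median rather than the mean keeps the cutoff from being dragged upward by the outlier itself.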

Answer 2

There is not one outlier-removal method but dozens.

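Given the linear relationship between x and y, I would say it is best to plot the data first and then make a reasoned decision about how to remove the outliers.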

The relationship here is clearly linear. I used robust linear regression with scipy.stats.siegelslopes to get a robust fit line.
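
To give a feel for why this kind of fit shrugs off a single bad point, here is a minimal pure-Python sketch of the repeated-median idea behind Siegel's estimator: for each point take the median slope to all other points, then take the median of those medians. It is an illustration under that assumption, not SciPy's implementation; the function name `repeated_median_line` is made up for this example.

```python
from statistics import median

def repeated_median_line(x, y):
    """Sketch of a repeated-median fit of y = slope * x + intercept."""
    per_point = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Slopes from point i to every other point with a different x value.
        slopes = [(yj - yi) / (xj - xi)
                  for j, (xj, yj) in enumerate(zip(x, y))
                  if j != i and xj != xi]
        if slopes:
            per_point.append(median(slopes))
    # The median of the per-point medians is the robust slope.
    slope = median(per_point)
    # The intercept is the median of the residuals y - slope * x.
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept

x = [5, 12, 24, 44, 22, 32, 22]
y = [8, 14, 26, 47, 44, 35, 23]
slope, intercept = repeated_median_line(x, y)
print(slope, intercept)  # 1.015625 2.3125
```

The point (22, 44) produces wild pairwise slopes, but they sit at the tails of each sorted list and never reach the median, so the fitted line follows the other six points.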

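I devised a couple of outlier-removal methods: keeping points within ±10% of the fitted slope, and keeping points within ±10 times the median difference from the fitted line. In comparison, the (valid) method proposed by @MichaelSzczesny is equivalent to the right-hand one with a threshold of ~15 (I used 6).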

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import siegelslopes

# df is the dataframe defined in the question

f, (ax1, ax2) = plt.subplots(ncols=2)

xs = np.arange(0, 50)
slope, intercept = siegelslopes(df['y'], df['x'])
ax1.plot(xs, slope*xs+intercept, ls='--')
ax2.plot(xs, slope*xs+intercept, ls='--')

### variation of slope

# keep points with slope variation < 10%
df1 = df[np.log10(df['y']/(df['x']*slope+intercept)).lt(0.1)]
df1.plot.scatter('x', 'y', c='k', ax=ax1)

# plot ± 10%
ax1.plot(xs, slope*1.1*xs+intercept, c='grey', ls=':')
ax1.plot(xs, slope*0.9*xs+intercept, c='grey', ls=':')

# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax1)

ax1.set_ylim(ymin=0)
ax1.set_xlim(xmin=0)

### keep points with intercept variation ± 10 * median x-y difference

d = abs(df['y']-(df['x']*slope+intercept))
thresh = d.median()*10
df1 = df[d.lt(thresh)]
df1.plot.scatter('x', 'y', c='k', ax=ax2)

# plot ± threshold
ax2.plot(xs, slope*xs+intercept+thresh, c='grey', ls=':')
ax2.plot(xs, slope*xs+intercept-thresh, c='grey', ls=':')

# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax2)

ax2.set_ylim(ymin=0)
ax2.set_xlim(xmin=0)
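
If only the positions of the outliers are needed rather than a plot, the second rule above can be wrapped in a small function. This is a sketch under the same assumptions as the plotting code (columns named `x` and `y`, a factor of 10 times the median residual); the name `residual_outliers` is hypothetical, and unlike the plotting code it treats a point exactly on the threshold as an inlier.

```python
import pandas as pd
from scipy.stats import siegelslopes

def residual_outliers(df, factor=10):
    """Return index labels whose distance from the robust line exceeds
    factor times the median distance (hypothetical helper)."""
    slope, intercept = siegelslopes(df['y'], df['x'])
    resid = (df['y'] - (df['x'] * slope + intercept)).abs()
    return df.index[resid > factor * resid.median()].tolist()

df = pd.DataFrame({'x': [5, 12, 24, 44, 22, 32, 22],
                   'y': [8, 14, 26, 47, 44, 35, 23]})
print(residual_outliers(df))  # [4]
```

The returned labels can be passed straight to `df.loc[...]` to inspect the flagged rows or to `df.drop(...)` to remove them.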

python

arrays

numpy

outliers

pandas