Marking outliers on a scatter plot

Question

I have a DataFrame that looks like this:

 print(df.head(10))

 day         CO2
   1  549.500000
   2  663.541667
   3  830.416667
   4  799.695652
   5  813.850000
   6  769.583333
   7  681.941176
   8  653.333333
   9  845.666667
  10  436.086957

I then use the following function and lines of code to get the outliers from the CO2 column:

import numpy as np

def estimate_gaussian(dataset):

    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

dataset = df['CO2'].values
mu, sigma, min_threshold, max_threshold = estimate_gaussian(dataset)

condition1 = (dataset < min_threshold)
condition2 = (dataset > max_threshold)

outliers1 = np.extract(condition1, dataset)
outliers2 = np.extract(condition2, dataset)

outliers = np.concatenate((outliers1, outliers2), axis=0)

This gives me the following result:

print(outliers)

[830.41666667 799.69565217 813.85       769.58333333 845.66666667]

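Now I want to mark those outliers in red on a scatter plot.

Below is the code I have so far. It marks a single outlier in red on the scatter plot, but I can't find a way to do this for every element of the outliers array, which is a numpy.ndarray: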

y = df['CO2']

x = df['day']

col = np.where(x<0,'k',np.where(y<845.66666667,'b','r'))

plt.scatter(x, y, c=col, s=5, linewidth=3)
plt.show()

This is what I get, but I want the same result for all the outliers. Could you help me?

https://ibb.co/Ns9V7Zz

Answer 1

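This may not be the most efficient solution, but I find it easier to call plt.scatter several times, passing a single x-y pair each time. Because we never create a new figure (for example with plt.figure()), every x-y pair is drawn on the same figure.

Then, in each iteration, we only need to check whether the y value is an outlier. If it is, we change the color keyword argument in the plt.scatter call.

Try this: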

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df['CO2'].values)

xs = df['day']
ys = df['CO2']

for x, y in zip(xs, ys):
    color = 'blue'  # non-outlier color
    if not min_threshold <= y <= max_threshold:  # condition for being an outlier
        color = 'red'  # outlier color
    plt.scatter(x, y, color=color)
plt.show()
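
For larger datasets, one scatter call per point can get slow. A vectorized variant of the same idea might look like the sketch below: build a boolean mask once and pass a list of colors to a single scatter call. It assumes only the ten rows shown in the question (so the thresholds, and therefore the flagged points, differ from the asker's full dataset), and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np

# Only the ten rows shown in the question; the full dataset is not available.
days = np.arange(1, 11)
co2 = np.array([549.5, 663.541667, 830.416667, 799.695652, 813.85,
                769.583333, 681.941176, 653.333333, 845.666667, 436.086957])

# Same rule as estimate_gaussian(): mean +/- 1.5 standard deviations.
mu, sigma = co2.mean(), co2.std()
min_threshold = mu - 1.5 * sigma
max_threshold = mu + 1.5 * sigma

# True where the value falls outside the thresholds.
is_outlier = (co2 < min_threshold) | (co2 > max_threshold)
colors = np.where(is_outlier, "red", "blue")

fig, ax = plt.subplots()
ax.scatter(days, co2, c=colors.tolist(), s=20)
# In an interactive session, plt.show() would display the figure.
```

The mask also makes it easy to count or inspect the flagged points, for example with co2[is_outlier].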

Answer 2

You can create an extra boolean column that records whether each point is an outlier (True) or not (False), and then use two scatter plots:

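# Boolean column: True where CO2 falls outside the thresholds computed in the question
df["outlier"] = (df["CO2"] < min_threshold) | (df["CO2"] > max_threshold)
plt.scatter(df.loc[df["outlier"], "day"], df.loc[df["outlier"], "CO2"], color="r")    # outliers
plt.scatter(df.loc[~df["outlier"], "day"], df.loc[~df["outlier"], "CO2"], color="b")  # other points
plt.show()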
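
As a self-contained illustration of the boolean-column approach, here is a minimal sketch that also adds a legend entry per group. It assumes only the ten rows shown in the question and recomputes the thresholds from them; the labels "normal" and "outlier" are arbitrary names chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Only the ten rows shown in the question.
df = pd.DataFrame({
    "day": range(1, 11),
    "CO2": [549.5, 663.541667, 830.416667, 799.695652, 813.85,
            769.583333, 681.941176, 653.333333, 845.666667, 436.086957],
})

mu = df["CO2"].mean()
sigma = df["CO2"].std(ddof=0)  # ddof=0 matches np.std as used in the question
df["outlier"] = (df["CO2"] < mu - 1.5 * sigma) | (df["CO2"] > mu + 1.5 * sigma)

fig, ax = plt.subplots()
# One scatter call per group so that each group gets its own legend entry.
for flag, color, label in [(False, "blue", "normal"), (True, "red", "outlier")]:
    subset = df[df["outlier"] == flag]
    ax.scatter(subset["day"], subset["CO2"], color=color, label=label)
ax.legend()
# In an interactive session, plt.show() would display the figure.
```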

Answer 3

I'm not sure what the idea behind your col list is, but you can replace col with:

col = ['red' if yy in list(outliers) else 'blue' for yy in y]
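
The comprehension rebuilds list(outliers) for every point. A vectorized equivalent can be sketched with np.isin, which does the membership test in one call. Exact float comparison is safe here only because the outliers are extracted from the very same array; the data below are just the ten rows shown in the question.

```python
import numpy as np

# Only the ten rows shown in the question.
y = np.array([549.5, 663.541667, 830.416667, 799.695652, 813.85,
              769.583333, 681.941176, 653.333333, 845.666667, 436.086957])

# Recreate the outliers array the same way the question does.
mu, sigma = y.mean(), y.std()
outliers = y[(y < mu - 1.5 * sigma) | (y > mu + 1.5 * sigma)]

# Membership test for all points at once, then map True/False to colors.
col = np.where(np.isin(y, outliers), "red", "blue")
```

The resulting col can be passed to the c argument of plt.scatter just like the list version.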

Answer 4

There are a few ways to do this. One is to build a sequence of colors based on your condition and pass it to the c argument.

df = pd.DataFrame({'CO2': {0: 549.5,
  1: 663.54166699999996,
  2: 830.41666699999996,
  3: 799.695652,
  4: 813.85000000000002,
  5: 769.58333300000004,
  6: 681.94117599999993,
  7: 653.33333300000004,
  8: 845.66666699999996,
  9: 436.08695700000004},
 'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}})

In [11]: colors = ['r' if n<750 else 'b' for n in df['CO2']]

In [12]: colors
Out[12]: ['r', 'r', 'b', 'b', 'b', 'b', 'r', 'r', 'b', 'r']

In [13]: plt.scatter(df['day'],df['CO2'],c=colors)

Or use np.where to create the sequence:

In [14]: colors = np.where(df['CO2'] < 750, 'r', 'b')
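
The single cutoff of 750 is only illustrative. For a two-sided rule like the one in the question, a pandas-flavored sketch could use Series.between and map the result to color codes. It assumes only the ten rows shown in the question, with thresholds recomputed from those rows.

```python
import pandas as pd

# Only the ten rows shown in the question.
df = pd.DataFrame({
    "day": range(1, 11),
    "CO2": [549.5, 663.541667, 830.416667, 799.695652, 813.85,
            769.583333, 681.941176, 653.333333, 845.666667, 436.086957],
})

mu = df["CO2"].mean()
sigma = df["CO2"].std(ddof=0)  # ddof=0 matches np.std as used in the question
low, high = mu - 1.5 * sigma, mu + 1.5 * sigma

# between() is inclusive on both ends by default: True means "not an outlier".
inside = df["CO2"].between(low, high)
colors = inside.map({True: "b", False: "r"})
```

Passing colors.tolist() as the c argument of plt.scatter then colors the points accordingly.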

Answer 5

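Here is a quick solution:

I'll recreate what you have already started. You only shared the head of your DataFrame, but in any case I just inserted a few random outliers. It looks like your estimate_gaussian() function can only return two outliers?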

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([549.500000,
                50.0000000,
                830.416667,
                799.695652,
                1200.00000,
                769.583333,
                681.941176,
                1300.00000,
                845.666667,
                436.086957], 
                columns=['CO2'],
                index=list(range(1,11)))

def estimate_gaussian(dataset):

    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df.values)

condition1 = (df < min_threshold)
condition2 = (df > max_threshold)

outliers1 = np.extract(condition1, df)
outliers2 = np.extract(condition2, df)

outliers = np.concatenate((outliers1, outliers2), axis=0)

Then we plot:

# Select the rows whose CO2 value is one of the detected outliers
df_red = df[df['CO2'].isin(outliers)]

plt.scatter(df.index, df['CO2'])
plt.scatter(df_red.index, df_red['CO2'], c='red')
plt.show()

Let me know if you need something more elaborate!

Tags: python, plot, matplotlib, scatter-plot, outliers