Python: finding outliers from a trend of data
Please note that this post is not a duplicate of the following related post on SO:
Find The Parity Outlier Python
I have data obtained from an experiment:
import matplotlib.pyplot as plt
x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
plt.plot(x, y_NaOH)
plt.plot(x, y_NaHCO3)
plt.plot(x, y_BaOH2)
plt.show()
However, I ran into a problem when marking the outliers. This is what I have tried:
import matplotlib.pyplot as plt
import statistics
x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
# plt.plot(x, y_NaOH)
# plt.plot(x, y_NaHCO3)
# plt.plot(x, y_BaOH2)
# plt.show()
def detect_outlier(data_1):
    # Keep only the values whose z-score is within `threshold` standard deviations of the mean
    threshold = 1
    mean_1 = statistics.mean(data_1)
    std_1 = statistics.stdev(data_1)
    result_dataset = [y for y in data_1 if abs((y - mean_1) / std_1) <= threshold]
    return result_dataset
if __name__ == "__main__":
    dataset = y_NaHCO3
    result_dataset = detect_outlier(dataset)
    print(result_dataset)
    # [374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0]
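Unfortunately, this method always filters out the edge values of my data, whereas I actually want to remove the points that do not fit the curve.
I could also inspect the shape of each curve and mark the outliers manually, but that is really time-consuming. I would greatly appreciate your help.
Expected output
I want to plot the data as lines and mark the outliers as points, for example: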
from matplotlib import pyplot as plt
x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
o_NaOH = [542.2]
o_NaHCO3 = [308.0]
o_BaOH2 = [493.1]
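def sketch_rejected(xv, yv, y_out):
    # Split the series into kept points (nx, ny) and the x positions of the rejected ones
    nx = []
    ny = []
    x_out = []
    for ii, dd in enumerate(yv):
        if dd not in y_out:
            nx.append(xv[ii])
            ny.append(dd)
        else:
            x_out.append(xv[ii])
    plt.plot(nx, ny)  # line through the kept points
    plt.scatter(x_out, y_out)  # rejected points drawn as dots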
sketch_rejected(x, y_NaOH, o_NaOH)
sketch_rejected(x, y_NaHCO3, o_NaHCO3)
sketch_rejected(x, y_BaOH2, o_BaOH2)
plt.show()
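The outliers are the spiky parts of the curve, where a point doesn't fit the gradient.
Can I first regress the data using a module and then compute the outliers, rather than plotting each graph and identifying the outliers manually?
In real life I have a large number of test results, and I don't know the general equation for each of them.
Thank you for your help.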
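One way to "regress first, then compute the outliers" is sketched below. It is only an illustration under stated assumptions: NumPy is assumed to be available, the trend is assumed to be roughly a low-degree polynomial (`degree=2`), and the cutoff of 3.5 robust z-scores is an arbitrary starting value. The name `residual_outliers` is hypothetical. The idea is to score the residuals from the fitted trend rather than the raw y values, so that points at the edges of the range are not penalised simply for being far from the mean.

```python
import numpy as np


def residual_outliers(x, y, degree=2, threshold=3.5):
    """Return indices of points whose residual from a polynomial fit is unusually large.

    Assumptions (not from the original post): a polynomial of the given degree
    describes the trend, and a robust z-score above `threshold` means "outlier".
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares polynomial fit and the residual of every point from it
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    # Robust spread: median absolute deviation, scaled to match a standard deviation
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    scale = max(1.4826 * mad, 1e-12)  # floor avoids dividing by zero
    z = np.abs(residuals - med) / scale
    return np.flatnonzero(z > threshold).tolist()


x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0,
            704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
idx = residual_outliers(x, y_NaHCO3)
print([y_NaHCO3[i] for i in idx])  # values flagged with the default settings
```

Both `degree` and `threshold` would need tuning per data set. On noisy measurements the default cutoff may flag nothing, and a single least-squares fit is itself pulled toward strong outliers.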
There are many data science repositories on GitHub; you just need to do a git installation:
from outliers.variance import graph
x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
graph(
    xs=x,
    ys=[y_NaOH, y_NaHCO3, y_BaOH2],
    title='title',
    legends=[f'legend {i + 1}' for i in range(len(x))],
    xlabel='xlabel',
    ylabel='ylabel',
)
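For the "spiky point that doesn't fit the gradient" definition used in the question, a dependency-free alternative can be sketched with the standard library alone. The function names, the 1.4826 * MAD scaling and the `threshold` value are assumptions made for illustration, not part of any package mentioned above. Each interior point is compared with the straight line through its two neighbours, and a point is flagged only if its deviation is both large and a local peak, because a spike also distorts the scores of the points next to it.

```python
import statistics


def neighbour_deviations(x, y):
    """Offset of each interior point from the straight line through its two neighbours."""
    dev = [0.0] * len(y)  # end points have a single neighbour, so they stay at 0
    for i in range(1, len(y) - 1):
        t = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
        expected = y[i - 1] + t * (y[i + 1] - y[i - 1])
        dev[i] = y[i] - expected
    return dev


def flag_spikes(x, y, threshold=3.5):
    """Return indices of interior points that deviate strongly from their neighbours.

    Assumes at least three points; the first and last point are never flagged.
    """
    interior = neighbour_deviations(x, y)[1:-1]
    med = statistics.median(interior)
    abs_dev = [0.0] + [abs(d - med) for d in interior] + [0.0]
    # Robust scale: 1.4826 * MAD; the floor avoids dividing by zero on very clean data
    scale = max(1.4826 * statistics.median(abs_dev[1:-1]), 1e-12)
    flagged = []
    for i in range(1, len(y) - 1):
        # A spike drags its neighbours' scores too, so keep only the local peak
        is_local_peak = abs_dev[i] >= abs_dev[i - 1] and abs_dev[i] >= abs_dev[i + 1]
        if abs_dev[i] / scale > threshold and is_local_peak:
            flagged.append(i)
    return flagged


x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2,
          576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
# threshold=2.0 is an illustrative value; it has to be tuned for each data set
print(flag_spikes(x, y_NaOH, threshold=2.0))
```

This sketch cannot judge the first or last point, so an outlier at the start of a series (such as the 493.1 marked for `y_BaOH2` in the question) would need a different rule, for example the residual from a fitted curve.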