Reach a target slope value by eliminating outliers using Python

Question

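I have a dataset from which I can eliminate at most two data points in order to reach a target slope of 10. My outlier rejection criterion is that if the slope is within an allowed range of the target value (10), it is fine. Anything that falls outside this range gets removed.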

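A set of trial data is shown below:

As the left side of the image shows, three slopes are obtained: 11.6, 10.5 and 9.4. But the target slope is 10.

On the right side of the data, I removed the data points that skew the slope, i.e. the ones that keep it from reaching the target slope of 10.

This is just a constructed dataset, but the concept is similar to the final dataset I need.

How can I do this in Python? Any help on this is much appreciated.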

Answer 1

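First, this can be done in Python if you already know the desired slope, but you need to be careful if you have a lot of data. Second, with 5% as the criterion, a slope of 10.5 would not be corrected.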
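To make the 5% criterion concrete, here is a small sketch that applies the same relative check as `abs(slope - desire_slope) / slope < 0.05` to the three slopes from the question. The helper name `within_tolerance` and its default arguments are made up for illustration.

```python
def within_tolerance(slope, target=10, tol=0.05):
    """Relative deviation from the target, measured against the fitted slope."""
    return abs(slope - target) / slope < tol


# The three slopes reported in the question
for s in (11.6, 10.5, 9.4):
    print(s, within_tolerance(s))
```

Under this check only 10.5 passes, so 11.6 and 9.4 are the fits that would need points removed.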

The solution you asked for

# some imports
import pandas as pd
from scipy.stats import norm
from scipy import stats
import matplotlib.pyplot as plt  # used for the plot further down

df = pd.read_csv('your_file.csv')
state = 'USA'
desire_slope = 10
# 'x' and 'y' are placeholder column names, replace them with your own;
# plain lists are used so that `del` removes items by position
x = df[df['Country'] == state]['x'].tolist()
y = df[df['Country'] == state]['y'].tolist()

'''to use for test
x = [4 + (i / 10) for i in range(100)]
y = [c * 11 + norm.rvs() * 4 for c in x]
'''
# distance of each point from the line y = desire_slope * x (no intercept)
z = [abs(v - desire_slope * c) for v, c in zip(y, x)]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(slope)
if abs(slope - desire_slope) / slope < 0.05:
    print("slope is fine")
else:
    # indices of the two points farthest from the desired line
    sorted_index_pos = [index for index, num in sorted(enumerate(z), key=lambda pair: pair[-1])][-2:]
    print(sorted_index_pos)
    # delete the larger index first so the smaller one still points at the right item
    for i in sorted(sorted_index_pos, reverse=True):
        del x[i]
        del y[i]

new_slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(new_slope)

Output:

11.08066739990693
[78, 85]
11.026005655263733
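As the output shows, dropping the two points farthest from the desired line only moved the slope from about 11.08 to about 11.03. A different approach, which is not what the code above does, is to try every way of removing up to two points and keep the removal whose refitted slope lands closest to the target. The sketch below assumes NumPy is available; the function name `best_removal` and its parameters are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_removal(x, y, target_slope, max_remove=2):
    """Return (removed_indices, slope) for the fit closest to target_slope.

    Every subset of at most `max_remove` points is tried, so the cost grows
    roughly with n**2 fits for max_remove=2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Start from the fit on the full data (nothing removed)
    best = ((), np.polyfit(x, y, 1)[0])
    for k in range(1, max_remove + 1):
        for idx in combinations(range(n), k):
            keep = np.ones(n, dtype=bool)
            keep[list(idx)] = False
            slope = np.polyfit(x[keep], y[keep], 1)[0]
            if abs(slope - target_slope) < abs(best[1] - target_slope):
                best = (idx, slope)
    return best


# Made-up data: an exact line y = 10x + 3 with two points pushed off it
x = list(range(10))
y = [10 * v + 3 for v in x]
y[2] += 30
y[8] -= 25
removed, slope = best_removal(x, y, target_slope=10)
print(removed, slope)
```

Because each candidate is refitted with its own intercept, this avoids the no-intercept shortcut, but it is a brute-force search and will get slow on large datasets.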

Why you need to be careful

First, we do not take the intercept into account, which can be a problem. Also, if I run the following:

x = [4 + (i / 100) for i in range(1000)]
y = [c * 10 + norm.rvs() * 4 for c in x]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("the slope here is: " + str(slope))
# line through the origin with the fitted slope (the intercept is ignored)
z = [c * slope for c in x]
print("average of values: " + str(sum(x) / len(x)))
plt.plot(x, y, 'b', x, z, 'r-')
plt.show()

I get the following output:

the slope here is: 10.04367376783041
average of values: 8.995

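This shows that the points are not necessarily evenly distributed on either side of the fitted line. Taking away the far points may make the dataset even more unbalanced and therefore not improve the slope. So be careful when doing this.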
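One way to address the intercept issue, as a sketch only: keep the slope fixed at the target, estimate the least-squares intercept for that slope (the mean of `y - target_slope * x`), and measure each point's distance from that line rather than from a line through the origin. The function name `residuals_about_target_line` is hypothetical and NumPy is assumed.

```python
import numpy as np


def residuals_about_target_line(x, y, target_slope):
    """Residuals about the best line with the slope fixed at target_slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares intercept when the slope is not free to vary
    intercept = np.mean(y - target_slope * x)
    residuals = y - (target_slope * x + intercept)
    return residuals, intercept


# Made-up data lying exactly on y = 10x + 5
x = [0, 1, 2, 3]
y = [5, 15, 25, 35]
res, b = residuals_about_target_line(x, y, target_slope=10)
print(res, b)
```

For this data every residual is zero, whereas `abs(v - desire_slope * c)` would report a distance of 5 for every point purely because of the offset.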

python

scipy

python-3.x

pandas

statsmodels