Finding the mean value (or rolling average) of a scattered dataset with Python
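I have a large two-column dataset that represents scattered functional behaviour. Suppose that for each time value (x) there are a number of widely spread measurement values (y). For each time value (or for each time interval, as in a histogram), I would like to obtain the mean of the measured values y. I have been looking into rolling/moving averages and spline interpolation, but I am stuck. Below is a minimal code example of what should happen: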
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import UnivariateSpline

# generate test data, which usually is read in from a huge file
def testdata(x):
    return 1/(1+10.*x**2)

x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

# convert it to a dataframe, as I usually work with them
df = pd.DataFrame(list(zip(x,y)))

# sort the x-values, as they are randomly distributed in the dataset
df_new = df.sort_values(by=[0])

# show the data and what the (analytical) average should look like
plt.scatter(df_new[0],df_new[1],s=1)
plt.scatter(df_new[0],testdata(df_new[0]), s=1, c='r')

# try a spline - however it fails
spl = UnivariateSpline(df_new.iloc[:, 0], df_new.iloc[:, 1])
xs = np.linspace(-1, 1, 10000)
plt.plot(xs, spl(xs), 'g--', lw=3)
plt.show()
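So blue is my data, red is what the mean should look like (which I obviously know in this test case), and green is what the spline approach gives me.
Surely some of you know a better way to obtain the red curve with a smart (built-in) algorithm?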
You can round the x values and then groupby to get the mean.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# generate test data
def testdata(x):
    return 1/(1+10.*x**2)

# create x, y
x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

# convert to df and sort values inplace
df = pd.DataFrame({'x': x, 'y': y})
df.sort_values(by='x', inplace=True)

# round x values to round_by decimals, then group by the rounded value to create bins
round_by = 2
bins = df.groupby(df.x.round(round_by)).mean()

# plot the raw data and the per-bin mean
fig, ax = plt.subplots()
ax.scatter(df.x, df.y, s=1)
ax.plot(bins.index, bins.y, 'g--', lw=3)
plt.show()