Finding the mean value (or rolling average) of a scattered dataset with Python

Question

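I have a large two-column dataset that represents scattered functional behavior. Assume that for every time value (x) there is a certain number of widely spread measurement values (y). For each time value (or for a histogram bin covering a certain time interval) I would like to get the mean of the measurements y that fall there. I have been looking into rolling/moving averages and spline interpolation, but I am stuck. Below is a minimal code example of what should happen: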

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import UnivariateSpline

#generate testdata which usually is read in from a huge file
def testdata(x):
    return 1/(1+10.*x**2)

x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

#convert it to dataframes, as I usually work with them
df = pd.DataFrame(list(zip(x,y)))

#sort the x-values as they are randomly distributed in the dataset
df_new = df.sort_values(by=[0])

#show the data and what the analytical average should look like
plt.scatter(df_new[0],df_new[1],s=1)
plt.scatter(df_new[0],testdata(df_new[0]), s=1, c='r')

#try a spline - however it fails
spl = UnivariateSpline(df_new.iloc[:, 0], df_new.iloc[:, 1])
xs = np.linspace(-1, 1, 10000)
plt.plot(xs, spl(xs), 'g--', lw=3)

plt.show()

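So blue is my data, red is what the mean should look like (which I obviously know in this test case), and green is what the spline approach gives me.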

Surely some of you know a better way to obtain the red curve with a smart (built-in) algorithm?

Answer 1

You can round the x values and then group by the rounded values to get the mean of y in each bin.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# generate testdata
def testdata(x):
    return 1/(1+10.*x**2)

# create x, y
x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

# convert to df and sort values inplace
df = pd.DataFrame({'x': x, 'y': y})
df.sort_values(by='x', inplace=True)

# round x values then group by to create bins
round_by = 2
bins = df.groupby(df.x.round(round_by)).mean()

# plot
fig, ax = plt.subplots()

ax.scatter(df.x, df.y, s=1)
ax.plot(bins.index, bins.y, 'g--', lw=3)

plt.show()


python

numpy

average

pandas