Resample a Pandas dataframe with coefficients
I have a dataframe with the following columns: {'day', 'measurement'}
A day may have several measurements (or none at all).
For example:
day | measurement
1 | 20.1
1 | 20.9
3 | 19.2
4 | 20.0
4 | 20.2
and a set of coefficients:
coef={-1:0.2, 0:0.6, 1:0.2}
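My goal is to resample the data and average it using the coefficients (missing data should be excluded from the average).
This is the code I wrote to compute it: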
window = [-1, 0, 1]
for d in df['day'].unique():  # d ranges over the days present in df
    # Offsets whose neighbouring day (d - i) has at least one measurement
    present = [i for i in window if (df['day'] == d - i).any()]
    num = sum(coef[i] * df.loc[df['day'] == d - i, 'measurement'].mean() for i in present)
    df.loc[df['day'] == d, 'resampled_measurement'] = num / sum(coef[i] for i in present)
For the example above, the output should be:
Day measurement
1 20.500
2 19.850
3 19.425
4 19.875
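The expected values can be checked by hand. This is a sketch of the arithmetic, assuming the weights of missing days are dropped and the remaining weights are renormalized. The symbols r_d, \bar{m}_d and W_d are notation introduced here for illustration, not part of the original code:

```latex
\[
r_d = \frac{\sum_{i \in W_d} c_i \, \bar{m}_{d-i}}{\sum_{i \in W_d} c_i},
\qquad
W_d = \{\, i \in \{-1, 0, 1\} : \text{day } d - i \text{ has at least one measurement} \,\}
\]
% Daily means: \bar{m}_1 = 20.5, \bar{m}_3 = 19.2, \bar{m}_4 = 20.1 (day 2 has no data)
\[
r_2 = \frac{0.2 \cdot 19.2 + 0.2 \cdot 20.5}{0.2 + 0.2} = \frac{7.94}{0.4} = 19.85,
\qquad
r_3 = \frac{0.2 \cdot 20.1 + 0.6 \cdot 19.2}{0.2 + 0.6} = \frac{15.54}{0.8} = 19.425
\]
```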
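The problem is that the code takes forever to run, and I am fairly sure there is a better way to resample with coefficients.
Any suggestions would be appreciated!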
Here is one possible solution for what you are looking for:
# Imports used below
In [1]: import numpy as np
...: import pandas as pd
# This is your data
In [2]: data = pd.DataFrame({
...: 'day': [1, 1, 3, 4, 4],
...: 'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
...: })
# Pre-compute every day's average, then reindex so days without data appear as NaN
In [3]: measurement = data.groupby('day')['measurement'].mean()
In [4]: measurement = measurement.reindex(np.arange(data.day.min(), data.day.max() + 1))
In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2})
# Create a matrix with the time-shifted measurements
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key in coef.index})
In [7]: matrix
Out[7]:
-1 0 1
day
1 NaN 20.5 NaN
2 19.2 NaN 20.5
3 20.1 19.2 NaN
4 NaN 20.1 19.2
# Take a weighted average of the matrix, dividing by the sum of the weights that were actually used
In [8]: (matrix * coef).sum(axis=1) / (matrix.notnull() * coef).sum(axis=1)
Out[8]:
day
1 20.500
2 19.850
3 19.425
4 19.875
dtype: float64
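The steps above can also be collected into a single function. This is only a minimal sketch: the name `resample_with_coefficients` and the parameters `day_col` and `value_col` are hypothetical, and it assumes the day column holds integers.

```python
import numpy as np
import pandas as pd


def resample_with_coefficients(data, coef, day_col="day", value_col="measurement"):
    """Weighted average of daily means over a window, skipping missing days.

    Hypothetical helper: assumes integer days and a dict/Series of
    offset -> weight, as in the example above.
    """
    coef = pd.Series(coef)
    # Daily means on a gap-free day index; days with no rows become NaN.
    daily = data.groupby(day_col)[value_col].mean()
    full_index = np.arange(daily.index.min(), daily.index.max() + 1)
    daily = daily.reindex(full_index)
    # One column per offset, holding the daily means shifted by that offset.
    shifted = pd.DataFrame({offset: daily.shift(offset) for offset in coef.index})
    # Weighted sum of the available values (NaN is skipped by sum) ...
    weighted_sum = (shifted * coef).sum(axis=1)
    # ... divided by the total weight of the values that were present.
    weight_total = (shifted.notna() * coef).sum(axis=1)
    return weighted_sum / weight_total


data = pd.DataFrame({
    "day": [1, 1, 3, 4, 4],
    "measurement": [20.1, 20.9, 19.2, 20.0, 20.2],
})
result = resample_with_coefficients(data, {-1: 0.2, 0: 0.6, 1: 0.2})
print(result)
```

On the sample data this should reproduce the four values shown in Out[8]. A day whose whole window is empty ends up as NaN, because both the weighted sum and the total weight are zero.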