Averaging Data in Bins
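I have two lists, one of depths and one of chlorophyll values, that correspond element by element. I want to average the chlorophyll data over every 0.5 m of depth.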
chl = [0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33]
depth = [0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3]
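The depth bins do not always contain the same number of readings, and the depths do not always start at 0.0 or fall on 0.5 intervals. The chlorophyll data always lines up with the depth data, though. The chlorophyll averages must not be sorted in ascending order; they need to stay in the correct order according to depth. The depth and chlorophyll lists are very long, so I can't do this by hand.
How would I make 0.5 m depth bins with the averaged chlorophyll data in them?
Goal: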
depth = [0.5,1.0,1.5,2.0,2.5]
chlorophyll = [avg1,avg2,avg3,avg4,avg5]
For example:
avg1 = np.mean([0.4, 0.1, 0.04, 0.05, 0.4])
One way is to use numpy.digitize to assign each depth to a bin, then use a dictionary or list comprehension to compute the mean of each bin.
import numpy as np
chl = np.array([0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33])
depth = np.array([0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3])
bins = np.array([0,0.5,1.0,1.5,2.0,2.5])
A = np.vstack((np.digitize(depth, bins), chl)).T
res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}
# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}
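Or as a list, which is closer to the format you asked for. Index 0 corresponds to depths below bins[0], so drop the first entry (res_lst[1:]) to line the values up with the upper bin edges: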
res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]
# [nan, 0.198, nan, 0.28, 0.355, 0.265]
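The same digitize idea can be sketched without a Python-level loop by swapping in np.bincount for the per-bin sums and counts. This is only an illustration under the assumption that empty bins should come out as NaN; the names idx, sums, counts, bin_means and bin_upper_edges are made up for this example.

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
bins = np.array([0, 0.5, 1.0, 1.5, 2.0, 2.5])

# digitize returns i such that bins[i-1] <= depth < bins[i];
# 0 means "below bins[0]" and len(bins) means "at or above bins[-1]".
idx = np.digitize(depth, bins)

# Per-bin sums and counts; minlength covers every index from 0 to len(bins).
sums = np.bincount(idx, weights=chl, minlength=len(bins) + 1)
counts = np.bincount(idx, minlength=len(bins) + 1)

# Keep only the real bins (indices 1 .. len(bins)-1).
sums, counts = sums[1:len(bins)], counts[1:len(bins)]

# Empty bins would divide by zero, so silence the warning and mark them NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    bin_means = np.where(counts > 0, sums / counts, np.nan)

bin_upper_edges = bins[1:]  # 0.5, 1.0, 1.5, 2.0, 2.5
```

Here bin_upper_edges and bin_means have one entry per bin and stay in depth order, which matches the layout of the goal lists in the question.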
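Here is one way using pandas.cut:
import pandas as pd
df = pd.DataFrame({'chl': chl, 'depth': depth})
# observed=False keeps empty bins in the result as NaN
df.groupby(pd.cut(df.depth, bins=[0, 0.5, 1, 1.5, 2, 2.5]), observed=False).chl.mean()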
Out[456]:
depth
(0.0, 0.5]    0.198
(0.5, 1.0]      NaN
(1.0, 1.5]    0.280
(1.5, 2.0]    0.355
(2.0, 2.5]    0.265
Name: chl, dtype: float64
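To get from the interval-indexed Series to the two plain lists in the question, one option is to label each interval by its upper edge. The sketch below assumes that labelling convention; edges, depth_bin, depth_out and chlorophyll_out are names invented for the example.

```python
import numpy as np
import pandas as pd

chl = [0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
       0.43, 0.65, 0.22, 0.12, 0.2, 0.33]
depth = [0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
         1.53, 1.67, 1.79, 1.87, 2.1, 2.3]
edges = [0, 0.5, 1.0, 1.5, 2.0, 2.5]

df = pd.DataFrame({"chl": chl, "depth": depth})

# Label each interval (a, b] by its upper edge b.
df["depth_bin"] = pd.cut(df["depth"], bins=edges, labels=edges[1:])

# observed=False keeps empty bins in the result, with NaN as their mean.
means = df.groupby("depth_bin", observed=False)["chl"].mean()

depth_out = [float(x) for x in means.index]
chlorophyll_out = means.tolist()
```

Both lists follow the bin order, so the averages stay aligned with depth rather than being sorted by value.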
Here is a vectorized NumPy solution that uses np.searchsorted to find the indices where the bins change and np.add.reduceat to compute the per-bin sums. It assumes depth is sorted in ascending order:
def bin_data(chl, depth, bin_start=0, bin_length=0.5):
    # Get number of intervals and hence the bin-length-spaced depth array
    n = int(np.ceil(depth[-1] / bin_length))
    depthl = np.linspace(start=bin_start, stop=bin_length * n, num=n + 1)
    # Indices along depth array where the intervaled array would have bin shifts
    idx = np.searchsorted(depth, depthl)
    # Number of elements in each bin (bin-lengths)
    lens = np.diff(idx)
    # Get summations for each bin & divide by bin lengths for binned avg o/p
    # For bins with lengths==0, set them as some invalid specifier, say NaN
    return np.where(lens == 0, np.nan, np.add.reduceat(chl, idx[:-1]) / lens)
Sample run:
In [83]: chl
Out[83]:
array([0.4 , 0.1 , 0.04, 0.05, 0.4 , 0.2 , 0.6 , 0.09, 0.23, 0.43, 0.65,
       0.22, 0.12, 0.2 , 0.33])
In [84]: depth
Out[84]:
array([0.1  , 0.3  , 0.31 , 0.44 , 0.49 , 1.1  , 1.145, 1.33 , 1.49 ,
       1.53 , 1.67 , 1.79 , 1.87 , 2.1  , 2.3  ])
In [85]: bin_data(chl, depth, bin_start=0, bin_length= 0.5)
Out[85]: array([0.198, nan, 0.28 , 0.355, 0.265])
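Because np.searchsorted only gives meaningful indices on sorted input, readings that arrive out of depth order would need sorting first. A minimal sketch of that step follows, using a copy of the function above with the division warnings suppressed; the reversed arrays are hypothetical input made up for the example.

```python
import numpy as np

def bin_data(chl, depth, bin_start=0, bin_length=0.5):
    # Same logic as the function above; expects depth sorted ascending.
    n = int(np.ceil(depth[-1] / bin_length))
    depthl = np.linspace(start=bin_start, stop=bin_length * n, num=n + 1)
    idx = np.searchsorted(depth, depthl)
    lens = np.diff(idx)
    # Empty bins divide by zero; silence the warning and return NaN for them.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(lens == 0, np.nan, np.add.reduceat(chl, idx[:-1]) / lens)

# Hypothetical unsorted input: the question's data in reverse order.
chl_unsorted = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                         0.43, 0.65, 0.22, 0.12, 0.2, 0.33])[::-1]
depth_unsorted = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                           1.53, 1.67, 1.79, 1.87, 2.1, 2.3])[::-1]

# Sort both arrays by depth so each chlorophyll value stays with its depth.
order = np.argsort(depth_unsorted, kind="stable")
result = bin_data(chl_unsorted[order], depth_unsorted[order])
```

Indexing both arrays with the same order array is what keeps the pairing intact; sorting depth alone would scramble it.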
I'm surprised that scipy.stats.binned_statistic hasn't been mentioned yet. You can calculate the mean directly with it and specify the bins with optional parameters.
from scipy.stats import binned_statistic
mean_stat = binned_statistic(depth, chl,
                             statistic='mean',
                             bins=5,
                             range=(0, 2.5))
mean_stat.statistic
# array([0.198, nan, 0.28 , 0.355, 0.265])
mean_stat.bin_edges
# array([0. , 0.5, 1. , 1.5, 2. , 2.5])
mean_stat.binnumber
# array([1, 1, 1, ..., 4, 5, 5])
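When the maximum depth is not known in advance, the bin edges can be derived from the data instead of hard-coding bins=5 and range=(0, 2.5). The sketch below assumes bins start at 0 and are 0.5 m wide; bin_length, n_bins, edges, depth_out and chlorophyll_out are illustrative names.

```python
import numpy as np
from scipy.stats import binned_statistic

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])

bin_length = 0.5  # assumed bin width in metres

# Edges run from 0 to the first multiple of bin_length at or beyond max depth.
n_bins = int(np.ceil(depth.max() / bin_length))
edges = np.arange(n_bins + 1) * bin_length

# Passing a sequence as bins uses it as the explicit bin edges.
result = binned_statistic(depth, chl, statistic="mean", bins=edges)

depth_out = result.bin_edges[1:]      # upper edge of each bin
chlorophyll_out = result.statistic    # mean per bin, NaN where a bin is empty
```

Unlike the searchsorted approach, this does not require depth to be sorted, since each value is assigned to a bin independently.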