Combine a 2d numpy array grouped by values in a column
I have this array:
[['Burgundy Bichon Frise' '1' '137']
['Pumpkin Pomeranian' '1' '182']
['Purple Puffin' '1' '125']
['Wisteria Wombat' '1' '109']
['Burgundy Bichon Frise' '2' '168']
['Pumpkin Pomeranian' '2' '141']
['Purple Puffin' '2' '143']
['Wisteria Wombat' '2' '167']
['Burgundy Bichon Frise' '3' '154']
['Pumpkin Pomeranian' '3' '175']
['Purple Puffin' '3' '128']
['Wisteria Wombat' '3' '167']]
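The first column contains the name of an animal, the second is the region it is in, and the third is the population. I need to get the average population of each species across the regions, and the max and min for each species. So the average for "Purple Puffin" would be (125+143+128)/3 = 132.
I am confused about how to get the numpy array to only count the populations that belong to each group.
Would it be better or easier to split this 2D array into multiple 2D arrays?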
This looks more like a task for pandas. We can first construct a dataframe:
import pandas as pd
df = pd.DataFrame([
['Burgundy Bichon Frise','1','137'],
['Pumpkin Pomeranian','1','182'],
['Purple Puffin','1','125'],
['Wisteria Wombat','1','109'],
['Burgundy Bichon Frise','2','168'],
['Pumpkin Pomeranian','2','141'],
['Purple Puffin','2','143'],
['Wisteria Wombat','2','167'],
['Burgundy Bichon Frise','3','154'],
['Pumpkin Pomeranian','3','175'],
['Purple Puffin','3','128'],
['Wisteria Wombat','3','167']], columns=['animal', 'region', 'n'])
Next we can convert region and n to numeric types, which makes it easier to calculate statistics:
df.region = pd.to_numeric(df.region)
df.n = pd.to_numeric(df.n)
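If the data already lives in a NumPy string array like the one in the question, the dataframe can probably be built from it directly instead of retyping the rows. A minimal sketch, assuming a shortened four-row version of the array (the name `a` and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's array; every value is a string.
a = np.array([
    ['Burgundy Bichon Frise', '1', '137'],
    ['Pumpkin Pomeranian', '1', '182'],
    ['Purple Puffin', '1', '125'],
    ['Wisteria Wombat', '1', '109'],
])

df = pd.DataFrame(a, columns=['animal', 'region', 'n'])

# Convert both numeric-looking columns in one step.
df[['region', 'n']] = df[['region', 'n']].apply(pd.to_numeric)

print(df.dtypes)
```

After the conversion, `region` and `n` hold integers, so aggregates such as `mean()` behave numerically rather than operating on strings.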
Finally we can perform a .groupby(..) and then calculate an aggregate, for example:
>>> df[['animal', 'n']].groupby('animal').min()
n
animal
Burgundy Bichon Frise 137
Pumpkin Pomeranian 141
Purple Puffin 125
Wisteria Wombat 109
>>> df[['animal', 'n']].groupby('animal').max()
n
animal
Burgundy Bichon Frise 168
Pumpkin Pomeranian 182
Purple Puffin 143
Wisteria Wombat 167
>>> df[['animal', 'n']].groupby('animal').mean()
n
animal
Burgundy Bichon Frise 153.000000
Pumpkin Pomeranian 166.000000
Purple Puffin 132.000000
Wisteria Wombat 147.666667
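The three aggregates can also be computed in a single call with `.agg`. A minimal sketch, assuming the same `animal`, `region` and `n` columns (the names `stats` and `region_stats` are only for illustration); grouping by `region` instead of `animal` gives the per-region figures in the same way:

```python
import pandas as pd

df = pd.DataFrame([
    ['Burgundy Bichon Frise', 1, 137],
    ['Pumpkin Pomeranian', 1, 182],
    ['Purple Puffin', 1, 125],
    ['Wisteria Wombat', 1, 109],
    ['Burgundy Bichon Frise', 2, 168],
    ['Pumpkin Pomeranian', 2, 141],
    ['Purple Puffin', 2, 143],
    ['Wisteria Wombat', 2, 167],
    ['Burgundy Bichon Frise', 3, 154],
    ['Pumpkin Pomeranian', 3, 175],
    ['Purple Puffin', 3, 128],
    ['Wisteria Wombat', 3, 167],
], columns=['animal', 'region', 'n'])

# min, max and mean of n for every animal, one column per statistic
stats = df.groupby('animal')['n'].agg(['min', 'max', 'mean'])
print(stats)

# the same statistics, grouped by region instead
region_stats = df.groupby('region')['n'].agg(['min', 'max', 'mean'])
print(region_stats)
```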
Edit: obtaining the minimum row per animal
We can use idxmin/idxmax to obtain the index labels of the smallest/largest row per animal, and then use df.loc[..] to fetch those rows, like:
>>> df.loc[df.groupby('animal')['n'].idxmin()]
animal region n
0 Burgundy Bichon Frise 1 137
5 Pumpkin Pomeranian 2 141
2 Purple Puffin 1 125
3 Wisteria Wombat 1 109
>>> df.loc[df.groupby('animal')['n'].idxmax()]
animal region n
4 Burgundy Bichon Frise 2 168
1 Pumpkin Pomeranian 1 182
6 Purple Puffin 2 143
7 Wisteria Wombat 2 167
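Here 0, 5, 2, 3 (for idxmin) are the index labels of the dataframe, which coincide with the "row numbers" because the default integer index is used.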
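Because idxmin and idxmax return index labels rather than positions, label-based selection is the safe choice once the index is no longer 0, 1, 2, and so on. A small sketch with a made-up, non-default index (the labels 10 to 40 are purely hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'animal': ['Purple Puffin', 'Purple Puffin',
               'Wisteria Wombat', 'Wisteria Wombat'],
    'region': [1, 2, 1, 2],
    'n': [125, 143, 109, 167],
}, index=[10, 20, 30, 40])  # hypothetical non-default index labels

# idxmin gives the index LABEL of the smallest n within each group
labels = df.groupby('animal')['n'].idxmin()

# .loc selects by label, so it picks the right rows even though
# positions 10 and 30 do not exist in this four-row frame
smallest = df.loc[labels]
print(smallest)
```

With this index, position-based selection through `df.iloc[labels]` would raise an out-of-bounds error, which is why the label-based form is used.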
Here is how to use numpy to transform your data a (the array from the question) into a 2D table:
>>> import numpy as np
>>> unqr, invr = np.unique(a[:, 0], return_inverse=True)
>>> unqc, invc = np.unique(a[:, 1], return_inverse=True)
# initialize with nans in case there are missing values
# these are then treated correctly by nanmean etc.:
>>> out = np.full((unqr.size, unqc.size), np.nan)
>>> out[invr, invc] = a[:, 2]
>>>
# now we have a table
>>> out
array([[137., 168., 154.],
[182., 141., 175.],
[125., 143., 128.],
[109., 167., 167.]])
# with rows
>>> unqr
array(['Burgundy Bichon Frise', 'Pumpkin Pomeranian', 'Purple Puffin',
'Wisteria Wombat'], dtype='<U21')
# and columns
>>> unqc
array(['1', '2', '3'], dtype='<U21')
>>>
# find the mean for 'Purple Puffin':
>>> np.nanmean(out[unqr.searchsorted('Purple Puffin')])
132.0
# find the max for region '2'
>>> np.nanmax(out[:, unqc.searchsorted('2')])
168.0
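Once the table exists, the statistics for all animals (or all regions) can be taken in one go by reducing along an axis, rather than looking up one row at a time. A self-contained sketch that rebuilds the table as above; the result names such as `per_animal_mean` are illustrative, and the explicit `astype(float)` is added here only to make the string-to-number conversion visible:

```python
import numpy as np

a = np.array([
    ['Burgundy Bichon Frise', '1', '137'],
    ['Pumpkin Pomeranian', '1', '182'],
    ['Purple Puffin', '1', '125'],
    ['Wisteria Wombat', '1', '109'],
    ['Burgundy Bichon Frise', '2', '168'],
    ['Pumpkin Pomeranian', '2', '141'],
    ['Purple Puffin', '2', '143'],
    ['Wisteria Wombat', '2', '167'],
    ['Burgundy Bichon Frise', '3', '154'],
    ['Pumpkin Pomeranian', '3', '175'],
    ['Purple Puffin', '3', '128'],
    ['Wisteria Wombat', '3', '167'],
])

# Rows are animals, columns are regions; missing cells stay NaN.
unqr, invr = np.unique(a[:, 0], return_inverse=True)
unqc, invc = np.unique(a[:, 1], return_inverse=True)
out = np.full((unqr.size, unqc.size), np.nan)
out[invr, invc] = a[:, 2].astype(float)

# axis=1 reduces across regions, giving one value per animal
per_animal_mean = np.nanmean(out, axis=1)
per_animal_min = np.nanmin(out, axis=1)
per_animal_max = np.nanmax(out, axis=1)

# axis=0 reduces across animals, giving one value per region
per_region_mean = np.nanmean(out, axis=0)

for name, lo, hi, mean in zip(unqr, per_animal_min,
                              per_animal_max, per_animal_mean):
    print(name, lo, hi, mean)
```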