pandas count over multiple columns
I have a dataframe that looks like this:
Measure1 Measure2 Measure3 ...
0 1 3
1 3 2
3 0
I want to count the occurrences of each value across all of the columns, to produce:
Measure Count Percentage
0 2 0.25
1 2 0.25
2 1 0.125
3 3 0.375
With
outcome_measure_count = cdss_data.groupby(key_columns=['Measure1'],operations={'count': agg.COUNT()}).sort('count', ascending=True)
I only get the counts for the first column (I'm actually using the graphlab package, but I would prefer pandas).
Can anyone help me?
You can generate the counts by flattening the df with ravel and then calling value_counts; from the result you can construct the final df:
In [230]:
import io
import pandas as pd
t="""Measure1 Measure2 Measure3
0 1 3
1 3 2
3 0 0"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df
Out[230]:
Measure1 Measure2 Measure3
0 0 1 3
1 1 3 2
2 3 0 0
In [240]:
count = pd.Series(df.values.ravel()).value_counts()
pd.DataFrame({'Measure': count.index, 'Count':count.values, 'Percentage':(count/count.sum()).values})
Out[240]:
Count Measure Percentage
0 3 3 0.333333
1 3 0 0.333333
2 2 1 0.222222
3 1 2 0.111111
I inserted a 0 just to make the df shape valid, so the counts differ from yours, but you should get the idea.
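If the empty cell should stay missing rather than be filled with 0, the same flatten-and-count idea still works, because value_counts skips NaN by default. The following is only a sketch under that assumption: the NaN in the last row of Measure3 and the use of to_numpy() and sort_index() are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, with the empty cell as NaN
df = pd.DataFrame({
    "Measure1": [0, 1, 3],
    "Measure2": [1, 3, 0],
    "Measure3": [3, 2, np.nan],
})

# Flatten every cell into one 1-D array, then count each distinct value.
# value_counts() drops NaN by default, so the missing cell is not counted.
counts = pd.Series(df.to_numpy().ravel()).value_counts().sort_index()

result = pd.DataFrame({
    "Measure": counts.index,
    "Count": counts.to_numpy(),
    "Percentage": (counts / counts.sum()).to_numpy(),
})
print(result)
```

Because the NaN forces a float array, the Measure values come out as floats (0.0, 1.0, ...); cast them with astype(int) if integers are preferred.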
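Alternatively, apply value_counts to each column and sum the results across columns; this also copes with the missing value:
In [67]: import numpy as np; from pandas import DataFrame, Series
In [68]: df=DataFrame({'m1':[0,1,3], 'm2':[1,3,0], 'm3':[3,2, np.nan]})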
In [69]: df
Out[69]:
m1 m2 m3
0 0 1 3.0
1 1 3 2.0
2 3 0 NaN
In [70]: df=df.apply(Series.value_counts).sum(1).to_frame(name='Count')
In [71]: df
Out[71]:
Count
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [72]: df.index.name='Measure'
In [73]: df
Out[73]:
Count
Measure
0.0 2.0
1.0 2.0
2.0 1.0
3.0 3.0
In [74]: df['Percentage']=df.Count.div(df.Count.sum())
In [75]: df
Out[75]:
Count Percentage
Measure
0.0 2.0 0.250
1.0 2.0 0.250
2.0 1.0 0.125
3.0 3.0 0.375
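The session above leaves both the index and the counts as floats, because the NaN promotes the data to float. A possible variant that keeps integers is sketched below; it swaps in melt() and dropna() instead of the per-column apply, which is a substitution for illustration rather than the method shown above, and it assumes all non-missing values are whole numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"m1": [0, 1, 3], "m2": [1, 3, 0], "m3": [3, 2, np.nan]})

# melt() stacks all columns into a single "value" column;
# dropna() removes the missing cell so the rest can be cast back to int.
values = df.melt()["value"].dropna().astype(int)

summary = (
    values.value_counts()      # occurrences of each distinct value
    .sort_index()              # order rows by the measured value
    .rename_axis("Measure")    # name the index like the desired output
    .to_frame(name="Count")
)
summary["Percentage"] = summary["Count"] / summary["Count"].sum()
print(summary)
```

Calling summary.reset_index() afterwards turns Measure into an ordinary column, matching the layout asked for in the question.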