Python Pandas 中分类变量的描述性统计

Question

我在 Python Pandas:

中有一个分类变量数据框

import pandas as pd
cat_df = pd.DataFrame({'item':['bed',  'lamp', 'candle', 'chair', 'bed',  'candle', 'lamp'],
                  'location' :['home', 'home', 'home',   'home',  'home', 'home',   'home' ],
                  'status'   :['new',  'used', 'used',   'new',   'new',  'used',   'new' ]})

cat_df = cat_df.astype('category')
print(cat_df.dtypes)

cat_df.describe().transpose()

我需要一个显示所有分类变量值以及频率和百分比的输出

# item     bed    2 28.57
# item     lamp   2 28.57
# item     candle 2 28.57
# item     chair  1 14.28
# location home   7 100
# status   new    4 57.14
# status   used   3 42.85

真实的数据集有很多变量，所以我需要一个可扩展的解决方案。

Answer 1

使用DataFrame.melt with GroupBy.size for counts and then divide by sum to new column by GroupBy.transform:

df1 = cat_df.melt(value_name='v', var_name='c')
df = df1.groupby(['c', 'v']).size().reset_index(name='count')
df['perc'] = df['count'].div(df.groupby('c')['count'].transform('sum')).mul(100)
print (df)
          c       v  count        perc
0      item     bed      2   28.571429
1      item  candle      2   28.571429
2      item   chair      1   14.285714
3      item    lamp      2   28.571429
4  location    home      7  100.000000
5    status     new      4   57.142857
6    status    used      3   42.857143

Python Pandas 中分类变量的描述性统计

Descriptive statistics for categorical variables in Python Pandas

python

pandas

categorical-data

data-wrangling