How to aggregate group metrics and plot data with pandas

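I want a pie chart that compares the age groups of the survivors. The problem is that I don't know how to count the people who fall in the same age group. As you can see at the bottom of the screenshot, the result has 142 rows. However, there are 891 people in the dataset.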

import pandas as pd
import seaborn as sns  # for test data only

# load test data from seaborn
df_t = sns.load_dataset('titanic')

# capitalize the column headers to match code used below
df_t.columns = df_t.columns.str.title()

dft = df_t.groupby(['Age', 'Survived']).size().reset_index(name='count')

def get_num_people_by_age_category(dft):
    dft["age_group"] = pd.cut(x=dft['Age'], bins=[0,18,60,100], labels=["young","middle_aged","old"])
    return dft

# Call function
dft = get_num_people_by_age_category(dft)
print(dft)

Output

Calling df_t.groupby(['Age', 'Survived']).size().reset_index(name='count') creates a dataframe with one row for each combination of exact age and survival status, not one row per passenger.
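As a minimal sketch of why the row count shrinks, the snippet below uses a small made-up dataframe (the `toy` data is hypothetical, not the Titanic dataset). Grouping by exact age gives one row per distinct (Age, Survived) pair, while the `count` column still adds up to the number of people.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic passengers
toy = pd.DataFrame({
    "Age": [4, 4, 30, 30, 30, 70],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# One row per distinct (Age, Survived) pair, not one row per passenger
per_age = toy.groupby(["Age", "Survived"]).size().reset_index(name="count")

print(len(toy))                # number of passengers
print(len(per_age))            # number of distinct (Age, Survived) pairs
print(per_age["count"].sum())  # the counts still cover every passenger
```

Binning these per-age rows afterwards only labels each row with a group; the counts would still need to be summed per group, which is why it is simpler to bin the original rows first.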

To get the count for each age group, add an 'age_group' column to the original dataframe. In the next step, groupby can use that 'age_group' column.

from matplotlib import pyplot as plt
import seaborn as sns  # to load the titanic dataset
import pandas as pd

df_t = sns.load_dataset('titanic')
df_t["age_group"] = pd.cut(x=df_t['age'], bins=[0, 18, 60, 100], labels=["young", "middle aged", "old"])

df_per_age = df_t.groupby(['age_group', 'survived']).size().reset_index(name='count')
labels = [f'{age_group},\n {"survived" if survived == 1 else "not survived"}'
          for age_group, survived in df_per_age[['age_group', 'survived']].values]
labels[-1] = labels[-1].replace('\n', ' ') # remove newline for the last items as the wedges are too thin
labels[-2] = labels[-2].replace('\n', ' ')
plt.pie(df_per_age['count'], labels=labels)
plt.tight_layout()
plt.show()

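  • The answer from @JohanC works well for a pie chart.
  • I think the data is better presented as a bar chart, so here is an alternative using pandas.DataFrame.plot with kind='bar'.
  • Use pandas.crosstab to reshape the data; it creates a frequency table (cross tabulation) of the two factors.
  • Optionally use matplotlib.pyplot.bar_label to add annotations to the bars.
    • See this answer for more details on this method.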
import pandas as pd
import seaborn as sns

# load data
df = sns.load_dataset('titanic')
df.columns = df.columns.str.title()

# map 0 and 1 of Survived to a string
df.Survived = df.Survived.map({0: 'Died', 1: 'Survived'})

# bin the age
df['Age Group'] = pd.cut(x=df['Age'], bins=[0, 18, 60, 100], labels=['Young', 'Middle Aged', 'Senior'])

# Calculate the counts
ct = pd.crosstab(df['Survived'], df['Age Group'])

# display(ct)
# Age Group  Young  Middle Aged  Senior
# Survived
# Died          69          338      17
# Survived      70          215       5

# plot
ax = ct.plot(kind='bar', rot=0, xlabel='')

# optionally add annotations
for c in ax.containers:
    ax.bar_label(c, label_type='edge')
    
# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)
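
The crosstab counts shown above add up to 714 rather than 891, which is consistent with passengers whose age is missing being left out. The sketch below illustrates that behaviour on a small made-up dataframe (the `toy` data is hypothetical): pd.cut uses right-inclusive bins by default, so an age of 18 lands in the first group, and a missing age gets no group and is therefore not counted by pd.crosstab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one passenger with a missing age
toy = pd.DataFrame({
    "Age": [5, 18, 19, 60, 61, np.nan],
    "Survived": ["Died", "Survived", "Died", "Survived", "Died", "Survived"],
})

# Bins are right-inclusive by default: (0, 18], (18, 60], (60, 100]
toy["Age Group"] = pd.cut(
    toy["Age"], bins=[0, 18, 60, 100], labels=["Young", "Middle Aged", "Senior"]
)

# Rows with a missing Age Group are dropped from the frequency table
ct = pd.crosstab(toy["Survived"], toy["Age Group"])

print(toy)
print(ct)
print(ct.to_numpy().sum())  # total number of people that were counted
```

If the passengers with unknown age should also appear in the chart, one option is to fill the missing group with an explicit label before counting; whether that is appropriate depends on what the chart is meant to show.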