How to visualize categorical frequency difference
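Data: the diabetes dataset found here: https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv
Objective: I want to investigate how many people under the age of 30 have diabetes, denoted by a 1 or 0 in the "Outcome" column of the dataset, and plot it to see whether there is a class imbalance (are there more 1s, more 0s, or about equal?).
Method:
- Filter my dataset like so: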
ages_under30 = data.loc[data.Age < 30].loc[:,["Age"]]
outcome_under30 = data.loc[data.Age < 30].loc[:,["Outcome"]]
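This successfully returns everyone under 30 together with their outcome (0 or 1).
- I want to plot these points to see what the class representation looks like. Are certain ages more likely to have diabetes? The X axis is "ages_under30" and the Y axis is "outcome_under30".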
plt.grid()
plt.xlabel("Age")
plt.ylabel("Diabetic?")
plt.plot(ages_under30, outcome_under30, "o")
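This plot is where I need help. You really can't make heads or tails of it. There is a class imbalance in this age group: 312 samples do not have diabetes, while only 84 do. How can I adjust the plot to better depict this class imbalance?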
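- The difference in 'Outcome' for each 'Age' is most easily seen with a bar plot of counts. This can be done directly with seaborn.countplot, or by calculating the counts in pandas and plotting with pandas.DataFrame.plot.
- Tested in python 3.8.12, pandas 1.3.3, matplotlib 3.4.3, seaborn 0.11.2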
Data and Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# data
df = pd.read_csv('https://raw.githubusercontent.com/LahiruTjay/Machine-Learning-With-Python/master/datasets/diabetes.csv')
# filter for less than 30
u30 = df[df.Age.lt(30)]
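Before plotting, the overall imbalance quoted in the question can be checked numerically. The following is only a minimal sketch: `sample` is a small hypothetical stand-in for `u30` (not the real data), used so the snippet runs without network access.

```python
import pandas as pd

# hypothetical stand-in for u30; swap in the filtered dataframe from above
sample = pd.DataFrame({
    'Age': [21, 21, 22, 22, 25, 25, 29, 29],
    'Outcome': [0, 0, 0, 1, 0, 1, 1, 0],
})

# absolute number of rows in each class
counts = sample['Outcome'].value_counts()

# share of each class, which makes the imbalance easier to compare
shares = sample['Outcome'].value_counts(normalize=True)

print(counts)
print(shares)
```

Run against the real `u30`, the counts should match the 312 and 84 quoted in the question.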
Using seaborn.countplot
- Shows the counts of observations in each categorical bin using bars.
- This can also be done with seaborn.catplot and kind='count' to create a figure-level plot.
sns.countplot(data=u30, x='Age', hue='Outcome')
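The figure-level variant mentioned above might look like the sketch below. `sample` is again a hypothetical stand-in for `u30`, and the `height` and `aspect` values are arbitrary choices for illustration, not part of the original answer.

```python
import pandas as pd
import seaborn as sns

# hypothetical stand-in for u30; swap in the filtered dataframe from above
sample = pd.DataFrame({
    'Age': [21, 21, 22, 22, 25, 25, 29, 29],
    'Outcome': [0, 0, 0, 1, 0, 1, 1, 0],
})

# catplot is figure-level: it creates its own figure and returns a FacetGrid
g = sns.catplot(data=sample, x='Age', hue='Outcome', kind='count',
                height=4, aspect=1.5)

# label the axes through the FacetGrid rather than through plt
g.set_axis_labels('Age', 'Count')
```

Because catplot returns a FacetGrid rather than an Axes, the figure size is controlled with `height` and `aspect` instead of `figsize`.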
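Using pandas.crosstab and pandas.DataFrame.plot
- Use .crosstab to compute a frequency table between 'Age' and 'Outcome'.
- This can also be done with groupby, but the dataframe then needs further manipulation before it can be plotted.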
# reshape the dataframe
ct = pd.crosstab(u30.Age, u30.Outcome)
# plot
ct.plot(kind='bar', rot=0)
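For comparison, the groupby route mentioned above could be sketched as follows. This is one possible way to do the extra reshaping, not necessarily what the answer had in mind, and `sample` is a hypothetical stand-in for `u30`.

```python
import pandas as pd

# hypothetical stand-in for u30; swap in the filtered dataframe from above
sample = pd.DataFrame({
    'Age': [21, 21, 22, 22, 25, 25, 29, 29],
    'Outcome': [0, 0, 0, 1, 0, 1, 1, 0],
})

# count rows per (Age, Outcome) pair; the result is a Series with a MultiIndex
sizes = sample.groupby(['Age', 'Outcome']).size()

# the extra step: pivot Outcome into columns, filling missing pairs with 0
gb = sizes.unstack(fill_value=0)

# crosstab builds the same table in one call
ct = pd.crosstab(sample['Age'], sample['Outcome'])

# plot exactly as with the crosstab result
ax = gb.plot(kind='bar', rot=0)
ax.set_ylabel('Count')
```

The `unstack` call is the additional manipulation: without it, the grouped counts stay in long form and would be drawn as a single series of bars instead of one bar per 'Outcome' at each 'Age'.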
Data
- In case the data at the GitHub link is no longer available
Age,Outcome
21,0
26,1
29,0
27,0
29,1
22,0
28,1
22,0
28,0
27,1
26,0
25,1
29,0
22,0
24,0
22,0
26,0
21,0
22,0
21,0
24,0
25,0
27,0
28,1
26,0
23,0
22,0
22,0
27,0
26,1
24,0
22,0
22,0
22,0
27,0
26,0
24,0
21,0
21,0
24,0
22,0
23,0
22,0
21,0
24,0
27,0
21,0
27,0
25,0
24,1
24,1
23,0
25,0
25,0
22,0
21,0
25,1
24,0
23,0
23,1
26,1
23,0
26,0
21,0
22,0
29,0
28,0
22,0
23,0
21,0
22,0
24,0
23,0
21,0
23,0
22,0
27,0
21,0
22,0
29,0
29,0
29,1
25,0
23,0
26,1
23,0
21,0
27,0
25,1
21,0
29,1
21,0
23,1
26,1
29,1
21,0
28,0
27,0
27,0
21,0
25,0
24,0
24,1
25,1
21,1
26,0
22,0
26,0
24,1
24,0
22,1
22,0
29,0
23,0
26,1
23,1
27,0
21,0
22,0
22,1
29,0
23,0
23,0
27,0
24,0
25,0
21,1
25,0
24,0
27,1
24,0
25,1
24,0
21,0
28,1
21,0
21,0
25,0
29,1
23,0
22,0
28,1
29,1
26,0
21,0
25,1
24,1
28,0
29,1
24,0
25,1
28,1
29,0
21,0
25,1
22,0
27,1
25,0
26,0
29,1
28,0
25,1
21,0
24,0
23,1
25,0
22,0
26,0
22,0
22,0
22,0
23,0
26,0
29,0
24,0
21,0
28,1
29,1
29,1
29,1
21,0
22,0
25,1
21,0
21,0
25,0
28,0
22,0
22,0
24,0
22,0
21,0
25,0
25,0
24,0
28,0
27,1
21,0
25,0
22,1
25,0
25,1
26,0
25,0
28,1
28,0
25,0
22,0
21,0
21,1
22,1
22,0
27,0
28,1
26,0
21,0
21,0
21,0
25,0
26,0
23,0
22,0
29,0
29,1
28,0
21,0
22,0
24,0
25,1
28,0
26,0
22,1
26,0
23,0
23,1
25,0
24,0
24,0
26,0
21,0
22,0
25,0
27,0
28,0
22,0
22,0
24,0
29,1
29,0
28,0
23,0
24,1
21,0
28,0
24,0
22,0
25,0
21,0
28,0
21,0
21,0
21,0
22,0
24,0
28,1
25,0
26,0
26,0
24,0
21,0
21,0
24,0
22,0
22,0
24,0
29,0
24,0
23,1
23,0
27,1
25,0
29,0
28,0
21,0
25,0
23,0
28,0
28,1
24,0
27,0
22,0
21,0
21,0
22,0
22,0
23,0
25,0
21,1
21,1
27,0
22,0
29,0
25,0
24,0
25,0
22,1
21,0
26,0
24,0
28,0
21,0
22,1
25,0
27,0
23,0
24,0
26,0
27,0
23,0
24,1
28,0
28,0
21,0
21,0
29,0
21,0
21,0
21,0
24,0
23,0
22,0
23,0
28,0
27,0
24,0
27,0
22,1
23,0
23,0
27,0
28,0
27,0
22,0
25,1
22,0
27,1
22,1
24,0
21,0
22,0
25,0
25,1
23,0
22,0
26,1
22,0
27,1
25,0
22,0
29,0
23,0
23,0
25,0
22,0
28,0
26,0
26,0
27,0
28,0
22,0
23,1
24,0
21,0
24,0
21,0
25,0
22,0
22,0
22,0
22,1
24,1
22,0
28,0
21,0
21,0
26,0
22,0
27,1
22,1
28,0
25,0
26,1
26,0
22,0
27,0
23,0