Plot "stacked" density distributions of variables, categorized by 0 or 1, in Python

Question

I have the following dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)), columns=['Var_1', 'Var_2', 'Var_3', 'Var_4', 'Var_5', 'Var_6'])
df['Status'] = np.random.randint(0, 2, size=100)
df

Out[1]: 
    Var_1  Var_2  Var_3  Var_4  Var_5  Var_6  Status
0      32     65     48     83     60     21       1
1      44     49     65     84     52     34       1
2       9      2      3     14     82     80       1
3      66     90     97     60     28     12       0
4      28     95     64     53     39     30       1
..    ...    ...    ...    ...    ...    ...     ...
95     22      4     43      9     79     46       1
96     10     26     91     59     99     93       0
97     10     31     33     15     99     25       1
98     41     48     80     65     58     18       1
99     39     42     22     56     91     40       1

[100 rows x 7 columns]

How can I create a "stacked" density plot for each variable, split by Status (0 or 1)? I would like the plot to look like this:

This plot was created in R. The Python version does not have to look exactly the same. What code can I use to do this? Thanks.

Answer 1

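Here is an adaptation of seaborn's ridgeplot example to the given structure. multiple='stack' is chosen in sns.kdeplot (the default, multiple='layer', draws every curve starting from y=0). Note that common_norm defaults to True, which scales each curve down in proportion to its group's share of the samples.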
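
To make the combination of multiple='stack' and common_norm=True more concrete, here is a rough numerical sketch using only numpy. It does not reproduce seaborn's internal code: the helper gaussian_kde_1d, the fixed bandwidth of 0.4 and the two toy groups are assumptions made purely for illustration. The idea is that each group's density is weighted by its share of the samples, and the second curve is then drawn on top of the first instead of from y=0.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Hypothetical helper: plain Gaussian KDE evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

# Toy data (assumption): 300 samples with Status 0, 100 samples with Status 1
rng = np.random.default_rng(0)
group0 = rng.normal(0.0, 1.0, 300)
group1 = rng.normal(5.0, 1.0, 100)

grid = np.linspace(-6, 11, 2001)
dx = grid[1] - grid[0]

# Each density on its own integrates to about 1
dens0 = gaussian_kde_1d(group0, grid, bandwidth=0.4)
dens1 = gaussian_kde_1d(group1, grid, bandwidth=0.4)

# Common normalization: weight each curve by its group's share of all samples
n_total = len(group0) + len(group1)
weighted0 = dens0 * len(group0) / n_total
weighted1 = dens1 * len(group1) / n_total

# Stacking: the second curve starts on top of the first one, not at y=0
stack_bottom = weighted0
stack_top = weighted0 + weighted1

# Riemann-sum approximations of the areas
area0 = weighted0.sum() * dx
area1 = weighted1.sum() * dx
total = stack_top.sum() * dx
print(area0, area1, total)  # close to 0.75, 0.25 and 1.0
```

With 300 versus 100 samples, the lower band takes about three quarters of the total area and the upper band about one quarter, which is why the smaller group looks "scaled down" in the stacked plot.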

Because seaborn works with data in "long form", pd.melt() is used to convert the given dataframe. The long form looks like this:

      Status variable      value
0          0    Var 1  -0.961877
1          1    Var 1   6.454942
2          0    Var 1   6.020015
3          0    Var 1   7.094057
4          0    Var 1  10.289022
      ...      ...        ...
2995       0    Var 6  -5.718156
2996       0    Var 6  -5.142314
2997       0    Var 6  -5.155104
2998       1    Var 6   3.339401
2999       1    Var 6   7.912669
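
As a small sketch of the same conversion applied to a frame shaped like the one in the question (100 rows, columns Var_1 to Var_6 plus Status): the seed and the use of np.random.default_rng are assumptions made here for reproducibility, they are not part of the original post.

```python
import numpy as np
import pandas as pd

# Rebuild a frame with the same layout as in the question (seed is an assumption)
rng = np.random.default_rng(0)
cols = [f"Var_{i}" for i in range(1, 7)]
df = pd.DataFrame(rng.integers(0, 100, size=(100, 6)), columns=cols)
df["Status"] = rng.integers(0, 2, size=len(df))

# Wide -> long: one row per (original row, variable) pair, Status kept as identifier
df_long = df.melt(id_vars="Status", value_vars=cols)

print(df_long.shape)             # (600, 3)
print(df_long.columns.tolist())  # ['Status', 'variable', 'value']
```

The default column names 'variable' and 'value' produced by melt are the ones the FacetGrid code refers to with row="variable" and "value".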

Here is a complete code example:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})

# Create the data
rs = np.random.RandomState(1979)
data = rs.randn(30, 100).cumsum(axis=1).reshape(-1, 6)
column_names = [f'Var {i}' for i in range(1, 7)]
df = pd.DataFrame(data, columns=column_names)
df['Status'] = rs.randint(0, 2, len(df))
for col in column_names:
    df.loc[df['Status'] == 1, col] += 5
df_long = df.melt(id_vars='Status', value_vars=column_names)

# Initialize the FacetGrid object
g = sns.FacetGrid(data=df_long, row="variable", aspect=6, height=1.8)

# Draw the densities
g.map_dataframe(sns.kdeplot, "value",
                bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5,
                hue="Status", hue_order=[0, 1], palette=['tomato', 'turquoise'], multiple='stack')
g.map(plt.axhline, y=0, lw=2, clip_on=False, color='black')

# Define and use a simple function to label the plot in axes coordinates
def label(x, color):
    ax = plt.gca()
    ax.text(0, .2, x.iloc[0], fontweight="bold", color='black',
            ha="left", va="center", transform=ax.transAxes)

g.map(label, "variable")

# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="")
g.despine(bottom=True, left=True)
plt.show()

python

distribution

matplotlib