Grouping several columns with the same category into one table in pandas

Question

I have a dataset like this:

Feature Name	Description	Data Type
customerID	Contains customer ID	unique ID, categorical, nominal
OnlineSecurity	Whether the customer has online security or not (Yes, No, No internet service)	categorical, nominal
OnlineBackup	Whether the customer has online backup or not (Yes, No, No internet service)	categorical, nominal
DeviceProtection	Whether the customer has device protection or not (Yes, No, No internet service)	categorical, nominal
TechSupport	Whether the customer has tech support or not (Yes, No, No internet service)	categorical, nominal
StreamingTV	Whether the customer has streaming TV or not (Yes, No, No internet service)	categorical, nominal
StreamingMovies	Whether the customer has streaming movies or not (Yes, No, No internet service)	categorical, nominal
Contract	The contract term of the customer (Month-to-month, One year, Two year)	categorical, nominal
PaperlessBilling	Whether the customer has paperless billing or not (Yes, No)	categorical, nominal
PaymentMethod	The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card)	categorical, nominal
MonthlyCharges	The amount charged to the customer monthly	numeric, float
TotalCharges	The total amount charged to the customer	numeric, float
Churn	Whether the customer churned or not (Yes or No)	categorical, nominal

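The dataset is from Kaggle.

As you can see, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies all share the same categories: ["Yes", "No", "No internet service"]. I want to group all of these columns and get an expected result like this: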

	Yes	No	No internet service
OnlineSecurity	3497	1520	2015
OnlineBackup	3497	1520	2015
DeviceProtection	3497	1520	2015
TechSupport	3497	1520	2015
StreamingTV	3497	1520	2015
StreamingMovies	3497	1520	2015

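The numbers in the table above are just placeholder values. I want the table to count how many times each category occurs in each column.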

Answer 1

I couldn't find the table you posted above at the link, but I assume you have it. I copied it into a metadata file.

import pandas as pd

# load the data into df and metadata
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')

Next, you need to get the columns that share the same categories.

# regex=False matches the text as a literal substring
mask = metadata['Description'].str.contains('Yes, No, No internet service', regex=False)
cols = metadata.loc[mask, 'Feature Name'].tolist()

Here we select the rows whose Description contains "Yes, No, No internet service", which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
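To try the filtering step without the metadata file, here is a minimal sketch that builds a small metadata frame inline. The three rows are copied from the question's table; the tab separator and the `raw` variable are assumptions made for illustration, not the layout of the actual file.

```python
import io
import pandas as pd

# Hypothetical tab-separated metadata with three rows taken from the question.
raw = (
    "Feature Name\tDescription\tData Type\n"
    "OnlineSecurity\tWhether the customer has online security or not "
    "(Yes, No, No internet service)\tcategorical, nominal\n"
    "TechSupport\tWhether the customer has tech support or not "
    "(Yes, No, No internet service)\tcategorical, nominal\n"
    "PaperlessBilling\tWhether the customer has paperless billing or not "
    "(Yes, No)\tcategorical, nominal\n"
)
metadata = pd.read_csv(io.StringIO(raw), sep="\t")

# Keep only the features whose description lists all three categories.
mask = metadata["Description"].str.contains(
    "Yes, No, No internet service", regex=False
)
cols = metadata.loc[mask, "Feature Name"].tolist()
print(cols)
```

PaperlessBilling is dropped because its description only lists "Yes, No".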

Then I melt df on the selected columns, group the result with groupby, and count the values.

results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
)

This gives you the output:

	No	No internet service	Yes
DeviceProtection	3095	1526	2422
OnlineBackup	3088	1526	2429
OnlineSecurity	3498	1526	2019
StreamingMovies	2785	1526	2732
StreamingTV	2810	1526	2707
TechSupport	3473	1526	2044
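As a possible shorter alternative to melt plus groupby, the same kind of table can be sketched by applying `value_counts` to each column and transposing. The `toy` frame below is invented data standing in for the Telco dataset, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Telco data (values invented for illustration).
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],
    "TechSupport": ["No", "No", "No internet service", "Yes"],
})
cols = ["OnlineSecurity", "TechSupport"]

# Count each category per column, then transpose so features become rows.
# fillna(0) covers categories that never occur in a given column.
counts = toy[cols].apply(pd.Series.value_counts).T.fillna(0).astype(int)
print(counts)
```

This keeps the feature names as the index rather than as an unnamed column, which may be easier to work with afterwards.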

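(A comment below asked for a total column.)

results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
    # numeric_only=True skips the column of feature names when summing each row
    .assign(total = lambda x: x.sum(axis=1, numeric_only=True))
)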
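Another way to get totals, sketched here on invented data, is `pd.crosstab` with `margins=True`, which adds both a total column and a total row. The `toy` frame and the name "total" are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Telco data (values invented for illustration).
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],
    "TechSupport": ["No", "No", "No internet service", "Yes"],
})

# Reshape to long form: one row per (feature, category) observation.
long = toy.melt(value_vars=["OnlineSecurity", "TechSupport"])

# Cross-tabulate feature against category; margins adds row and column totals.
table = pd.crosstab(
    long["variable"], long["value"], margins=True, margins_name="total"
)
print(table)
```

If only the per-row total is wanted, the final "total" row can be dropped with `table.drop(index="total")`.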

python

count

pandas

categorical-data

pandas-groupby