Grouping several columns with the same category into one table in pandas

I have a dataset like this:

Feature Name | Description | Data Type
customerID | Contains customer ID | unique ID, categorical, nominal
OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) | categorical, nominal
OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) | categorical, nominal
DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) | categorical, nominal
TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) | categorical, nominal
StreamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) | categorical, nominal
StreamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) | categorical, nominal
Contract | The contract term of the customer (Month-to-month, One year, Two year) | categorical, nominal
PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) | categorical, nominal
PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card) | categorical, nominal
MonthlyCharges | The amount charged to the customer monthly | numeric, float
TotalCharges | The total amount charged to the customer | numeric, float
Churn | Whether the customer churned or not (Yes or No) | categorical, nominal

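The dataset is from Kaggle.

As you can see, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies all share the same categories: ["Yes", "No", "No internet service"]. I want to group all of these columns and get a result like this: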

                  Yes   No    No internet service
OnlineSecurity    3497  1520  2015
OnlineBackup      3497  1520  2015
DeviceProtection  3497  1520  2015
TechSupport       3497  1520  2015
StreamingTV       3497  1520  2015
StreamingMovies   3497  1520  2015

The numbers in the table above are just placeholder values. I want it to count how many times each category occurs in each column.

I couldn't find the table you posted above at the link, but I assume you got it from somewhere. I copied it into a metadata file.

import pandas as pd

# load the data into df and the feature descriptions into metadata
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')

Next, you need to get the columns that share the same categories.

cols = metadata.loc[metadata['Description'].str.contains('Yes, No, No internet service')]['Feature Name'].tolist()

Here we keep the rows whose Description contains "Yes, No, No internet service" and take their Feature Name, which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
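If you don't have a metadata file, one alternative is to pick the columns from the data itself by comparing each column's distinct values against the target set. This is only a sketch: the small frame below is made-up toy data standing in for the Telco CSV, and `target` is a name I introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the Telco data (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "TechSupport": ["No", "No internet service", "Yes"],
    "PaperlessBilling": ["Yes", "No", "Yes"],
})

# The categories a column must have to be selected.
target = {"Yes", "No", "No internet service"}

# Keep the columns whose set of distinct (non-null) values equals the target set.
cols = [c for c in df.columns if set(df[c].dropna().unique()) == target]
print(cols)
```

On the real data this would skip any column that happens to be missing one of the three categories, so the metadata approach is the safer one if you have the descriptions.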

Then I melt df on the selected columns, group by the column name and the value with groupby, and count the values.

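results = (df
    # long format: one row per (column name, value) pair
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    # count how often each value occurs in each column
    .agg({'value': 'count'})
    # move the categories into the columns
    .unstack()
    .reset_index()
    # drop the outer 'value' column level left over from agg
    .droplevel(level=0, axis=1)
)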

This gives you the output:

                  No    No internet service  Yes
DeviceProtection  3095  1526                 2422
OnlineBackup      3088  1526                 2429
OnlineSecurity    3498  1526                 2019
StreamingMovies   2785  1526                 2732
StreamingTV       2810  1526                 2707
TechSupport       3473  1526                 2044
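For what it's worth, the same kind of table can probably be built without melting, by applying value_counts to each selected column and transposing. A minimal sketch on made-up toy data (the frame and its values are hypothetical, not the Telco dataset):

```python
import pandas as pd

# Toy stand-in for the Telco data (hypothetical values).
df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No", "No internet service"],
    "TechSupport": ["Yes", "Yes", "No", "No internet service"],
})
cols = ["OnlineSecurity", "TechSupport"]

counts = (
    df[cols]
    .apply(lambda s: s.value_counts())  # one column of counts per feature
    .T                                  # features as rows, categories as columns
    .fillna(0)                          # a category missing from a column counts as 0
    .astype(int)
)
print(counts)
```

The result has the feature names as the index and one column per category, so it lines up with the melt/groupby version above.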

(A comment below asked for a total column.)

results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
    # numeric_only=True skips the text column holding the feature names
    .assign(total = lambda x: x.sum(axis=1, numeric_only=True))
)