Grouping several columns with the same category into one table in pandas
I have a dataset like this:
Feature Name | Description | Data Type |
---|---|---|
customerID | Contains customer ID | unique ID, categorical, nominal |
OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) | categorical, nominal |
OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) | categorical, nominal |
DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) | categorical, nominal |
TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) | categorical, nominal |
StreamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) | categorical, nominal |
StreamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) | categorical, nominal |
Contract | The contract term of the customer (Month-to-month, One year, Two year) | categorical, nominal |
PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) | categorical, nominal |
PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card) | categorical, nominal |
MonthlyCharges | The amount charged to the customer monthly | numeric, float |
TotalCharges | The total amount charged to the customer | numeric, float |
Churn | Whether the customer churned or not (Yes or No) | categorical, nominal |
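The dataset is from Kaggle.
As you can see, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies share the same categories: ["Yes", "No", "No internet service"]. I want to group all of these columns and get an expected result like this: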
Column | Yes | No | No internet service |
---|---|---|---|
OnlineSecurity | 3497 | 1520 | 2015 |
OnlineBackup | 3497 | 1520 | 2015 |
DeviceProtection | 3497 | 1520 | 2015 |
TechSupport | 3497 | 1520 | 2015 |
StreamingTV | 3497 | 1520 | 2015 |
StreamingMovies | 3497 | 1520 | 2015 |
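The numbers in the table above are just random values; I want to count how many times each category value occurs in each column.
I couldn't find the table you posted above at the link, but I assume you already have it. I copied it into a metadata file.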
import pandas as pd

# load the data into df and metadata
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
metadata = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_meta.csv')
Then you need to get the columns that share the same categories.
cols = metadata.loc[metadata['Description'].str.contains('Yes, No, No internet service')]['Feature Name'].tolist()
Here we look for the rows whose Description contains Yes, No, No internet service, which gives us: ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
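To see the filter in isolation, here is a minimal sketch on a made-up three-row metadata frame (the frame itself is an assumption, standing in for the real metadata file). It also passes `regex=False`, which is not in the line above, so the pattern is matched as a plain substring.

```python
import pandas as pd

# Hypothetical stand-in for the metadata file; only three features are included.
metadata = pd.DataFrame({
    "Feature Name": ["OnlineSecurity", "Contract", "StreamingTV"],
    "Description": [
        "Whether the customer has online security or not (Yes, No, No internet service)",
        "The contract term of the customer (Month-to-month, One year, Two year)",
        "Whether the customer has streaming TV or not (Yes, No, No internet service)",
    ],
})

# regex=False treats the pattern as a literal substring, not a regular expression.
mask = metadata["Description"].str.contains("Yes, No, No internet service", regex=False)

# Keep only the feature names whose description lists the three shared categories.
cols = metadata.loc[mask, "Feature Name"].tolist()
print(cols)  # ['OnlineSecurity', 'StreamingTV']
```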
Then I chose to melt df on the selected columns, group the result with groupby, and count the values.
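results = (df
    .melt(value_vars=cols)            # long format: one row per (variable, value)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})          # count rows per column/category pair
    .unstack()                        # categories become columns
    .reset_index()
    .droplevel(level=0, axis=1)       # drop the outer column level
)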
This gives you the output:
Column | No | No internet service | Yes |
---|---|---|---|
DeviceProtection | 3095 | 1526 | 2422 |
OnlineBackup | 3088 | 1526 | 2429 |
OnlineSecurity | 3498 | 1526 | 2019 |
StreamingMovies | 2785 | 1526 | 2732 |
StreamingTV | 2810 | 1526 | 2707 |
TechSupport | 3473 | 1526 | 2044 |
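If you want to check the melt-and-count idea without the CSV, the following sketch runs it on a tiny made-up frame (the four rows are invented for illustration). It uses `.size().unstack(fill_value=0)` instead of the `.agg(...)` chain above, which is a slightly shorter variant of the same idea and not the exact code from this answer.

```python
import pandas as pd

# Made-up data: two of the service columns, four customers.
df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],
    "TechSupport": ["Yes", "Yes", "No internet service", "No"],
})
cols = ["OnlineSecurity", "TechSupport"]

counts = (
    df.melt(value_vars=cols)            # columns: 'variable' (column name), 'value' (category)
      .groupby(["variable", "value"])
      .size()                           # number of rows per (column, category) pair
      .unstack(fill_value=0)            # one row per column, one column per category
)
print(counts)
```

Each row of `counts` is one original column and each column is one category, which is the layout asked for in the question.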
(A comment below asked for a total column.)
results = (df
    .melt(value_vars=cols)
    .groupby(['variable', 'value'])
    .agg({'value': 'count'})
    .unstack()
    .reset_index()
    .droplevel(level=0, axis=1)
    # numeric_only=True skips the column of feature names when summing each row
    .assign(total=lambda x: x.sum(axis=1, numeric_only=True))
)
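As a quick sanity check on the total column, here is a sketch on made-up data that builds the counts with `pd.crosstab` (a swapped-in alternative to the groupby chain, not the method used above) and confirms that each row total equals the number of rows in the frame. That only holds when the selected columns have no missing values, which is an assumption of this toy example.

```python
import pandas as pd

# Made-up data: two service columns, four customers, no missing values.
df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],
    "TechSupport": ["Yes", "Yes", "No internet service", "No"],
})
cols = ["OnlineSecurity", "TechSupport"]

long = df.melt(value_vars=cols)

# crosstab counts how often each (column name, category) pair occurs.
counts = pd.crosstab(long["variable"], long["value"])

# All columns are numeric here, so a plain row-wise sum gives the total.
counts["total"] = counts.sum(axis=1)
print(counts)
```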