Collapse 3 Text Columns into 1 within a wide Pandas DataFrame

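I have a dataset in which one type of data is spread across multiple columns. I would like to reduce these to a single column. I have a function that accomplishes this, but it is a cumbersome process and I am hoping there is a cleaner way to do it. Here is a toy sample of my data: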

UID    COMPANY    EML    MAI   TEL
273    7UP        nan    nan   TEL
273    7UP        nan    MAI   nan
906    WSJ        nan    nan   TEL
906    WSJ        EML    nan   nan
736    AIG        nan    MAI   nan

Where I want to get:

UID    COMPANY   CONTACT_INFO
273    7UP       MT
906    WSJ       ET
736    AIG       M

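I have solved this by writing a function that converts EML, MAI, and TEL to prime numbers, aggregates the result, and then converts the sum back into its constituent contact types. This works, and it is fairly fast. Here is a sample: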

import numpy as np
import pandas as pd

def columnRedux(df):
    newDF = df.copy()
    newDF.fillna('-', inplace=True)
    newDF['CONTACT_INFO'] = newDF['EML'] + newDF['MAI'] + newDF['TEL']
    newDF.replace('EML--', 7, inplace=True)
    newDF.replace('-MAI-', 101, inplace=True)
    newDF.replace('--TEL', 1009, inplace=True)

    small = newDF.groupby(['UID', 'COMPANY'], as_index=False)['CONTACT_INFO'].sum()

    small.replace(7, 'E', inplace=True)
    small.replace(101, 'M', inplace=True)
    small.replace(108, 'EM', inplace=True)
    small.replace(1009, 'T', inplace=True)
    small.replace(1016, 'ET', inplace=True)
    small.replace(1110, 'MT', inplace=True)
    small.replace(1117, 'EMT', inplace=True)

    return small

df1 = pd.DataFrame(
    {'EML' : [np.nan, np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', 'EML', np.nan],
    'MAI' : [np.nan, 'MAI', np.nan, np.nan, 'MAI', np.nan, np.nan, np.nan, 'MAI', np.nan, np.nan, 'MAI'],
    'COMPANY' : ['7UP', '7UP', 'UPS', 'UPS', 'UPS', 'WSJ', 'WSJ', 'TJX', 'AIG', 'CDW', 'HEB', 'HEB'],
    'TEL' : ['TEL', np.nan, 'TEL', np.nan, np.nan, 'TEL', np.nan, 'TEL', np.nan, np.nan, np.nan, np.nan],
    'UID' : [273, 273, 865, 865, 865, 906, 906, 736, 316, 458, 531, 531]},
    columns=['UID', 'COMPANY', 'EML', 'MAI', 'TEL'])

cleanDF = columnRedux(df1)

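My problem is that I have several datasets, each with its own set of "wide" columns. Some have 5 or more columns to reduce. Hard-coding the conversions for all the variations is not trivial. Is there a cleaner way to accomplish this?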

Maybe not the "nicest" solution, but one way is to use a simple groupby and keep only the non-null elements of each group:

df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
    .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
    .reset_index()\
    .rename(columns={0:'CONTACT_INFO'})

Another way is to convert the grouped DataFrame to str type, replace the strings, and sum. I would say this is very readable.

m = {
    'nan':'',
    'EML':'E',
    'MAI':'M',
    'TEL':'T'
}

df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
       .apply(lambda x: x.astype(str).replace(m).sum().sum())\
       .reset_index()\
       .rename(columns={0:'CONTACT_INFO'})

Full example:

import io

import pandas as pd

data = '''\
UID    COMPANY    EML    MAI   TEL
273    7UP        nan    nan   TEL
273    7UP        nan    MAI   nan
906    WSJ        nan    nan   TEL
906    WSJ        EML    nan   nan
736    AIG        nan    MAI   nan'''

# pd.compat.StringIO is gone from current pandas; io.StringIO does the same job.
fileobj = io.StringIO(data)
# read_csv already parses the literal 'nan' as a missing value.
df = pd.read_csv(fileobj, sep=r'\s+')

# use a nested list comprehension to flatten the array and remove nans.
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
    .apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
    .reset_index()\
    .rename(columns={0:'CONTACT_INFO'})

print(df)

Returns:

   UID COMPANY CONTACT_INFO
0  273     7UP           MT
1  736     AIG            M
2  906     WSJ           ET
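
Since the question mentions several datasets, each with its own set of wide columns, the same groupby idea can be wrapped in a small helper that takes the key columns and the wide columns as arguments. This is only a sketch: the function name collapse_presence and its labels and out parameters are made up for illustration, and it assumes a cell counts as "present" whenever it is not null.

```python
import numpy as np
import pandas as pd


def collapse_presence(df, keys, cols, labels=None, out='CONTACT_INFO'):
    """Collapse presence/absence columns into one code string per group.

    keys   -- columns identifying a group, e.g. ['UID', 'COMPANY']
    cols   -- wide columns to collapse, e.g. ['EML', 'MAI', 'TEL']
    labels -- optional {column: code}; defaults to each column's first letter
    out    -- name of the resulting column (hypothetical default)
    """
    if labels is None:
        labels = {c: c[0] for c in cols}
    # True where a group has at least one non-null value in the column.
    present = df[cols].notna().groupby([df[k] for k in keys], sort=False).any()
    # Join the codes of the columns that are present, in the order of `cols`.
    codes = present.apply(
        lambda row: ''.join(labels[c] for c in cols if row[c]), axis=1
    )
    return codes.reset_index(name=out)


toy = pd.DataFrame({
    'UID': [273, 273, 906, 906, 736],
    'COMPANY': ['7UP', '7UP', 'WSJ', 'WSJ', 'AIG'],
    'EML': [np.nan, np.nan, np.nan, 'EML', np.nan],
    'MAI': [np.nan, 'MAI', np.nan, np.nan, 'MAI'],
    'TEL': ['TEL', np.nan, 'TEL', np.nan, np.nan],
})
result = collapse_presence(toy, ['UID', 'COMPANY'], ['EML', 'MAI', 'TEL'])
print(result)
```

Because the helper tests for presence with any() instead of concatenating, a group with two TEL rows still yields a single T. Pass labels explicitly when two columns share a first letter.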

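Let's try this:

# sum(level=...) was removed in pandas 2.0; groupby(level=..., sort=False).sum() is the equivalent.
(df1.set_index(['UID','COMPANY']).notnull() * df1.columns[2:].str[0])\
.groupby(level=[0,1], sort=False).sum().sum(axis=1).reset_index(name='CONTACT_INFO')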

Output:

   UID COMPANY CONTACT_INFO
0  273     7UP           MT
1  865     UPS          EMT
2  906     WSJ           ET
3  736     TJX            T
4  316     AIG            M
5  458     CDW            E
6  531     HEB           EM

Broken down for @AntonvBR:

df2 = df1.set_index(['UID','COMPANY'])
df_out  = ((df2.notnull() * df2.columns.str[0])  # keep each column's initial where a value exists
           .groupby(level=[0,1], sort=False).sum()  # consolidate rows of contact info to one line
           .sum(axis=1)  # sum across columns to create one column
           .reset_index(name='CONTACT_INFO'))
print(df_out)

Output:

   UID COMPANY CONTACT_INFO
0  273     7UP           MT
1  865     UPS          EMT
2  906     WSJ           ET
3  736     TJX            T
4  316     AIG            M
5  458     CDW            E
6  531     HEB           EM
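
The breakdown above works because multiplying a string by a boolean keeps the string for True and gives an empty string for False, and because "summing" strings concatenates them. A plain-Python sketch of the same three steps for the two 7UP rows, with the boolean lists standing in for the result of notnull():

```python
columns = ['EML', 'MAI', 'TEL']
initials = [name[0] for name in columns]  # ['E', 'M', 'T']

# Two rows of one group (UID 273 / 7UP): True where the cell is not null.
rows = [
    [False, False, True],   # only TEL present
    [False, True, False],   # only MAI present
]

# Step 1: bool * str keeps the letter for True and gives '' for False.
masked = [[flag * letter for flag, letter in zip(row, initials)] for row in rows]

# Step 2: "sum" down each column by concatenating the rows of the group.
per_column = [''.join(col) for col in zip(*masked)]

# Step 3: "sum" across the columns to get one code for the group.
contact_info = ''.join(per_column)
print(contact_info)  # MT
```

One consequence of concatenating is that a contact type appearing in two rows of the same group would show up twice in the result.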

Using groupby first, followed by dot:

s = df.groupby(['UID', 'COMPANY'], as_index=False).first()

Then create the new column:

s['CONTACT_INFO'] = s[['EML', 'MAI', 'TEL']].notnull().dot(s.columns[2:].str[0])
s.dropna(axis=1)
Out[349]: 
   UID COMPANY CONTACT_INFO
0  273     7UP           MT
1  736     AIG            M
2  906     WSJ           ET
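
A note on the last step: dropping columns by missing values only removes EML, MAI, and TEL if each of them still has at least one gap after grouping. A hedged variant, assuming the same toy data, drops the wide columns by name instead and builds the code with a row-wise join in place of dot. The cols list is just an illustrative name here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'UID': [273, 273, 906, 906, 736],
    'COMPANY': ['7UP', '7UP', 'WSJ', 'WSJ', 'AIG'],
    'EML': [np.nan, np.nan, np.nan, 'EML', np.nan],
    'MAI': [np.nan, 'MAI', np.nan, np.nan, 'MAI'],
    'TEL': ['TEL', np.nan, 'TEL', np.nan, np.nan],
})
cols = ['EML', 'MAI', 'TEL']  # the wide columns to collapse

# first() keeps the first non-null value of each column within a group.
s = df.groupby(['UID', 'COMPANY'], as_index=False).first()

# Join the initials of the columns that hold a value in each grouped row.
s['CONTACT_INFO'] = s[cols].notna().apply(
    lambda row: ''.join(c[0] for c in cols if row[c]), axis=1
)

# Drop the wide columns explicitly, whether or not they contain gaps.
s = s.drop(columns=cols)
print(s)
```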