在宽 Pandas DataFrame 中将 3 个文本列折叠为 1 个
Collapse 3 Text Colums into 1 within a wide Pandas DataFrame
我有一个数据集,其中一种数据类型分布在多个列中。我想将这些减少到一个列。我有一个函数可以完成此操作,但这是一个繁琐的过程,我希望有一种更简洁的方法来完成此操作。这是我的数据的玩具样本:
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan
我想去的地方:
UID COMPANY CONTACT_INFO
273 7UP MT
906 WSJ ET
736 AIG M
我已经通过编写一个将 EML
、MAI
或 TEL
转换为素数、聚合结果然后将总和转换为组成联系人类型的函数来解决此问题.这有效,而且相当快。这是一个示例:
def columnRedux(df):
newDF = df.copy()
newDF.fillna('-', inplace=True)
newDF['CONTACT_INFO'] = newDF['EML'] + newDF['MAI'] + newDF['TEL']
newDF.replace('EML--', 7, inplace=True)
newDF.replace('-MAI-', 101, inplace=True)
newDF.replace('--TEL', 1009, inplace=True)
small = newDF.groupby(['UID', 'COMPANY'], as_index=False)['CONTACT_INFO'].sum()
small.replace(7, 'E', inplace=True)
small.replace(101, 'M', inplace=True)
small.replace(108, 'EM', inplace=True)
small.replace(1009, 'T', inplace=True)
small.replace(1016, 'ET', inplace=True)
small.replace(1110, 'MT', inplace=True)
small.replace(1117, 'EMT', inplace=True)
return small
df1 = pd.DataFrame(
{'EML' : [np.nan, np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', 'EML', np.nan],
'MAI' : [np.nan, 'MAI', np.nan, np.nan, 'MAI', np.nan, np.nan, np.nan, 'MAI', np.nan, np.nan, 'MAI'],
'COMPANY' : ['7UP', '7UP', 'UPS', 'UPS', 'UPS', 'WSJ', 'WSJ', 'TJX', 'AIG', 'CDW', 'HEB', 'HEB'],
'TEL' : ['TEL', np.nan, 'TEL', np.nan, np.nan, 'TEL', np.nan, 'TEL', np.nan, np.nan, np.nan, np.nan],
'UID' : [273, 273, 865, 865, 865, 906, 906, 736, 316, 458, 531, 531]},
columns=['UID', 'COMPANY', 'EML', 'MAI', 'TEL'])
cleanDF = columnRedux(df1)
我的问题是我有几个数据集,每个数据集都有自己的一组 "wide" 列。有些有 5 个以上的列要减少。对所有变体的转换进行硬编码并非易事。有没有更简洁的方法来完成此操作?
也许不是 "nicest" 解决方案。但是一种方法是使用一个简单的 groupby 并对包含的元素进行条件处理:
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
或者另一种方法是将分组的数据帧转换为 str 类型并替换字符串和总和。我会说非常可读。
m = {
'nan':'',
'EML':'E',
'MAI':'M',
'TEL':'T'
}
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: x.astype(str).replace(m).sum().sum())\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
完整示例:
import pandas as pd
import numpy as np
data = '''\
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan'''
fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep='\s+').replace('NaN',np.nan)
# use a nested list comprehension to flatten the array and remove nans.
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
print(df)
Returns:
UID COMPANY CONTACT_INFO
273 7UP MT
736 AIG M
906 WSJ ET
dtype: object
让我们试试这个:
(df1.set_index(['UID','COMPANY']).notnull() * df1.columns[2:].str[0])\
.sum(level=[0,1]).sum(1).reset_index(name='CONTACT_INFO')
输出:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 865 UPS EMT
2 906 WSJ ET
3 736 TJX T
4 316 AIG M
5 458 CDW E
6 531 HEB EM
为@AntonvBR 分手:
df2 = df1.set_index(['UID','COMPANY'])
df_out = ((df2.notnull() * df2.columns.str[0])
.sum(level=[0,1]) #consolidate rows of contact info to one line
.sum(1) #sum across columns to create one column
.reset_index(name='CONTACT_INFO'))
print(df_out)
输出:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 865 UPS EMT
2 906 WSJ ET
3 736 TJX T
4 316 AIG M
5 458 CDW E
6 531 HEB EM
通过使用 dot
在 groupby
first
之后创建新列
s=df.groupby(['UID','COMPANY'],as_index=False).first()
s['CONTACT_INFO']=s[['EML','MAI','TEL']].notnull().dot(s.columns[2:].str[0])
s.dropna(1)
Out[349]:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 736 AIG M
2 906 WSJ ET
我有一个数据集,其中一种数据类型分布在多个列中。我想将这些减少到一个列。我有一个函数可以完成此操作,但这是一个繁琐的过程,我希望有一种更简洁的方法来完成此操作。这是我的数据的玩具样本:
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan
我想去的地方:
UID COMPANY CONTACT_INFO
273 7UP MT
906 WSJ ET
736 AIG M
我已经通过编写一个将 EML
、MAI
或 TEL
转换为素数、聚合结果然后将总和转换为组成联系人类型的函数来解决此问题.这有效,而且相当快。这是一个示例:
def columnRedux(df):
newDF = df.copy()
newDF.fillna('-', inplace=True)
newDF['CONTACT_INFO'] = newDF['EML'] + newDF['MAI'] + newDF['TEL']
newDF.replace('EML--', 7, inplace=True)
newDF.replace('-MAI-', 101, inplace=True)
newDF.replace('--TEL', 1009, inplace=True)
small = newDF.groupby(['UID', 'COMPANY'], as_index=False)['CONTACT_INFO'].sum()
small.replace(7, 'E', inplace=True)
small.replace(101, 'M', inplace=True)
small.replace(108, 'EM', inplace=True)
small.replace(1009, 'T', inplace=True)
small.replace(1016, 'ET', inplace=True)
small.replace(1110, 'MT', inplace=True)
small.replace(1117, 'EMT', inplace=True)
return small
df1 = pd.DataFrame(
{'EML' : [np.nan, np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', np.nan, np.nan, 'EML', 'EML', np.nan],
'MAI' : [np.nan, 'MAI', np.nan, np.nan, 'MAI', np.nan, np.nan, np.nan, 'MAI', np.nan, np.nan, 'MAI'],
'COMPANY' : ['7UP', '7UP', 'UPS', 'UPS', 'UPS', 'WSJ', 'WSJ', 'TJX', 'AIG', 'CDW', 'HEB', 'HEB'],
'TEL' : ['TEL', np.nan, 'TEL', np.nan, np.nan, 'TEL', np.nan, 'TEL', np.nan, np.nan, np.nan, np.nan],
'UID' : [273, 273, 865, 865, 865, 906, 906, 736, 316, 458, 531, 531]},
columns=['UID', 'COMPANY', 'EML', 'MAI', 'TEL'])
cleanDF = columnRedux(df1)
我的问题是我有几个数据集,每个数据集都有自己的一组 "wide" 列。有些有 5 个以上的列要减少。对所有变体的转换进行硬编码并非易事。有没有更简洁的方法来完成此操作?
也许不是 "nicest" 解决方案。但是一种方法是使用一个简单的 groupby 并对包含的元素进行条件处理:
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
或者另一种方法是将分组的数据帧转换为 str 类型并替换字符串和总和。我会说非常可读。
m = {
'nan':'',
'EML':'E',
'MAI':'M',
'TEL':'T'
}
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: x.astype(str).replace(m).sum().sum())\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
完整示例:
import pandas as pd
import numpy as np
data = '''\
UID COMPANY EML MAI TEL
273 7UP nan nan TEL
273 7UP nan MAI nan
906 WSJ nan nan TEL
906 WSJ EML nan nan
736 AIG nan MAI nan'''
fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep='\s+').replace('NaN',np.nan)
# use a nested list comprehension to flatten the array and remove nans.
df = df.groupby(['UID','COMPANY'])[['EML','MAI','TEL']]\
.apply(lambda x: ''.join(sorted([i[0] for y in x.values for i in y if pd.notnull(i)])))\
.reset_index()\
.rename(columns={0:'CONTACT_INFO'})
print(df)
Returns:
UID COMPANY CONTACT_INFO
273 7UP MT
736 AIG M
906 WSJ ET
dtype: object
让我们试试这个:
(df1.set_index(['UID','COMPANY']).notnull() * df1.columns[2:].str[0])\
.sum(level=[0,1]).sum(1).reset_index(name='CONTACT_INFO')
输出:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 865 UPS EMT
2 906 WSJ ET
3 736 TJX T
4 316 AIG M
5 458 CDW E
6 531 HEB EM
为@AntonvBR 分手:
df2 = df1.set_index(['UID','COMPANY'])
df_out = ((df2.notnull() * df2.columns.str[0])
.sum(level=[0,1]) #consolidate rows of contact info to one line
.sum(1) #sum across columns to create one column
.reset_index(name='CONTACT_INFO'))
print(df_out)
输出:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 865 UPS EMT
2 906 WSJ ET
3 736 TJX T
4 316 AIG M
5 458 CDW E
6 531 HEB EM
通过使用 dot
在 groupby
first
s=df.groupby(['UID','COMPANY'],as_index=False).first()
s['CONTACT_INFO']=s[['EML','MAI','TEL']].notnull().dot(s.columns[2:].str[0])
s.dropna(1)
Out[349]:
UID COMPANY CONTACT_INFO
0 273 7UP MT
1 736 AIG M
2 906 WSJ ET