How to merge columns and delete duplicates but keep unique values?

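I want to merge columns based on the same ID, so that all rows for an ID are combined into a single row. Can anyone help me merge the columns for both the duplicated and the non-duplicated values?

Given: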

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       NaN          NaN
01      Kathy    Biology      NaN          Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       NaN          NaN
03      Henry    Business     NaN          Math100      NaN
03      Henry    Business     NaN          NaN          Acct100

Expected output:

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       Math100      Acct100

I tried:

df = df.groupby(['Name','Degree','ID'])['AM_Class', 'PM_Class', 'Online_Class'].apply(', '.join).reset_index()

but it does not seem to work.

If you need the first non-missing value per group, use GroupBy.first:

df = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100
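The point that GroupBy.first picks the first non-missing value in each column, rather than the first row, may be easier to see next to GroupBy.head(1). The sketch below is self-contained and rebuilds the question's data; the variable names `first_per_id` and `first_row_per_id` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": ["01", "01", "02", "03", "03", "03"],
    "Name": ["Kathy", "Kathy", "James", "Henry", "Henry", "Henry"],
    "Degree": ["Biology", "Biology", "Chemistry", "Business", "Business", "Business"],
    "AM_Class": ["Bio101", np.nan, np.nan, "Bus100", np.nan, np.nan],
    "PM_Class": [np.nan, "Chem101", "Chem101", np.nan, "Math100", np.nan],
    "Online_Class": [np.nan, np.nan, np.nan, np.nan, np.nan, "Acct100"],
})

# first(): first non-missing value of every column within each group
first_per_id = df.groupby(["ID", "Name", "Degree"], as_index=False).first()
print(first_per_id)

# head(1): simply the first row of each group, missing values included
first_row_per_id = df.groupby("ID").head(1)
print(first_row_per_id)
```

With head(1), Kathy's PM_Class stays missing because her first row has no PM class, while first() fills it from her second row.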

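Or, if you need all unique values per group without missing values, use a custom lambda function in GroupBy.agg, which processes each column separately: Series.dropna removes the missing values, dict.fromkeys removes duplicates while keeping the original order, and ', '.join concatenates what is left. A group with no values produces an empty string, which the final replace turns back into NaN: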

f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
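To see what the lambda does to a single column of a single group, here is a small sketch that spells out each step. The helper name `join_unique` and the sample Series are made up for illustration; they are not part of the answer above.

```python
import numpy as np
import pandas as pd


def join_unique(values: pd.Series) -> str:
    """Join the distinct non-missing values of one group, in order of appearance."""
    present = values.dropna()                   # drop NaN entries
    unique_in_order = dict.fromkeys(present)    # dict keys are unique and keep insertion order
    return ", ".join(unique_in_order)           # iterating a dict yields its keys


# One group's column with a repeated value and a missing value
with_duplicates = join_unique(pd.Series(["Math100", np.nan, "Math200", "Math100"]))
print(with_duplicates)

# A group whose values are all missing gives an empty string,
# which is why the answer follows up with .replace('', np.nan)
all_missing = join_unique(pd.Series([np.nan, np.nan], dtype=object))
print(repr(all_missing))
```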

The two approaches can give different results; see this changed data:

print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101      NaN          NaN
1  01  Kathy    Biology      NaN  Chem101          NaN
2  02  James  Chemistry      NaN  Chem101          NaN
3  03  Henry   Business   Bus100      NaN          NaN
4  03  Henry   Business      NaN  Math100      Acct100
5  03  Henry   Business      NaN  Math200      Acct100

df1 = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df1)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100


f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df2 = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
print (df2)
   ID   Name     Degree AM_Class          PM_Class Online_Class
0  01  Kathy    Biology   Bio101           Chem101          NaN
1  02  James  Chemistry      NaN           Chem101          NaN
2  03  Henry   Business   Bus100  Math100, Math200      Acct100

Here is your data:

df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

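You can split the DataFrame into separate frames, drop the NaN values, and then join them back together.

The reduce() function lets you perform the merges iteratively instead of merging the DataFrames one by one.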

from functools import reduce

import pandas as pd

# Separate the data frames and remove the rows with NaNs.
# dropna() returns new frames, so no slice of df is modified in place
# and the name df is not rebound by a loop variable.
df_student = df[['ID', 'Name', 'Degree']].dropna()
df_AM = df[['ID', 'Name', 'AM_Class']].dropna()
df_PM = df[['ID', 'Name', 'PM_Class']].dropna()
df_OL = df[['ID', 'Name', 'Online_Class']].dropna()

# List of data frames
dfs = [df_student, df_AM, df_PM, df_OL]

# Merge dataframes without the NaNs
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=['ID', 'Name']), dfs)


    ID  Name    Degree      AM_Class    PM_Class    Online_Class
0   01  Kathy   Biology     Bio101      Chem101     NaN
1   01  Kathy   Biology     Bio101      Chem101     NaN
2   02  James   Chemistry   NaN         Chem101     NaN
3   03  Henry   Business    Bus100      Math100     Acct100
4   03  Henry   Business    Bus100      Math100     Acct100
5   03  Henry   Business    Bus100      Math100     Acct100

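Then you just need to drop the duplicates. Note that drop_duplicates(inplace=True) returns None, so assign the result instead of chaining onto it.

df_merged = df_merged.drop_duplicates().reset_index(drop=True)

Here is the result: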

     ID Name    Degree      AM_Class    PM_Class    Online_Class
0    01 Kathy   Biology     Bio101      Chem101     NaN
1    02 James   Chemistry   NaN         Chem101     NaN
2    03 Henry   Business    Bus100      Math100     Acct100

You can first ffill the rows within each ID, then drop duplicates while keeping the last occurrence:

df.groupby(['ID']).ffill().drop_duplicates(subset='Name', keep='last')
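One thing to watch with the one-liner: in the pandas versions I have used, GroupBy.ffill returns only the non-grouping columns, so ID may be missing from the result. The sketch below is one possible variant that keeps ID by forward-filling just the class columns and writing them back; `class_cols`, `filled`, and `result` are illustrative names, and the data is rebuilt from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": ["01", "01", "02", "03", "03", "03"],
    "Name": ["Kathy", "Kathy", "James", "Henry", "Henry", "Henry"],
    "Degree": ["Biology", "Biology", "Chemistry", "Business", "Business", "Business"],
    "AM_Class": ["Bio101", np.nan, np.nan, "Bus100", np.nan, np.nan],
    "PM_Class": [np.nan, "Chem101", "Chem101", np.nan, "Math100", np.nan],
    "Online_Class": [np.nan, np.nan, np.nan, np.nan, np.nan, "Acct100"],
})

class_cols = ["AM_Class", "PM_Class", "Online_Class"]

# Forward-fill only the class columns within each ID, leaving ID/Name/Degree untouched
filled = df.copy()
filled[class_cols] = df.groupby("ID")[class_cols].ffill()

# After the fill, the last row of each ID carries every value seen so far
result = filled.drop_duplicates(subset="ID", keep="last").reset_index(drop=True)
print(result)
```

Unlike GroupBy.first, this keeps the last non-missing value per column when an ID has more than one value in the same column.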

We can solve this with pandas pivot_table. Your data looks like this:

>>> data = {'Name': ['Kathy','Kathy','James','Henry','Henry','Henry'],
...         'Degree': ['Biology','Biology','Chemistry','Business','Business','Business'],
...         'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
...         'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
...         'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
...        }
>>> df = pd.DataFrame(data)

>>> print(df)

 Name     Degree AM_Class PM_Class Online_Class
0  Kathy    Biology   Bio101      NaN          NaN
1  Kathy    Biology      NaN  Chem101          NaN
2  James  Chemistry      NaN  Chem101          NaN
3  Henry   Business   Bus100      NaN          NaN
4  Henry   Business      NaN  Math100          NaN
5  Henry   Business      NaN      NaN      Acct100

First we can replace all NaN values with empty strings:

>>> df.fillna('', inplace=True)

>>> print(df)

    Name     Degree AM_Class PM_Class Online_Class
0  Kathy    Biology   Bio101                      
1  Kathy    Biology           Chem101             
2  James  Chemistry           Chem101             
3  Henry   Business   Bus100                      
4  Henry   Business           Math100             
5  Henry   Business                        Acct100

I do this because I want to use np.sum in pivot_table to concatenate the strings in each pandas Series. Using np.nan as-is would raise an exception.

Now build the pivot table, with Name as the group-by column.
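As a rough illustration of why the empty strings matter, the sketch below applies np.sum to plain NumPy object arrays rather than to the pivot table itself. This is an assumption-laden simplification: it only shows the underlying behavior that summing strings concatenates them, while adding a string to a float NaN fails. The array names are made up.

```python
import numpy as np

# Summing an object array of strings applies "+" pairwise, i.e. concatenation
filled = np.array(["", "Chem101", ""], dtype=object)
concatenated = np.sum(filled)
print(repr(concatenated))

# With NaN (a float) left in place, "Bio101" + nan is str + float, which is a TypeError
with_nan = np.array(["Bio101", np.nan], dtype=object)
try:
    np.sum(with_nan)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```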

>>> df2 = pd.pivot_table(data=df, index=['Name'], aggfunc={'Degree':np.unique, 'AM_Class':np.sum, 'PM_Class':np.sum, 'Online_Class':np.sum})

>>> print(df2)

AM_Class     Degree Online_Class PM_Class
Name                                           
Henry   Bus100   Business      Acct100  Math100
James           Chemistry               Chem101
Kathy   Bio101    Biology               Chem101

We have to replace the empty strings with np.nan, since that is the requested format.

>>> df2.replace('', np.nan, inplace=True)

>>> print(df2)

AM_Class     Degree Online_Class PM_Class
Name                                           
Henry   Bus100   Business      Acct100  Math100
James      NaN  Chemistry          NaN  Chem101
Kathy   Bio101    Biology          NaN  Chem101

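Looking at the new DataFrame df2, we still have to make the following changes:

  • Since the Name column has become the index, we have to create a Name column again
  • Add a RangeIndex
  • Restore the column order

>>> df2['Name'] = df2.index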

>>> cols = [ 'Name', 'Degree', 'AM_Class',  'PM_Class', 'Online_Class']

>>> df2 = df2[cols]

>>> print(df2)

 Name     Degree AM_Class PM_Class Online_Class
Name                                                  
Henry  Henry   Business   Bus100  Math100      Acct100
James  James  Chemistry      NaN  Chem101          NaN
Kathy  Kathy    Biology   Bio101  Chem101          NaN

>>> df2.set_index(pd.RangeIndex(start=0,stop=3,step=1), inplace=True)

>>> print(df2)

 Name     Degree AM_Class PM_Class Online_Class
0  Henry   Business   Bus100  Math100      Acct100
1  James  Chemistry      NaN  Chem101          NaN
2  Kathy    Biology   Bio101  Chem101          NaN
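The three clean-up steps can probably be shortened with DataFrame.reset_index, which moves a named index back into a column and installs a fresh RangeIndex in one call. The sketch below builds a stand-in for the pivoted frame by hand (the name `pivoted` and the hard-coded values are only for illustration) and then restores the column order.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the pivot_table result: Name is the index,
# and the columns are in alphabetical order
pivoted = pd.DataFrame(
    {
        "AM_Class": ["Bus100", np.nan, "Bio101"],
        "Degree": ["Business", "Chemistry", "Biology"],
        "Online_Class": ["Acct100", np.nan, np.nan],
        "PM_Class": ["Math100", "Chem101", "Chem101"],
    },
    index=pd.Index(["Henry", "James", "Kathy"], name="Name"),
)

cols = ["Name", "Degree", "AM_Class", "PM_Class", "Online_Class"]

# reset_index() turns the Name index into a column and adds a 0..n-1 index;
# indexing with cols restores the original column order
tidy = pivoted.reset_index()[cols]
print(tidy)
```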

See my alternative solution below.

import pandas as pd, numpy as np
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

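# merge duplicates per (Name, Degree, ID), then reset the index;
# only the class columns are aggregated, because also selecting the
# grouping keys would make reset_index() collide with existing columns
df = df.fillna('').groupby(['Name','Degree','ID'])[['AM_Class','PM_Class','Online_Class']].agg(lambda x: ','.join(filter(None, x))).reset_index()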

Output:

    Name    Degree      ID      AM_Class    PM_Class    Online_Class
0   Henry   Business    03      Bus100      Math100     Acct100
1   James   Chemistry   02                  Chem101 
2   Kathy   Biology     01      Bio101      Chem101