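How to merge columns and delete duplicates but keep unique values?
I want to merge columns based on the same ID, so that the rows collapse into a single row per ID. Can anyone help me merge the columns for both duplicated and non-duplicated entries?
Given: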
ID Name Degree AM_Class PM_Class Online_Class
01 Kathy Biology Bio101 NaN NaN
01 Kathy Biology NaN Chem101 NaN
02 James Chemistry NaN Chem101 NaN
03 Henry Business Bus100 NaN NaN
03 Henry Business NaN Math100 NaN
03 Henry Business NaN NaN Acct100
Expected output:
ID Name Degree AM_Class PM_Class Online_Class
01 Kathy Biology Bio101 Chem101 NaN
02 James Chemistry NaN Chem101 NaN
03 Henry Business Bus100 Math100 Acct100
I tried:
df = df.groupby(['Name','Degree','ID'])['AM_Class', 'PM_Class', 'Online_Class'].apply(', '.join).reset_index()
but it seems to raise an error.
If you need the first non-missing value per group, use GroupBy.first:
df = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 None
1 02 James Chemistry None Chem101 None
2 03 Henry Business Bus100 Math100 Acct100
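For anyone who wants to run this directly, here is a minimal self-contained sketch of the GroupBy.first approach on the sample data from the question. The variable name result is just an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
    'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
    'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
    'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
    'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
})

# first() skips missing values, so every column keeps the first
# non-null entry found within each (ID, Name, Degree) group
result = df.groupby(['ID', 'Name', 'Degree'], as_index=False).first()
print(result)
```

Each column is reduced independently, so the values in one output row may come from different input rows of the same group.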
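Or, if you need all unique values per group without missing values, use a custom lambda function in GroupBy.agg to process each column separately: drop missing values with Series.dropna, remove duplicates with dict.fromkeys, and finally join the remaining values with ', ':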
f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
The two approaches can give different results; see the changed data:
print (df)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 NaN NaN
1 01 Kathy Biology NaN Chem101 NaN
2 02 James Chemistry NaN Chem101 NaN
3 03 Henry Business Bus100 NaN NaN
4 03 Henry Business NaN Math100 Acct100
5 03 Henry Business NaN Math200 Acct100
df1 = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df1)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 None
1 02 James Chemistry None Chem101 None
2 03 Henry Business Bus100 Math100 Acct100
f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df2 = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
print (df2)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 02 James Chemistry NaN Chem101 NaN
2 03 Henry Business Bus100 Math100, Math200 Acct100
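The comparison above can be reproduced end to end with the sketch below. It uses the changed data (two PM classes for Henry) and replaces the lambda with a named helper, join_unique, which is a name made up for this example.

```python
import numpy as np
import pandas as pd

# Changed data: Henry has two different PM classes and a repeated online class
df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
    'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
    'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
    'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', 'Math200'],
    'Online_Class': [np.nan, np.nan, np.nan, np.nan, 'Acct100', 'Acct100'],
})

keys = ['ID', 'Name', 'Degree']

def join_unique(values):
    # dropna() removes missing values; dict.fromkeys() removes repeats
    # while preserving the order of first appearance
    return ', '.join(dict.fromkeys(values.dropna()))

# Keeps only the first non-missing value per column
first_only = df.groupby(keys, as_index=False).first()

# Keeps every distinct value; all-missing groups yield '' and are turned back into NaN
all_unique = df.groupby(keys, as_index=False).agg(join_unique).replace('', np.nan)

print(first_only)
print(all_unique)
```

GroupBy.first silently drops Math200, while the join-based version keeps both classes in one cell, so the choice depends on whether a group can hold more than one distinct value per column.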
Here is your data:
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})
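You can split the data frame into separate frames, drop the NaN values, and then join them back together.
The reduce() function lets you perform the merges iteratively, instead of merging the data frames one by one.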
from functools import reduce
# Separate the data frames
df_student = df[['ID', 'Name', 'Degree']]
df_AM = df[['ID', 'Name', 'AM_Class']]
df_PM = df[['ID', 'Name', 'PM_Class']]
df_OL = df[['ID', 'Name', 'Online_Class']]
# List of data frames
dfs = [df_student, df_AM, df_PM, df_OL]
# Remove all NaNs
# (dropna() returns new frames, which avoids modifying slices of df in place)
dfs = [part.dropna() for part in dfs]
# Merge dataframes without the NaNs
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=['ID', 'Name']), dfs)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 01 Kathy Biology Bio101 Chem101 NaN
2 02 James Chemistry NaN Chem101 NaN
3 03 Henry Business Bus100 Math100 Acct100
4 03 Henry Business Bus100 Math100 Acct100
5 03 Henry Business Bus100 Math100 Acct100
Then you just need to drop the duplicates.
df_merged = df_merged.drop_duplicates().reset_index(drop=True)
Here is the result:
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 02 James Chemistry NaN Chem101 NaN
2 03 Henry Business Bus100 Math100 Acct100
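One thing worth checking with the split-and-merge approach is how it behaves when a student has more than one distinct value in a column. The sketch below reruns the same steps on a modified data set (Henry with both Math100 and Math200, an assumption made only for this example); the names keys, parts, merged and result are illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Modified data: Henry has two different PM classes
df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
    'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
    'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
    'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', 'Math200'],
    'Online_Class': [np.nan, np.nan, np.nan, np.nan, 'Acct100', 'Acct100'],
})

keys = ['ID', 'Name']

# One frame per non-key column, with missing rows dropped
parts = [df[keys + [col]].dropna()
         for col in ['Degree', 'AM_Class', 'PM_Class', 'Online_Class']]

# Left-merge the frames one after another on the key columns
merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=keys), parts)

# Identical rows collapse, but distinct values stay on separate rows
result = merged.drop_duplicates().reset_index(drop=True)
print(result)
```

Here Henry ends up with two rows, one per PM class. So this approach gives one row per ID only when each column has at most one distinct value per student.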
You can first forward-fill (ffill) the rows within each group, then drop the duplicates while keeping the last occurrence:
df.groupby(['ID']).ffill().drop_duplicates(subset='Name', keep='last')
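Depending on the pandas version, the result of groupby(...).ffill() may not include the grouping column, so the one-liner above can lose ID. As a hedged variant, the sketch below forward-fills only the class columns and writes them back into a copy of the frame; value_cols, filled and result are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
    'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
    'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
    'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
    'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
})

value_cols = ['AM_Class', 'PM_Class', 'Online_Class']

# Forward-fill within each ID so the last row of a group accumulates
# every value seen earlier in that group
filled = df.copy()
filled[value_cols] = df.groupby('ID')[value_cols].ffill()

# Keep only the last (most complete) row per ID
result = filled.drop_duplicates(subset='ID', keep='last').reset_index(drop=True)
print(result)
```

This relies on the row order: ffill only carries values forward, and if a group holds several different values in one column, only the latest one survives.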
We can solve this with pandas pivot_table.
Your data looks like this:
>>> data = {'Name': ['Kathy','Kathy','James','Henry','Henry','Henry'],
'Degree': ['Biology','Biology','Chemistry','Business','Business','Business'],
'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100'],
}
>>> df = pd.DataFrame(data)
>>> print(df)
Name Degree AM_Class PM_Class Online_Class
0 Kathy Biology Bio101 NaN NaN
1 Kathy Biology NaN Chem101 NaN
2 James Chemistry NaN Chem101 NaN
3 Henry Business Bus100 NaN NaN
4 Henry Business NaN Math100 NaN
5 Henry Business NaN NaN Acct100
First, we can replace all NaN values with empty strings:
>>> df.fillna('', inplace=True)
>>> print(df)
    Name     Degree  AM_Class  PM_Class  Online_Class
0  Kathy    Biology    Bio101
1  Kathy    Biology             Chem101
2  James  Chemistry             Chem101
3  Henry   Business    Bus100
4  Henry   Business             Math100
5  Henry   Business                           Acct100
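I do this because I want to use np.sum as the aggregation function in pivot_table to concatenate the strings in each pandas Series. Leaving np.nan in place would raise an exception.
Now build the pivot table, with Name as the group-by column.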
>>> df2 = pd.pivot_table(data=df, index=['Name'], aggfunc={'Degree':np.unique, 'AM_Class':np.sum, 'PM_Class':np.sum, 'Online_Class':np.sum})
>>> print(df2)
       AM_Class     Degree  Online_Class  PM_Class
Name
Henry    Bus100   Business       Acct100   Math100
James            Chemistry                 Chem101
Kathy    Bio101    Biology                 Chem101
We have to replace the empty strings with np.nan, since that is the requested format.
>>> df2.replace('', np.nan, inplace=True)
>>> print(df2)
AM_Class Degree Online_Class PM_Class
Name
Henry Bus100 Business Acct100 Math100
James NaN Chemistry NaN Chem101
Kathy Bio101 Biology NaN Chem101
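Looking at the new dataframe df2, it seems we have to make the following changes:
- Since the Name column has become the index, we must create a Name column
- Add a RangeIndex
- Restore the column order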
>>> df2['Name'] = df2.index
>>> cols = [ 'Name', 'Degree', 'AM_Class', 'PM_Class', 'Online_Class']
>>> df2 = df2[cols]
>>> print(df2)
Name Degree AM_Class PM_Class Online_Class
Name
Henry Henry Business Bus100 Math100 Acct100
James James Chemistry NaN Chem101 NaN
Kathy Kathy Biology Bio101 Chem101 NaN
>>> df2.set_index(pd.RangeIndex(start=0,stop=3,step=1), inplace=True)
>>> print(df2)
Name Degree AM_Class PM_Class Online_Class
0 Henry Business Bus100 Math100 Acct100
1 James Chemistry NaN Chem101 NaN
2 Kathy Biology Bio101 Chem101 NaN
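As a possible shortcut for the clean-up steps above, DataFrame.reset_index() moves the Name index back into a regular column and installs a fresh RangeIndex in one call, after which only the column order needs restoring. The sketch below builds df2 by hand to mirror the pivoted result shown earlier (an assumption made so the example runs on its own); tidy is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the pivoted frame, indexed by Name
df2 = pd.DataFrame(
    {
        'AM_Class': ['Bus100', np.nan, 'Bio101'],
        'Degree': ['Business', 'Chemistry', 'Biology'],
        'Online_Class': ['Acct100', np.nan, np.nan],
        'PM_Class': ['Math100', 'Chem101', 'Chem101'],
    },
    index=pd.Index(['Henry', 'James', 'Kathy'], name='Name'),
)

cols = ['Name', 'Degree', 'AM_Class', 'PM_Class', 'Online_Class']

# reset_index() turns the Name index into a column and adds a 0..n-1 index;
# selecting cols then restores the original column order
tidy = df2.reset_index()[cols]
print(tidy)
```

This avoids hard-coding the number of rows, which the explicit RangeIndex(start=0, stop=3, step=1) call requires.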
See my alternative solution below.
import pandas as pd, numpy as np
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})
# merge duplicates, reset index
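# (select only the class columns; the group keys return as columns via reset_index)
df = df.fillna('').groupby(['Name','Degree','ID'])[['AM_Class','PM_Class','Online_Class']].agg(lambda x: ','.join(filter(None, x))).reset_index()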
Output:
    Name     Degree  ID  AM_Class  PM_Class  Online_Class
0  Henry   Business  03    Bus100   Math100       Acct100
1  James  Chemistry  02             Chem101
2  Kathy    Biology  01    Bio101   Chem101