How to collapse near duplicates into one row using modified bfill pandas
I have a dataframe like the one shown below
ID,F1,F2,F3,F4,F5,F6,L1,L2,L3,L4,L5,L6
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
2,X,,,X,,X,A,B,C,D,E
3,X,X,X,,,X,A
3,X,X,X,,,X,,B,,C
3,X,X,X,,,X,,D,C
4,X,X,,,,,A,B
4,,,,X,,X,G,H,I
4,,,X,,,,T
df = pd.read_clipboard(sep=',')
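Since `read_clipboard` depends on whatever happens to be on the clipboard, here is a small sketch that loads the same sample from a string instead, so the example can be reproduced. The variable name `raw` is only an illustrative choice; it is not part of the original post.

```python
import io

import pandas as pd

# Same sample data as above, kept in a string (the name `raw` is illustrative).
raw = """ID,F1,F2,F3,F4,F5,F6,L1,L2,L3,L4,L5,L6
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
2,X,,,X,,X,A,B,C,D,E
3,X,X,X,,,X,A
3,X,X,X,,,X,,B,,C
3,X,X,X,,,X,,D,C
4,X,X,,,,,A,B
4,,,,X,,X,G,H,I
4,,,X,,,,T
"""

# Rows with fewer fields than the header are padded with NaN on the right.
df = pd.read_csv(io.StringIO(raw))
print(df.shape)
```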
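I would like to do the following
a) Remove exact duplicates (rows where every column matches), e.g. ID=1 (keep=first)
b) Collapse near duplicates into one row, e.g. ID=3 and ID=4. Near duplicates are rows where only the ID matches but the rest of the F numbered and L numbered columns differ
I am trying the code below, but it produces incorrect output
It does not carry over the other L numbered values, i.e. the ones whose position is not NA in an earlier row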
df = df.drop_duplicates(keep='first') # this drops full duplicates ex:ID = 1
# note the double brackets: recent pandas versions reject a bare tuple of column names here
df.groupby(['ID'])[['ID','F1','F2','F3','F4','F5','F6','L1','L2','L3','L4','L5','L6']].bfill().drop_duplicates(subset=['ID'],keep='first')
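To see why this attempt loses L values, here is a minimal sketch using only the L columns of the three ID=3 rows (the frame `g` is rebuilt by hand for illustration). `bfill` only fills cells that are NaN, so a value in a later row is dropped whenever an earlier row already has something at that position or gets filled from a row in between.

```python
import numpy as np
import pandas as pd

# L1..L4 of the three ID=3 rows from the sample, rebuilt by hand for illustration.
g = pd.DataFrame({
    "ID": [3, 3, 3],
    "L1": ["A", np.nan, np.nan],
    "L2": [np.nan, "B", "D"],
    "L3": [np.nan, np.nan, "C"],
    "L4": [np.nan, "C", np.nan],
})

# Backward-fill inside the group, then look at the row that keep='first' would retain.
filled = g.groupby("ID")[["L1", "L2", "L3", "L4"]].bfill()
first_row = filled.iloc[0].tolist()
print(first_row)
```

The retained row holds A, B, C, C: L2 is filled with B from the second row, so the D in the third row never reaches the first row, and C appears twice.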
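In the real data there are 50 F columns and 50 L columns. For the F columns, the position of X matters and must be kept correct, whereas for the L columns a value can end up in any column as long as it is captured.
I expect my output to look like the one shown below
Use: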
import pandas as pd

# first drop rows that are duplicated across all columns
df = df.drop_duplicates(keep='first')

cL = df.filter(like='L').columns
cF = df.filter(like='F').columns

def f(x):
    # flatten the L values of one group, keep each non-missing value once,
    # and relabel them L1, L2, ... in order of appearance
    s = pd.Series(x.stack().unique()).rename(lambda i: f'L{i + 1}')
    return s

# recreate the L columns with missing values and duplicates removed, one row per ID
df1 = df[cL].groupby(df['ID']).apply(f).unstack()
# remove the original L columns
df = df.drop(cL, axis=1)
# process the F columns with the original solution (bfill keeps the positions)
df[cF] = df.groupby(['ID'])[cF].bfill()
# keep the first row per ID for the F columns, then add the L columns from df1
df = df.drop_duplicates(subset=['ID'],keep='first').join(df1, on='ID')
print (df)
   ID  F1   F2   F3   F4   F5  F6  L1  L2  L3   L4   L5   L6
0   1   X  NaN    X  NaN  NaN   X   A   B   C  NaN  NaN  NaN
3   2   X  NaN  NaN    X  NaN   X   A   B   C    D    E  NaN
4   3   X    X    X  NaN  NaN   X   A   B   C    D  NaN  NaN
7   4   X    X    X    X  NaN   X   A   B   G    H    I    T
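For reference, here is a self-contained sketch of the same idea wrapped in a function. It is a variation on the steps above, not the code from the answer: the function name `collapse_near_duplicates` and its parameters are hypothetical, it uses `GroupBy.first()` (first non-missing value per column) instead of `bfill` for the F columns, and it collects the L values with plain Python instead of `stack`/`unstack`.

```python
import io

import pandas as pd

RAW = """ID,F1,F2,F3,F4,F5,F6,L1,L2,L3,L4,L5,L6
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
2,X,,,X,,X,A,B,C,D,E
3,X,X,X,,,X,A
3,X,X,X,,,X,,B,,C
3,X,X,X,,,X,,D,C
4,X,X,,,,,A,B
4,,,,X,,X,G,H,I
4,,,X,,,,T
"""


def collapse_near_duplicates(df, id_col="ID", f_prefix="F", l_prefix="L"):
    """Hypothetical helper: one row per ID, F positions kept, L values packed left."""
    f_cols = [c for c in df.columns if c.startswith(f_prefix)]
    l_cols = [c for c in df.columns if c.startswith(l_prefix)]

    # F columns: first non-missing value per ID and column, so positions are preserved.
    f_part = df.groupby(id_col)[f_cols].first()

    # L columns: unique non-missing values per ID, read row by row, left to right.
    rows = {}
    for key, grp in df.groupby(id_col):
        flat = pd.Series(grp[l_cols].to_numpy().ravel()).dropna()
        rows[key] = flat.drop_duplicates().tolist()

    # Pad the per-ID lists to a common width and label them L1, L2, ...
    width = max((len(v) for v in rows.values()), default=0)
    new_cols = [f"{l_prefix}{i + 1}" for i in range(width)]
    l_part = pd.DataFrame(
        [v + [None] * (width - len(v)) for v in rows.values()],
        index=pd.Index(list(rows.keys()), name=id_col),
        columns=new_cols,
    )
    return f_part.join(l_part).reset_index()


result = collapse_near_duplicates(pd.read_csv(io.StringIO(RAW)))
print(result)
```

Exact duplicates need no separate step in this variant, because `first()` and the de-duplication of L values absorb them. The values per ID should agree with the table above, although the row index is reset to 0..3 rather than keeping the original labels.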