How do I get rid of duplicate elements in a cell of a dataframe column where an element can be multiple words or a single word?

Question

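I want to remove all duplicate words or groups of words in the cells of the two columns below, keeping a " , " between each word or group of words. I tried a function using return (' , '.join(dict.fromkeys(text.split()))) and then applied this function to each column, but it separated words I did not want separated and added unwanted commas (for example, three four should not be split by a comma). The solution will be applied to more rows in col2 and col3.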

Code:

import pandas as pd

df0 = pd.DataFrame(data={'col1': [123, 123, 123], 'col2': ['one , two , three four', 'two', 'three four'],
                         'col3': ['many numbers , another number', 'number', 'another number , number']})

df0['col2'] = df0.groupby(['col1'])['col2'].transform(lambda x : ' , '.join(x))
df0['col3'] = df0.groupby(['col1'])['col3'].transform(lambda x : ' , '.join(x))
df0 = df0.drop_duplicates()

df0

Current output:

    col1    col2                                            col3
0   123     one , two , three four , two , three four       many numbers , another number , number , another number , number

Desired output:

    col1    col2                        col3
0   123     one , two , three four      many numbers , another number , number

Answer 1

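.transform() preserves the number of rows present in the original groups. Since you appear to be calling .drop_duplicates() on the dataframe to undo that, it is better to use .agg() in the first place.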
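To illustrate the difference in row counts, here is a minimal sketch on a cut-down version of the question's data. The names df, transformed and aggregated are only illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [123, 123, 123],
    'col2': ['one , two , three four', 'two', 'three four'],
})

# transform broadcasts the joined string back onto every row of the group
transformed = df.groupby('col1')['col2'].transform(lambda s: ' , '.join(s))

# agg collapses each group into a single row
aggregated = df.groupby('col1', as_index=False)[['col2']].agg(lambda s: ' , '.join(s))

print(len(transformed))  # 3
print(len(aggregated))   # 1
```

With .agg() there is one row per group from the start, so the .drop_duplicates() step is no longer needed.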

From there, the solution is similar to yours, but it uses a set instead of a dict (similar but simpler) and passes the delimiter ' , ' to split.
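As a small illustration of why the delimiter matters, the sketch below compares split() with no argument against split(' , ') on the joined col2 string from the question. The variable names are illustrative only.

```python
text = 'one , two , three four , two , three four'
delim = ' , '

# No argument: splits on any whitespace, so 'three four' is broken apart
# and each ',' becomes an element of its own
by_whitespace = text.split()

# Splitting on the full delimiter keeps multi-word elements intact
by_delim = text.split(delim)

print(by_whitespace)  # ['one', ',', 'two', ',', 'three', 'four', ',', 'two', ',', 'three', 'four']
print(by_delim)       # ['one', 'two', 'three four', 'two', 'three four']
```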

Assuming the final order of the elements does not matter, this will work:

delim = ' , '
# Join each group's strings, split on the full delimiter, de-duplicate with set, then rejoin
df0 = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(lambda s: delim.join(set(delim.join(s).split(delim))))
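If the order of first appearance does matter, one possible variant is to swap set for dict.fromkeys, since dicts preserve insertion order in Python 3.7+. This is only a sketch: join_unique is a hypothetical helper name, and it assumes every element is separated by exactly ' , ' with no stray extra spaces.

```python
import pandas as pd

delim = ' , '

def join_unique(s):
    # Hypothetical helper: join the group's strings, split on the full
    # delimiter, and keep only the first occurrence of each element.
    parts = delim.join(s).split(delim)
    return delim.join(dict.fromkeys(parts))

df0 = pd.DataFrame({
    'col1': [123, 123, 123],
    'col2': ['one , two , three four', 'two', 'three four'],
    'col3': ['many numbers , another number', 'number', 'another number , number'],
})

result = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(join_unique)

print(result.loc[0, 'col2'])  # one , two , three four
print(result.loc[0, 'col3'])  # many numbers , another number , number
```

On the sample data this reproduces the desired output shown in the question, including the element order.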

python

group-by

duplicates

dataframe

pandas