Pandas unique values per row, variable number of columns with data

Question

Consider the following dataframe:

import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]

df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])

    num0    num1    num2    num3
0   111.0   NaN     NaN     111.0
1   112.0   112.0   NaN     115.0
2   113.0   NaN     NaN     NaN
3   NaN     NaN     NaN     NaN
4   118.0   110.0   117.0   NaN

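Assuming my index is unique, I would like to retrieve the unique values for each row, with output like the one below. I want to keep the empty rows.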

    num1    num2    num3
0   111.0   NaN     NaN
1   112.0   115.0   NaN
2   113.0   NaN     NaN
3   NaN     NaN     NaN
4   110.0   117.0   118.0

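I have a working solution, although it is slow; see below. The order of the output values doesn't matter, as long as all values appear in the leftmost columns and the nulls on the right. I'm looking for best practices and ideas for speeding up the code. Thanks in advance.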

def arrange_row(row):
    values = list(set(row.dropna(axis=1).values[0]))
    values = [nan] if not values else values
    series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values)+1)])
    return series

df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
pd.__version__ == '1.2.3'

Answer 1

Use df.values with a list comprehension and df.dropna:

# Create a list of rows of dataframe
In [788]: l = df.values 

# Use List Comprehension to remove dups from above list of lists
In [789]: l_without_dupes = [list(dict.fromkeys(i)) for i in l]

# Create a new dataframe from above list and drop the column with all NaN's
In [795]: res_df = pd.DataFrame(l_without_dupes).dropna(axis=1, how='all')

In [796]: res_df
Out[796]: 
       0      1      2
0  111.0    NaN    NaN
1  112.0    NaN  115.0
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0
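In the output above, row 1 still has its NaN in the middle column, because each NaN counts as a distinct dictionary key and keeps its position. A minimal sketch of a variant that filters out NaN before de-duplicating, so that values end up left-aligned, might look like this. The names `rows` and `res_df` and the `num1..numN` column labels are illustrative choices, not part of the answer above.

```python
import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f"num{i}" for i in range(len(data[0]))])

# Drop NaN first, then de-duplicate while keeping first-seen order.
# A row with no values at all becomes [nan] so that it is kept in the result.
rows = [
    list(dict.fromkeys(v for v in row if pd.notna(v))) or [nan]
    for row in df.to_numpy()
]

# Ragged rows are padded with NaN on the right when the frame is built.
res_df = pd.DataFrame(rows, index=df.index)
res_df.columns = [f"num{i}" for i in range(1, res_df.shape[1] + 1)]
print(res_df)
```

This keeps the values in their original left-to-right order rather than sorting them, which the question says is acceptable.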

Answer 2

Another option, though longer:

outcome = (df.melt(ignore_index= False) # keep the index as a tracker
             .reset_index()
            # get the unique rows
             .drop_duplicates(subset=['index','value'])
             .dropna()
            # use this to build the new column names
             .assign(counter = lambda df: df.groupby('index').cumcount() + 1)
             .pivot(index='index', columns='counter', values='value')
             .add_prefix('num')
             .reindex(df.index)
             .rename_axis(columns=None)
) 

outcome 

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0

If you want it to match your output exactly, you can dump it into numpy, sort, and return to pandas:

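import numpy as np

# np.sort places NaN last in each row; pass the index through so it is not reset
pd.DataFrame(np.sort(outcome, axis = 1), index = outcome.index, columns = outcome.columns)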

    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  110.0  117.0  118.0

Another option is to sort in numpy before reshaping in Pandas:

(pd.DataFrame(np.sort(df, axis = 1), index = df.index)
   .apply(pd.unique, axis=1)
   .apply(pd.Series)
   .dropna(how='all',axis=1)
   .set_axis(['num1', 'num2','num3'], axis=1)
) 
    num1   num2   num3
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  110.0  117.0  118.0
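Since the question is about speed, the sort-based idea can also be kept entirely in NumPy, avoiding the row-wise `apply` calls. The following is only a sketch under a few assumptions: the frame is all-numeric with at least one column, sorted output order is acceptable, and the helper name `unique_per_row` is hypothetical.

```python
import numpy as np
import pandas as pd

def unique_per_row(df):
    # Sort each row; np.sort places NaN at the end of the row.
    a = np.sort(df.to_numpy(dtype=float), axis=1)
    # After sorting, a duplicate is any value equal to its left neighbour.
    # NaN never compares equal, so NaN entries are not flagged here.
    dup = np.zeros(a.shape, dtype=bool)
    dup[:, 1:] = a[:, 1:] == a[:, :-1]
    a[dup] = np.nan
    # Sort again to push the newly created NaN to the right.
    a = np.sort(a, axis=1)
    # Drop columns that are entirely NaN, but always keep the first column
    # so that empty rows survive even if the whole frame is empty.
    keep = ~np.isnan(a).all(axis=0)
    keep[0] = True
    a = a[:, keep]
    cols = [f"num{i}" for i in range(1, a.shape[1] + 1)]
    return pd.DataFrame(a, index=df.index, columns=cols)

nan = np.nan
data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f"num{i}" for i in range(len(data[0]))])
result = unique_per_row(df)
print(result)
```

For the sample frame this gives three columns with the values sorted within each row and NaN on the right. Whether it is actually faster on real data would need to be measured, for example with `timeit`.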

Answer 3

We can stack to reshape the dataframe, then group the reshaped frame on level=0 and aggregate using unique to get the unique values from each row. You can then create a new dataframe from those unique values:

# dropna() makes the NaN removal explicit rather than relying on stack()'s default
s = df.stack().dropna().groupby(level=0).unique()
pd.DataFrame([*s], index=s.index).reindex(df.index)

       0      1      2
0  111.0    NaN    NaN
1  112.0  115.0    NaN
2  113.0    NaN    NaN
3    NaN    NaN    NaN
4  118.0  110.0  117.0
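The result above has integer column labels. If the `num1..numN` names from the question are wanted, they can be assigned afterwards. The sketch below repeats the stack-based approach end to end; the variable name `out` is illustrative, and the explicit `dropna()` is an added safeguard so the NaN removal does not depend on the default behaviour of `stack()`.

```python
import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f"num{i}" for i in range(len(data[0]))])

# Long format, NaN removed, then the unique values collected per original row.
s = df.stack().dropna().groupby(level=0).unique()

# Build a frame from the ragged per-row arrays (shorter rows are padded with NaN),
# then restore rows that had no values at all.
out = pd.DataFrame([list(v) for v in s], index=s.index).reindex(df.index)

# Rename the positional columns to num1..numN.
out.columns = [f"num{i}" for i in range(1, out.shape[1] + 1)]
print(out)
```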
