Pandas unique values per row, variable number of columns with data
Consider the following DataFrame:
import pandas as pd
from numpy import nan
data = [
(111, nan, nan, 111),
(112, 112, nan, 115),
(113, nan, nan, nan),
(nan, nan, nan, nan),
(118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])
num0 num1 num2 num3
0 111.0 NaN NaN 111.0
1 112.0 112.0 NaN 115.0
2 113.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 118.0 110.0 117.0 NaN
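Assuming my index is unique, I would like to retrieve the unique values per index row, with output like the one below. I want to keep the empty rows.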
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 110.0 117.0 118.0
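I have a working solution, although it is slow; see below. The order of the output values does not matter, as long as all values appear in the leftmost columns and the nulls appear on the right.
I am looking for best practices and potential ideas for speeding up the code. Thank you in advance.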
def arrange_row(row):
    # row is a one-row DataFrame: drop the NaN columns, then deduplicate
    values = list(set(row.dropna(axis=1).values[0]))
    values = [nan] if not values else values
    series = pd.Series(values, index=[f"num{i}" for i in range(1, len(values)+1)])
    return series

df.groupby(level=-1).apply(arrange_row).unstack(level=-1)
pd.__version__ == '1.2.3'
Use df.values with a list comprehension and df.dropna:
# Create a list of rows of dataframe
In [788]: l = df.values
# Use List Comprehension to remove dups from above list of lists
In [789]: l_without_dupes = [list(dict.fromkeys(i)) for i in l]
# Create a new dataframe from above list and drop the column with all NaN's
In [795]: res_df = pd.DataFrame(l_without_dupes).dropna(axis=1, how='all')
In [796]: res_df
Out[796]:
0 1 2
0 111.0 NaN NaN
1 112.0 NaN 115.0
2 113.0 NaN NaN
3 NaN NaN NaN
4 118.0 110.0 117.0
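Note that in the output above row 1 still has a NaN sitting between 112.0 and 115.0. This appears to be because each NaN read from df.values is a separate object that does not compare equal to itself, so dict.fromkeys keeps them in place. Below is a minimal sketch of a variant that filters the NaNs out before deduplicating and pads on the right afterwards. The function name unique_left_aligned and the num1, num2, ... column naming are my own assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
from numpy import nan

def unique_left_aligned(df):
    """Hypothetical helper: per-row unique values, NaN padded on the right."""
    # Drop NaNs first, then deduplicate while keeping first-seen order
    rows = [list(dict.fromkeys(v for v in row if not pd.isna(v)))
            for row in df.to_numpy()]
    # Widest row decides the number of output columns (at least one)
    width = max(max((len(r) for r in rows), default=0), 1)
    padded = [r + [np.nan] * (width - len(r)) for r in rows]
    cols = [f"num{i}" for i in range(1, width + 1)]
    return pd.DataFrame(padded, index=df.index, columns=cols)

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])
result = unique_left_aligned(df)
```

This keeps the values in their original left-to-right order (row 4 stays 118, 110, 117), which the question says is acceptable.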
Another option, though longer:
outcome = (df.melt(ignore_index=False)  # keep the index as a tracker
             .reset_index()
             # get the unique rows
             .drop_duplicates(subset=['index', 'value'])
             .dropna()
             # use this to build the new column names
             .assign(counter=lambda df: df.groupby('index').cumcount() + 1)
             # keyword arguments, since pandas 2.0+ rejects positional ones here
             .pivot(index='index', columns='counter', values='value')
             .add_prefix('num')
             .reindex(df.index)
             .rename_axis(columns=None)
          )
outcome
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 118.0 110.0 117.0
If you want it to match your output exactly, you can dump it into numpy, sort, and return to pandas:
import numpy as np
pd.DataFrame(np.sort(outcome, axis=1), columns=outcome.columns, index=outcome.index)
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 110.0 117.0 118.0
Another option is to sort in numpy before reshaping in Pandas:
(pd.DataFrame(np.sort(df, axis = 1))
.apply(pd.unique, axis=1)
.apply(pd.Series)
.dropna(how='all',axis=1)
.set_axis(['num1', 'num2','num3'], axis=1)
)
num1 num2 num3
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 110.0 117.0 118.0
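Since the question is about speed, it may be worth avoiding the row-wise apply calls altogether. The sketch below stays in numpy: sort each row (NaN goes to the end), blank out any value equal to its left neighbour, then sort again so the remaining values move left. It assumes every column is numeric, and the name unique_per_row_sorted plus the num1, num2, ... labels are hypothetical choices of mine. I have not benchmarked it, so treat any speedup as something to measure on your own data.

```python
import numpy as np
import pandas as pd
from numpy import nan

def unique_per_row_sorted(df):
    """Hypothetical helper: sorted unique values per row, NaN on the right."""
    # np.sort returns a copy, so df is left untouched; NaN sorts last
    arr = np.sort(df.to_numpy(dtype=float), axis=1)
    # After sorting, duplicates are adjacent; NaN == NaN is False,
    # so NaNs are never flagged here
    dup = np.zeros(arr.shape, dtype=bool)
    dup[:, 1:] = arr[:, 1:] == arr[:, :-1]
    arr[dup] = np.nan
    # Second sort pushes the freshly inserted NaNs to the right
    arr = np.sort(arr, axis=1)
    # Drop columns that are entirely NaN, but keep at least one column
    keep = ~np.isnan(arr).all(axis=0)
    if keep.size:
        keep[0] = True
    arr = arr[:, keep]
    cols = [f"num{i}" for i in range(1, arr.shape[1] + 1)]
    return pd.DataFrame(arr, index=df.index, columns=cols)

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])
result = unique_per_row_sorted(df)
```

Unlike the set_axis call above, the column labels are derived from the result width rather than hard-coded, so the same function should cope with inputs that end up with more or fewer than three columns.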
We can stack to reshape the dataframe, then group the reshaped frame on level=0 and aggregate using unique to get the unique values from each row. You can then create a new dataframe from these unique values:
s = df.stack().groupby(level=0).unique()
pd.DataFrame([*s], index=s.index).reindex(df.index)
0 1 2
0 111.0 NaN NaN
1 112.0 115.0 NaN
2 113.0 NaN NaN
3 NaN NaN NaN
4 118.0 110.0 117.0
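If you also want the num1, num2, ... labels from the question, the same idea can be extended slightly. This is only a sketch: the explicit dropna() is my addition so the result does not depend on whether stack() drops NaN by default in your pandas version, and the column naming is an assumption based on the desired output in the question.

```python
import pandas as pd
from numpy import nan

data = [
    (111, nan, nan, 111),
    (112, 112, nan, 115),
    (113, nan, nan, nan),
    (nan, nan, nan, nan),
    (118, 110, 117, nan),
]
df = pd.DataFrame(data, columns=[f'num{i}' for i in range(len(data[0]))])

# Long format without NaN, then one array of unique values per original row
s = df.stack().dropna().groupby(level=0).unique()

# Ragged lists are padded with NaN on the right when building the frame;
# reindex restores rows that had no values at all (row 3 here)
result = pd.DataFrame([list(a) for a in s], index=s.index).reindex(df.index)
result.columns = [f"num{i}" for i in range(1, result.shape[1] + 1)]
```

As with the original snippet, values keep their first-seen order within each row, so row 4 comes out as 118, 110, 117 rather than sorted.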