Pandas 将 groupby 之后的值计数扩展为列

Pandas expand value counts after groupby as columns

作为特征工程的一部分,我想使用groupby之后的列的计数作为模型的特征,这是我尝试过的

>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
   col1  col2  col3
0    a  val1  val3
1    b  val2  val4
2    a  val2  val3
3    c  val1  val4
4    a  val2  val3
5    b  val2  val4
>>> test = df.groupby('col1').agg(list)
                    col2                col3
col1
a     [val1, val2, val2]  [val3, val3, val3]
b           [val2, val2]        [val4, val4]
c                 [val1]              [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
                        col2         col3
col1
a     {'val1': 1, 'val2': 2}  {'val3': 3}
b                {'val2': 2}  {'val4': 2}
c                {'val1': 1}  {'val4': 1}

稍后我可以将字典扩展到单独的列中,因此最终输出将是:

>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
   val1 val2 val3 val4
a  1.0  2.0  3.0  NaN
b  NaN  2.0  NaN  2.0
c  1.0  NaN  NaN  1.0

我觉得有一个更简单的解决方案,我们将不胜感激。

是的,melt+crosstab:

df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])

输出:

count  val1  val2  val3  val4
col1                         
a         1     2     3     0
b         0     2     0     2
c         1     0     0     1

如果你想要NaN:

df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))

输出:

count  val1  val2  val3  val4
col1                         
a       1.0   2.0   3.0   NaN
b       NaN   2.0   NaN   2.0
c       1.0   NaN   NaN   1.0
df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)

输出:

col2  val1  val2  val3  val4
col1
a      1.0   2.0   3.0   NaN
b      NaN   2.0   NaN   2.0
c      1.0   NaN   NaN   1.0

另一个结合了 melt、groupby 和 unstack 的选项:

(df.melt('col1')
   .groupby(['col1', 'value'])
   .size()
   .unstack()
   .rename_axis(index=None, columns=None)
)
       val1  val2  val3  val4
a       1.0   2.0   3.0   NaN
b       NaN   2.0   NaN   2.0
c       1.0   NaN   NaN   1.0