Pandas 将 groupby 之后的值计数扩展为列
Pandas expand value counts after groupby as columns
作为特征工程的一部分,我想使用groupby之后的列的计数作为模型的特征,这是我尝试过的
>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
col1 col2 col3
0 a val1 val3
1 b val2 val4
2 a val2 val3
3 c val1 val4
4 a val2 val3
5 b val2 val4
>>> test = df.groupby('col1').agg(list)
col2 col3
col1
a [val1, val2, val2] [val3, val3, val3]
b [val2, val2] [val4, val4]
c [val1] [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
col2 col3
col1
a {'val1': 1, 'val2': 2} {'val3': 3}
b {'val2': 2} {'val4': 2}
c {'val1': 1} {'val4': 1}
稍后我可以将字典扩展到单独的列中,因此最终输出将是:
>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
我觉得有一个更简单的解决方案,我们将不胜感激。
df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])
输出:
count val1 val2 val3 val4
col1
a 1 2 3 0
b 0 2 0 2
c 1 0 0 1
如果你想要NaN
:
df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))
输出:
count val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)
输出:
col2 val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
另一个结合了 melt、groupby 和 unstack 的选项:
(df.melt('col1')
.groupby(['col1', 'value'])
.size()
.unstack()
.rename_axis(index=None, columns=None)
)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
作为特征工程的一部分,我想使用groupby之后的列的计数作为模型的特征,这是我尝试过的
>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
col1 col2 col3
0 a val1 val3
1 b val2 val4
2 a val2 val3
3 c val1 val4
4 a val2 val3
5 b val2 val4
>>> test = df.groupby('col1').agg(list)
col2 col3
col1
a [val1, val2, val2] [val3, val3, val3]
b [val2, val2] [val4, val4]
c [val1] [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
col2 col3
col1
a {'val1': 1, 'val2': 2} {'val3': 3}
b {'val2': 2} {'val4': 2}
c {'val1': 1} {'val4': 1}
稍后我可以将字典扩展到单独的列中,因此最终输出将是:
>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
我觉得有一个更简单的解决方案,我们将不胜感激。
df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])
输出:
count val1 val2 val3 val4
col1
a 1 2 3 0
b 0 2 0 2
c 1 0 0 1
如果你想要NaN
:
df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))
输出:
count val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)
输出:
col2 val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
另一个结合了 melt、groupby 和 unstack 的选项:
(df.melt('col1')
.groupby(['col1', 'value'])
.size()
.unstack()
.rename_axis(index=None, columns=None)
)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0