Replicated rows as dictionary in pandas for feature extraction
I have a pandas DataFrame like this:
UID URL IMP
UID1 URLX 10
UID1 URLY 1
UID3 URLX 100
UID4 URLY 2
UID2 URLY 10
UID2 URLZ 1
I want to condense the DataFrame so that there is one row per UID, with the second column holding a list of dictionaries:
UID DICT
UID1 [{url:URLX,impressions:10},{url:URLY,impressions:1}]
UID2 [{url:URLY,impressions:10},{url:URLZ,impressions:1}]
UID3 [{url:URLX,impressions:100}]
UID4 [{url:URLY,impressions:2}]
Then I want to create feature vectors to compute similarity:
UID FEATURE
UID1 [10,1,0]
UID2 [0,10,1]
UID3 [100,0,0]
UID4 [0,2,0]
Thanks!
IIUC:
In [55]: df.groupby('UID')[df.columns.drop('UID').tolist()] \
             .apply(lambda x: x.to_dict('records')) \
             .reset_index(name='DICT')
Out[55]:
UID DICT
0 UID1 [{'URL': 'URLX', 'IMP': 10}, {'URL': 'URLY', '...
1 UID2 [{'URL': 'URLY', 'IMP': 10}, {'URL': 'URLZ', '...
2 UID3 [{'URL': 'URLX', 'IMP': 100}]
3 UID4 [{'URL': 'URLY', 'IMP': 2}]
and
In [52]: df.groupby('UID')['IMP'].apply(lambda x: x.tolist()).reset_index(name='FEATURE')
Out[52]:
UID FEATURE
0 UID1 [10, 1]
1 UID2 [10, 1]
2 UID3 [100]
3 UID4 [2]
For the first part, use df.groupby:
In [888]: df.groupby('UID').apply(lambda x: x[['URL', 'IMP']].to_dict('records'))
Out[888]:
UID
UID1 [{u'URL': u'URLX', u'IMP': 10}, {u'URL': u'URL...
UID2 [{u'URL': u'URLY', u'IMP': 10}, {u'URL': u'URL...
UID3 [{u'URL': u'URLX', u'IMP': 100}]
UID4 [{u'URL': u'URLY', u'IMP': 2}]
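For anyone who wants to run this step end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and spells the orient out as 'records', since newer pandas versions reject abbreviated orient names. The variable name dicts is just an illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ"],
    "IMP": [10, 1, 100, 2, 10, 1],
})

# Select the non-key columns first so each group only contains URL and IMP,
# then turn every group into a list of {column: value} dictionaries.
dicts = (
    df.groupby("UID")[["URL", "IMP"]]
      .apply(lambda g: g.to_dict("records"))
      .reset_index(name="DICT")
)

print(dicts)
```

Selecting the columns before apply keeps the grouping key out of the dictionaries, which matches the output format asked for in the question.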
And for the second part, use df.pivot:
In [900]: df.pivot(index='UID', columns='URL', values='IMP').fillna(0).astype(int)
Out[900]:
URL URLX URLY URLZ
UID
UID1 10 1 0
UID2 0 10 1
UID3 100 0 0
UID4 0 2 0
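One caveat: df.pivot raises a ValueError when the same (UID, URL) pair appears more than once. If real data can contain such repeats, a possible variant is pivot_table with an aggregation. The sketch below assumes that summing impressions is the right way to combine repeats, and the extra (UID1, URLX, 5) row is made up purely to show the effect.

```python
import pandas as pd

# Sample data from the question plus one hypothetical repeated (UID1, URLX) row.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2", "UID1"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ", "URLX"],
    "IMP": [10, 1, 100, 2, 10, 1, 5],
})

# pivot_table aggregates repeated index/column pairs instead of raising;
# fill_value=0 fills in the UID/URL combinations that never occur.
wide = df.pivot_table(
    index="UID", columns="URL", values="IMP", aggfunc="sum", fill_value=0
)

print(wide)
```

With the question's original six rows there are no repeats, so pivot and this pivot_table call should give the same table.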
If what you want is vectors, try:
In [923]: df_new = df[['UID']].sort_values('UID').drop_duplicates()
In [924]: df_new['FEATURE'] = df.pivot(index='UID', columns='URL', values='IMP').fillna(0).astype(int).values.tolist()
In [925]: df_new
Out[925]:
UID FEATURE
0 UID1 [10, 1, 0]
4 UID2 [0, 10, 1]
2 UID3 [100, 0, 0]
3 UID4 [0, 2, 0]
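Since the question mentions computing similarity, here is a rough sketch of one way to go from the pivoted table to pairwise cosine similarity using only NumPy. Cosine similarity is an assumption on my part (the question does not say which measure is wanted), and the names X, unit and sim are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ"],
    "IMP": [10, 1, 100, 2, 10, 1],
})

# One row per UID, one column per URL, zeros where a UID never saw a URL.
wide = df.pivot(index="UID", columns="URL", values="IMP").fillna(0).astype(int)

# Normalise each row to unit length; this assumes no UID has an all-zero row,
# which holds here because every UID comes from at least one impression record.
X = wide.to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of unit vectors are the pairwise cosine similarities.
sim = pd.DataFrame(unit @ unit.T, index=wide.index, columns=wide.index)

print(sim.round(3))
```

Working from the pivoted table rather than from a column of Python lists keeps every vector aligned on the same URL order, which is what a similarity measure needs.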