Replicated rows as dictionary in pandas for feature extraction

I have a pandas DataFrame like this:

UID    URL    IMP
UID1   URLX   10
UID1   URLY   1
UID3   URLX   100
UID4   URLY   2
UID2   URLY   10
UID2   URLZ   1

I want to collapse the DataFrame so that each UID has a single row, with a second column holding a list of dictionaries:

UID   DICT
UID1  [{url:URLX,impressions:10},{url:URLY,impressions:1}]
UID2  [{url:URLY,impressions:10},{url:URLZ,impressions:1}]
UID3  [{url:URLX,impressions:100}]
UID4  [{url:URLY,impressions:2}]

Then I want to create feature vectors to compute similarity:

UID   FEATURE
UID1  [10,1,0]
UID2  [0,10,1]
UID3  [100,0,0]
UID4  [0,2,0]

Thanks!

IIUC:

In [55]: df.groupby('UID')[df.columns.drop('UID').tolist()] \
           .apply(lambda x: x.to_dict('records')) \
           .reset_index(name='DICT')
Out[55]:
    UID                                               DICT
0  UID1  [{'URL': 'URLX', 'IMP': 10}, {'URL': 'URLY', '...
1  UID2  [{'URL': 'URLY', 'IMP': 10}, {'URL': 'URLZ', '...
2  UID3                      [{'URL': 'URLX', 'IMP': 100}]
3  UID4                        [{'URL': 'URLY', 'IMP': 2}]
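
For anyone reproducing this outside IPython, here is a self-contained sketch of the same grouping. The rename to `url` / `impressions` is an assumption taken from the key names in the question's desired output, and `to_records` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ"],
    "IMP": [10, 1, 100, 2, 10, 1],
})


def to_records(frame):
    """Collapse each UID into a list of {'url', 'impressions'} dicts."""
    # Rename first so the dict keys match the question's desired output.
    renamed = frame.rename(columns={"URL": "url", "IMP": "impressions"})
    return (
        renamed.groupby("UID")[["url", "impressions"]]
        .apply(lambda g: g.to_dict("records"))  # one list of dicts per UID
        .reset_index(name="DICT")
    )


result = to_records(df)
print(result)
```

The full orient name `'records'` is used because the abbreviated `'r'` is not accepted by recent pandas versions.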

In [52]: df.groupby('UID')['IMP'].apply(lambda x: x.tolist()).reset_index(name='FEATURE')
Out[52]:
    UID  FEATURE
0  UID1  [10, 1]
1  UID2  [10, 1]
2  UID3    [100]
3  UID4      [2]

For the first part, use df.groupby:

In [888]: df.groupby('UID').apply(lambda x: x[['URL', 'IMP']].to_dict('records'))
Out[888]: 
UID
UID1    [{u'URL': u'URLX', u'IMP': 10}, {u'URL': u'URL...
UID2    [{u'URL': u'URLY', u'IMP': 10}, {u'URL': u'URL...
UID3                     [{u'URL': u'URLX', u'IMP': 100}]
UID4                       [{u'URL': u'URLY', u'IMP': 2}]

And for the second part, use df.pivot:

In [900]: df.pivot(index='UID', columns='URL', values='IMP').fillna(0).astype(int)
Out[900]: 
URL   URLX  URLY  URLZ
UID                   
UID1    10     1     0
UID2     0    10     1
UID3   100     0     0
UID4     0     2     0
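
One caveat: `df.pivot` raises a `ValueError` when the same (UID, URL) pair appears more than once. If that can happen in your data, a possible sketch with `pivot_table` follows. Summing the duplicate impressions is an assumption (the question does not say how duplicates should be combined), and the extra `UID1`/`URLX` row is hypothetical, added only to show the behavior.

```python
import pandas as pd

# Question's data plus one hypothetical duplicate (UID1, URLX) row.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2", "UID1"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ", "URLX"],
    "IMP": [10, 1, 100, 2, 10, 1, 5],
})

# pivot_table aggregates duplicates instead of raising;
# fill_value=0 replaces the missing (UID, URL) combinations.
features = df.pivot_table(
    index="UID", columns="URL", values="IMP", aggfunc="sum", fill_value=0
)
print(features)
```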

If what you want is vectors, try:

In [923]: df_new = df[['UID']].sort_values('UID').drop_duplicates()

In [924]: df_new['FEATURE'] = df.pivot(index='UID', columns='URL', values='IMP').fillna(0).astype(int).values.tolist()

In [925]: df_new
Out[925]: 
    UID      FEATURE
0  UID1   [10, 1, 0]
4  UID2   [0, 10, 1]
2  UID3  [100, 0, 0]
3  UID4    [0, 2, 0]
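
Since the stated goal is to compute similarity, here is a minimal sketch of one way to go from the pivoted matrix to pairwise scores. The question does not name a metric, so cosine similarity is an assumption here; the sketch also assumes no UID has an all-zero row, which holds for the sample data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID3", "UID4", "UID2", "UID2"],
    "URL": ["URLX", "URLY", "URLX", "URLY", "URLY", "URLZ"],
    "IMP": [10, 1, 100, 2, 10, 1],
})

# One row per UID, one column per URL, zeros where a URL was never seen.
features = df.pivot(index="UID", columns="URL", values="IMP").fillna(0)

# Scale each row to unit length; the dot product of unit vectors
# is the cosine of the angle between them.
X = features.to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)

similarity = pd.DataFrame(
    unit @ unit.T, index=features.index, columns=features.index
)
print(similarity.round(3))
```

Working on the matrix directly avoids storing Python lists in a DataFrame column, which tends to be slow and awkward for numeric work.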