How to use scikit-learn's DictVectorizer to get an encoded dataframe from a dense dataframe in Python?
I have a dataframe as follows:
   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0
From this I want to create an encoded dataset (to use with fastFM) that looks like this:
user1  user2  user3  user4  item11  item12  item13  item14  affinity
    1      0      0      0       0       0       1       0       0.1
    0      1      0      0       1       0       0       0       0.4
    0      0      1      0       0       0       0       1       0.9
    0      0      0      1       0       1       0       0       1.0
Do I need sklearn's DictVectorizer for this? If so, is there a way to convert the original dataframe into a dict that can be fed to DictVectorizer, which in turn would give me the encoded dataset shown above?
You can use get_dummies with concat. If the values in the user or item columns are numeric, cast them to string with astype:
import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
                   'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                   'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                  columns=['user','item','affinity'])
print(df)
   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0
df1 = df.user.astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
   user1  user2  user3  user4
0      1      0      0      0
1      0      1      0      0
2      0      0      1      0
3      0      0      0      1
df2 = df.item.astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
   item11  item12  item13  item14
0       0       0       1       0
1       1       0       0       0
2       0       0       0       1
3       0       1       0       0
print(pd.concat([df1, df2, df.affinity], axis=1))
   user1  user2  user3  user4  item11  item12  item13  item14  affinity
0      1      0      0      0       0       0       1       0       0.1
1      0      1      0      0       1       0       0       0       0.4
2      0      0      1      0       0       0       0       1       0.9
3      0      0      0      1       0       1       0       0       1.0
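As a side note, newer pandas can produce the same dummies in a single call (a sketch, assuming a pandas version whose get_dummies accepts the columns, prefix and prefix_sep arguments); the encoded columns are appended after affinity, so the column order differs from the output above:

# one-call variant: encode user and item, keep affinity unchanged
print(pd.get_dummies(df, columns=['user', 'item'],
                     prefix=['user', 'item'], prefix_sep=''))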
Timings:
len(df) = 4:
In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 690 µs per loop
len(df) = 40:
df = pd.concat([df]*10).reset_index(drop=True)
In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 719 µs per loop
len(df) = 400:
df = pd.concat([df]*100).reset_index(drop=True)
In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 748 µs per loop
len(df) = 4k:
df = pd.concat([df]*1000).reset_index(drop=True)
In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 761 µs per loop
len(df) = 40k:
df = pd.concat([df]*10000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop
len(df) = 400k:
df = pd.concat([df]*100000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
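To address the DictVectorizer part of the question directly: it is not required, but the same kind of encoding can be obtained with it. Below is a minimal sketch, assuming a recent scikit-learn where DictVectorizer has get_feature_names_out (older versions use get_feature_names). DictVectorizer one-hot encodes string values and passes numeric values through, so user and item are cast to str first; note that the generated names use '=' as a separator, e.g. 'user=1' rather than 'user1'.

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]},
                  columns=['user', 'item', 'affinity'])

# one dict per row; user/item as strings get one-hot encoded,
# affinity stays numeric and is passed through unchanged
records = (df.assign(user=df.user.astype(str), item=df.item.astype(str))
             .to_dict(orient='records'))

v = DictVectorizer(sparse=False)   # sparse=False only to show a readable DataFrame;
                                   # the default sparse output is what an FM library would normally take
X = v.fit_transform(records)
print(pd.DataFrame(X, columns=v.get_feature_names_out()))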