Python Pandas: How to create a binary matrix from column of lists?
I have a Python Pandas DataFrame like the following:
1
0 a, b
1 c
2 d
3 e
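Each entry, such as "a, b", is a string representing a list of user features.
How can I convert this into a binary matrix of user features, like the following: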
a b c d e
0 1 1 0 0 0
1 0 0 1 0 0
2 0 0 0 1 0
3 0 0 0 0 1
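I saw a similar question, but the column there does not contain list entries.
I have tried the following approaches; is there a way to combine the two?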
pd.get_dummies()
pd.get_dummies(df[1])
a, b c d e
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
df[1].apply(lambda x: pd.Series(x.split()))
1
0 a, b
1 c
2 d
3 e
I am also interested in other ways to create this kind of binary matrix.
Any help is appreciated. Thanks!
I think you can use:
# Parentheses let the method chain span several lines.
df = (df.iloc[:, 0].str.split(', ', expand=True)
        .stack()
        .reset_index(drop=True)
        .str.get_dummies())
print(df)
a b c d e
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
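Edit: to keep one row per user rather than one row per feature, strip the spaces and pass the separator to str.get_dummies:
print(df.iloc[:, 0].str.replace(' ', '').str.get_dummies(sep=','))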
a b c d e
0 1 1 0 0 0
1 0 0 1 0 0
2 0 0 0 1 0
3 0 0 0 0 1
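Since the question asks whether splitting and pd.get_dummies() can be combined, here is a minimal sketch of one way to do it. It assumes the column is labelled 1 as in the question, that features are separated by ", ", and that a pandas version providing Series.explode is installed; the variable names exploded and binary are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (the column label 1 is taken from there).
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# Split each string into a list, then give every feature its own row
# while keeping the original row label in the index.
exploded = df[1].str.split(", ").explode()

# One-hot encode the features, then collapse back to one row per user.
binary = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(binary)
```

Taking the maximum per original row label keeps the result strictly 0/1 even if a feature appears twice in the same string.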
A while ago I wrote a general-purpose function that also supports grouping:
import numpy as np
import pandas as pd

def sublist_uniques(data, sublist):
    # Collect every distinct label found in the list-valued column.
    categories = set()
    for _, row in data.iterrows():
        try:
            categories.update(row[sublist])
        except TypeError:
            pass  # skip entries that are not iterable, e.g. NaN
    return list(categories)

def sublists_to_dummies(f, sublist, index_key=None):
    categories = sublist_uniques(f, sublist)
    frame = pd.DataFrame(columns=categories)
    for d, i in f.iterrows():
        # The original test `== list or np.array` was always true.
        if not isinstance(i[sublist], (list, np.ndarray)):
            continue
        row = np.zeros(len(categories))
        for j in i[sublist]:
            row[categories.index(j)] = 1
        # Rows are keyed by the grouping column if given, else by the index.
        key = d if index_key is None else i[index_key]
        if index_key is not None and key in frame.index:
            # Group already present: add to its counts instead of overwriting.
            for j in i[sublist]:
                frame.loc[key, j] += 1
        else:
            frame.loc[key] = row
    return frame
In [15]: a
Out[15]:
a group labels
0 1 new [a, d]
1 2 old [a, g, h]
2 3 new [i, m, a]
In [16]: sublists_to_dummies(a,'labels')
Out[16]:
a d g i h m
0 1 1 0 0 0 0
1 1 0 1 0 1 0
2 1 0 0 1 0 1
In [17]: sublists_to_dummies(a,'labels','group')
Out[17]:
a d g i h m
new 2 1 0 1 0 1
old 1 0 1 0 1 0
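For comparison, the same two tables can probably be obtained without an explicit Python loop. The sketch below is not the function above but a swapped-in technique using DataFrame.explode and pd.crosstab; it rebuilds the example frame a from the session output, and the names exploded, per_row and per_group are hypothetical.

```python
import pandas as pd

# Example frame rebuilt from the IPython session above.
a = pd.DataFrame({
    "a": [1, 2, 3],
    "group": ["new", "old", "new"],
    "labels": [["a", "d"], ["a", "g", "h"], ["i", "m", "a"]],
})

# One row per (original row, label); reset_index() stores the original
# row label in a column named "index" and makes the new index unique.
exploded = a.explode("labels").reset_index()

# Counts of each label per original row.
per_row = pd.crosstab(exploded["index"], exploded["labels"])

# Counts of each label per group.
per_group = pd.crosstab(exploded["group"], exploded["labels"])

print(per_row)
print(per_group)
```

Note that pd.crosstab counts occurrences, so a label repeated within one list would yield a value above 1; calling .clip(upper=1) on the result is one way to force a strictly binary matrix. The columns come out in sorted order, unlike the set-based ordering of the function above.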