Encoding for movie genres in dataset

Question

I have a movie dataset with a column that lists each movie's genres:

title    genres          
t1       ['Drama', 'Science Fiction', 'War']
t2       ['Action', 'Crime']

I want to encode them as:

title  Drama  Science Fiction  War  Action  Crime
t1     1      1                1    0       0
t2     0      0                0    1       1

I tried MultiLabelBinarizer, but the output was:

    ,   A   D   F   S   W   a   c   d   e   i   m   n   o   r   t   u   v
0   1   1   1   0   1   1   0   0   1   1   1   1   0   1   1   1   1   1   1
1   1   1   0   1   1   1   1   1   1   0   1   1   1   1   1   1   1   0   0

How can I fix this? Is there another way to do it?

Any help would be greatly appreciated.

Answer 1

Assuming this is your df:

    title   genres
0   t1  [Drama, Science Fiction, War]
1   t2  [Action, Crime]

You can do this:

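import ast
import pandas as pd

# If df.genres holds string representations of lists (e.g. after reading a CSV),
# parse them into real lists first; ast.literal_eval is safer than eval here.
df.genres = df.genres.apply(ast.literal_eval)

# One row per (title, genre) pair, then one-hot encode and sum back per title.
exploded_df = df.explode(column='genres')
pd.get_dummies(exploded_df, columns=['genres']).groupby('title', as_index=False).sum()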

# output
  title genres_Action   genres_Crime    genres_Drama    genres_Science Fiction  genres_War
0   t1  0               0               1               1                       1
1   t2  1               1               0               0                       0
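
As for why MultiLabelBinarizer produced single-character columns: that pattern usually means each genres cell is a string such as "['Action', 'Crime']" rather than a list, so the binarizer iterates over it character by character. Below is a minimal sketch of the MultiLabelBinarizer route, assuming that is the case here. The sample frame is a hypothetical reconstruction of the data in the question, not the asker's actual file.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample mirroring the question; genres are stored as strings.
df = pd.DataFrame({
    "title": ["t1", "t2"],
    "genres": ["['Drama', 'Science Fiction', 'War']", "['Action', 'Crime']"],
})

# Parse each string into a real list so the binarizer sees whole genre labels.
parsed = df["genres"].apply(ast.literal_eval)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(parsed),
    columns=mlb.classes_,  # genre names, in sorted order
    index=df.index,
)

# Put the title column back next to the indicator columns.
result = pd.concat([df[["title"]], encoded], axis=1)
print(result)
```

If the column already contains real lists, the ast.literal_eval step should be skipped, since it only accepts strings.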
