What is the best method to format data that is a list of categories into an adjacency matrix?

Question

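I have data that I intend to feed into an sklearn model. Some of the columns are lists of categories (it's movie data, so for example one column is {genres: [comedy, horror]}).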

How can I process these columns so that what goes into the model is an adjacency matrix, where each row then has data like the following?

{comedy: 1, action: 0, horror: 1, documentary: 0}

Answer 1

The preprocessor you are looking for is LabelBinarizer:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer

data = [{'genres': ['comedy', 'horror']}, {'genres': ['action', 'documentary']}]
df = pd.DataFrame(data)

# explode each list into separate rows, repeating the original row index
X = pd.concat([
        pd.DataFrame(v, index=np.repeat(k, len(v)), columns=['genre'])
            for k, v in df.genres.to_dict().items()])

lb = LabelBinarizer()
# make the binary fields, one row per (movie, genre) pair, indexed like X
dd = pd.DataFrame(lb.fit_transform(X['genre']), index=X.index, columns=lb.classes_)
# collapse back to one row per movie: 1 if the movie has that genre
dd.groupby(dd.index).max()

which gives

   action  comedy  documentary  horror
0       0       1            0       1
1       1       0            1       0


data-processing

python-3.x

scikit-learn