Create dummy variables if value is in list

Question

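I am using the Zomato Bangalore Restaurant dataset found here. One of my preprocessing steps is to create dummy variables for the types of cuisine each restaurant serves. I used pandas' explode to split the cuisines, and I have created one list of the top 30 cuisines and one of the cuisines that are not in the top 30. I have created a sample dataframe below.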


    import pandas as pd

    sample_df = pd.DataFrame({
        'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
        'cuisines_lst': [
            ['North Indian', 'Chinese'],
            ['Chinese', 'North Indian', 'Thai'],
            ['Cafe', 'Mexican', 'Italian']
        ]
    })

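I have already created the top list and the not-top list. In the actual data I use the top 30, but for the sake of the example it is the top 2 and the not-top 2.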


    counts = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts()
    top2 = counts.index[:2].tolist()
    not_top2 = counts.index[2:].tolist()

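What I want is to create a dummy variable for every cuisine in the top list, with the suffix _bin, and a final dummy variable Other that is 1 if the restaurant has a cuisine from the not-top list. The desired output looks like this: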

name	cuisines_lst	Chinese_bin	North Indian_bin	Other
Jalsa	[North Indian, Chinese]	1	1	0
Spice Elephant	[Chinese, North Indian, Thai]	1	1	1
San Churro Cafe	[Cafe, Mexican, Italian]	0	0	1

Answer 1

Create the dummies, then reduce over the duplicated index to get the top 2 columns:

    # One dummy row per exploded cuisine, summed back to one row per restaurant
    a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
        .reset_index().groupby('index')[top2].sum().add_suffix('_bin')

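If you want the columns in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort the columns with a = a.sort_index(axis=1).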
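
A minimal sketch of that sorting step, assuming a small hand-built frame (`a_demo` is a hypothetical stand-in for `a`, with the same column names):

```python
import pandas as pd

# Hypothetical stand-in for the `a` frame, with columns in value_counts order.
a_demo = pd.DataFrame(
    {'North Indian_bin': [1, 1, 0], 'Chinese_bin': [1, 1, 0]}
)

# sort_index(axis=1) returns a new frame with the columns sorted by label;
# it does not modify the frame in place, so the result is assigned to a new name.
a_sorted = a_demo.sort_index(axis=1)
print(a_sorted.columns.tolist())  # ['Chinese_bin', 'North Indian_bin']
```

Because the call returns a copy, the original column order is left untouched unless the result is assigned back.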

Do the same for the other values, but also reduce across the columns by passing axis=1 to any:

    # 1 if the restaurant serves at least one cuisine outside the top list
    b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
        .reset_index().groupby('index')[not_top2].sum() \
        .any(axis=1).astype(int).rename('Other')

Concatenate on the index:

    >>> print(pd.concat([sample_df, a, b], axis=1).to_string())
                  name                   cuisines_lst  North Indian_bin  Chinese_bin  Other
    0            Jalsa        [North Indian, Chinese]                 1            1      0
    1   Spice Elephant  [Chinese, North Indian, Thai]                 1            1      1
    2  San Churro Cafe       [Cafe, Mexican, Italian]                 0            0      1

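If you are operating on a large amount of data, one strategy may be to create an intermediate dataframe containing the exploded dummies, on which both groupby operations can then be performed.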
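
One way that intermediate-dataframe idea might look is sketched below. It is not the exact code from the answer: the name `dummies` is hypothetical, and the sketch groups with `groupby(level=0)` rather than `reset_index().groupby('index')`, which should be equivalent here because explode repeats the original row index.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian'],
    ],
})

# Explode once and reuse the result for the counts and for the dummies.
exploded = sample_df['cuisines_lst'].explode()
counts = exploded.value_counts()
top2 = counts.index[:2].tolist()
not_top2 = counts.index[2:].tolist()

# Intermediate dataframe (hypothetical name): the exploded dummies are built
# a single time and collapsed back to one row per restaurant.
dummies = pd.get_dummies(exploded).groupby(level=0).sum()

# Both pieces are now plain column selections on the same frame.
a = dummies[top2].add_suffix('_bin')
b = dummies[not_top2].any(axis=1).astype(int).rename('Other')

result = pd.concat([sample_df, a, b], axis=1)
print(result.to_string())
```

Note that sum counts occurrences, so if a restaurant's list could contain the same cuisine twice, a _bin column would hold 2 rather than 1; in that case replacing sum with max is one possible adjustment.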
