Pandas: how to add a column representing the intersection of 2 attributes in a DataFrame

Question

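Suppose I have 2 CSV files (very large ones).

The first file represents restaurants and has 6 attributes: restaurant_id, name, star_rating, city, zone, closed.

The second file represents the restaurants' categories and has 2 attributes: restaurant_id and category.

What I basically want to do is add a column called zone_categories_intersection to my features that tells me, for each restaurant, the number of restaurants in the same zone that share at least one category with it.

Since this is my first time using the pandas library, I'm not yet fluent at manipulating tables. I did something like this to compute the number of restaurants in each restaurant's zone and add it to my features.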


restaurants['nb_restaurants_zone'] = restaurants.groupby('zone')['zone'].transform('size')
restaurants.head()

features = restaurants[['restaurant_id', 'moyenne_etoiles', 'ville', 'zone', 'ferme', 'nb_restaurants_zone']].copy()
features.head()

#edit
merged = restaurants.merge(categories, on='restaurant_id')
merged.head()

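I thought about loading the categories CSV file, merging it with the restaurants so that each category is mapped to its restaurant id, and then finding a way to apply the second condition (sharing at least one category with the restaurant in question)... but I really don't know how to do that.

Thanks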

Answer 1

Try this

import pandas as pd

# sample data
# (it's not exactly the data you provided,
# but it shows better how the code works)
# please always provide runnable code with your question;
# you could get it with `df.head().to_dict('split')`
rest = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Denny\'s', 'Ike\'s Love & Sandwiches', 'Midori Japanese',
        'Pho U', 'John & Sons Oysters'],
    'avg_stars': [2.5, 4, 3.5, 3.5, 4],
    'city': ['Las Vegas', 'Phoenix', 'Calgary', 'Toronto', 'Toronto'],
    'zone': ['a', 'a', 'b', 'b', 'a']
})
cats = pd.DataFrame([
    [1, ['Breakfast', 'Dinners', 'American']],
    [2, ['Sandwiches', 'American']],
    [3, ['Japanese']],
    [4, ['Japanese']],
    [5, ['American', 'Seafood']]
], columns=['id', 'category']).explode('category')
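The `to_dict('split')` tip in the comments above can be sketched as follows. This is only a minimal illustration on a made-up two-row frame (the name `df` and its values are assumptions, not the asker's data): the printed dictionary can be pasted into a question, and a reader can rebuild the frame from it.

```python
import pandas as pd

# Hypothetical tiny frame standing in for the real data.
df = pd.DataFrame({'id': [1, 2], 'zone': ['a', 'b']})

# 'split' orientation yields a dict with the keys 'index', 'columns' and 'data'.
snippet = df.head().to_dict('split')
print(snippet)

# Those keys match the DataFrame constructor's arguments,
# so the frame can be rebuilt by unpacking the dict.
rebuilt = pd.DataFrame(**snippet)
print(rebuilt)
```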

Code

# add zone to categories dataframe
cats2 = cats.merge(rest[['id', 'zone']], on='id')

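# count the restaurants in each (zone, category) pair;
# selecting 'id' makes transform return a Series rather than a DataFrame
cats2['zone_cat_count'] = (
    cats2.groupby(['zone', 'category'])['id']
    .transform('count')
)

# for each restaurant, keep its largest per-category count,
# then merge it into the rest dataframe
# (the named Series is matched on its 'id' index)
rest = rest.merge(
    cats2.groupby('id')['zone_cat_count'].max(),
    on='id'
)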

Output

   id                     name  avg_stars       city zone  zone_cat_count
0   1                  Denny's        2.5  Las Vegas    a               3
1   2  Ike's Love & Sandwiches        4.0    Phoenix    a               3
2   3          Midori Japanese        3.5    Calgary    b               2
3   4                    Pho U        3.5    Toronto    b               2
4   5      John & Sons Oysters        4.0    Toronto    a               3
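Note that zone_cat_count is the size of the largest (zone, category) group a restaurant belongs to, so it includes the restaurant itself and can undercount when different neighbours match on different categories. If the goal is the number of distinct other restaurants in the same zone sharing at least one category, one possible variant is a self-merge on zone and category. The sketch below is an assumption about what the question intends, not part of the answer above; it reuses the sample ids, zones and categories, and the names `zoned`, `pairs` and `shared` are made up for illustration.

```python
import pandas as pd

# Same sample restaurants as above, reduced to the columns needed here.
rest = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'zone': ['a', 'a', 'b', 'b', 'a'],
})
# Categories in long format: one row per (restaurant, category).
cats = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 4, 5, 5],
    'category': ['Breakfast', 'Dinners', 'American', 'Sandwiches', 'American',
                 'Japanese', 'Japanese', 'American', 'Seafood'],
})

# Attach each restaurant's zone to its category rows.
zoned = cats.merge(rest[['id', 'zone']], on='id')

# Self-merge: pair every restaurant with every restaurant
# that has the same zone and the same category.
pairs = zoned.merge(zoned, on=['zone', 'category'], suffixes=('', '_other'))

# Drop the pairs that match a restaurant with itself.
pairs = pairs[pairs['id'] != pairs['id_other']]

# Count distinct partners, so a neighbour sharing two categories counts once.
shared = pairs.groupby('id')['id_other'].nunique()

# Restaurants with no partner (or no category at all) get 0.
rest['zone_categories_intersection'] = (
    rest['id'].map(shared).fillna(0).astype(int)
)
print(rest)
```

The self-merge grows quadratically with the size of each (zone, category) group, so on very large files it may need to be run zone by zone or replaced with a cheaper approximation.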

python

feature-extraction

pandas

feature-engineering