groupby function with filter conditions

import pandas as pd

df = pd.DataFrame({'Movie': ['name1', 'name2', 'name3'],
                   'genre': [['comedy', 'action'], ['comedy', 'scifi'],
                             ['thriller', 'action']],
                   'distributor': ['disney', 'disney', 'paramount']})

# When genre holds several values, the movie should count toward both
# genre[0] and genre[1]. This is what I tried with groupby:

res = df[df['distributor'] == 'disney'].groupby(['genre'])

Expected output

I only want the movies distributed by Disney.

distributor   genre    count of movies
disney        action   1
disney        comedy   2
disney        scifi    1

Explode your lists, then count the values:

out = df.loc[df['distributor'] == 'disney', 'genre'].explode().value_counts()
print(out)

# Output
comedy    2
action    1
scifi     1
Name: genre, dtype: int64
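To see why this works, it can help to look at the intermediate result of explode before counting. The following is a minimal sketch that rebuilds the sample frame from the question (the variable names `disney_genres` and `exploded` are only illustrative):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'Movie': ['name1', 'name2', 'name3'],
    'genre': [['comedy', 'action'], ['comedy', 'scifi'], ['thriller', 'action']],
    'distributor': ['disney', 'disney', 'paramount'],
})

# Keep only the genre lists of the Disney rows.
disney_genres = df.loc[df['distributor'] == 'disney', 'genre']

# explode() emits one row per list element and repeats the original index,
# so each movie contributes one row per genre it belongs to.
exploded = disney_genres.explode()
print(exploded)
```

Each list element becomes its own row, which is what lets value_counts count a movie once for every genre it is tagged with.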

Update

out = (df.explode('genre').query("distributor == 'disney'")
        .value_counts(['distributor', 'genre'], sort=False)
        .rename('count').reset_index())
print(out)

# Output
  distributor   genre  count
0      disney  action      1
1      disney  comedy      2
2      disney   scifi      1
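Since the question asks about groupby specifically, the same table can probably also be produced with groupby and size after exploding. This is a sketch of an alternative, not part of the answer above; it rebuilds the sample frame from the question:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'Movie': ['name1', 'name2', 'name3'],
    'genre': [['comedy', 'action'], ['comedy', 'scifi'], ['thriller', 'action']],
    'distributor': ['disney', 'disney', 'paramount'],
})

out = (df[df['distributor'] == 'disney']      # filter first
         .explode('genre')                    # one row per (movie, genre)
         .groupby(['distributor', 'genre'])   # group keys are sorted by default
         .size()                              # number of rows per group
         .reset_index(name='count'))
print(out)
```

Filtering before exploding keeps the exploded frame smaller; the result should be the same either way.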

Update 2

Your genre column does not seem to contain lists but strings. Before using the code above, try converting this column to lists with ast.literal_eval:

import ast

df['genre'] = df['genre'].str.replace(';', ',').apply(ast.literal_eval)

# OR

df['genre'] = pd.eval(df['genre'].str.replace(';', ','))

# Now run df.explode(...) as shown above
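As a rough end-to-end check of the conversion, here is a small sketch. The frame `raw` and its string format (list literals with ';' as the separator) are assumptions made for illustration; the real column may look different:

```python
import ast
import pandas as pd

# Hypothetical input: genre stored as strings, with ';' between items.
raw = pd.DataFrame({
    'genre': ["['comedy'; 'action']", "['comedy'; 'scifi']"],
    'distributor': ['disney', 'disney'],
})

# Turn ';' into ',' so each string is a valid Python list literal,
# then parse it safely into a real list.
raw['genre'] = (raw['genre']
                .str.replace(';', ',', regex=False)
                .apply(ast.literal_eval))

# With real lists in place, explode and count as before.
counts = raw.explode('genre')['genre'].value_counts()
print(counts)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for data of unknown origin.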

This is straightforward with datar, which reimagines the pandas API:

>>> import pandas as pd
>>> df = pd.DataFrame({'Movie': ['name1','name2','name3'],
...                   'genre': [['comedy', 'action'], ['comedy','scifi'],
...                             ['thriller','action']],
...                   'distributor': ['disney', 'disney','paramount']})
>>> df
   Movie               genre distributor
0  name1    [comedy, action]      disney
1  name2     [comedy, scifi]      disney
2  name3  [thriller, action]    paramount
>>>
>>> from datar.all import f, filter, unchop, count
[2022-03-31 11:47:44][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>> (
...     df 
...     >> filter(f.distributor == "disney") 
...     >> unchop(f.genre) 
...     >> count(f.distributor, f.genre)
... )
  distributor    genre       n
     <object> <object> <int64>
0      disney   comedy       2
1      disney   action       1
2      disney    scifi       1
[TibbleGrouped: distributor (n=1)]