Pandas groupby dictionary
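I'm new to pandas, so apologies if the solution is obvious.
I have a DataFrame (see below) that contains different movie scenes and the environment of each scene: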
import pandas as pd
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'},
{'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'},
{'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'},
{'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'},
{'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'},
{'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'},
{'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'},
{'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home' }]
myDF = pd.DataFrame(data)
In this case, each movie belongs to several genres. I have a dictionary (below) that describes which genres each movie belongs to:
genreDict = {'movie_X' : ['romance', 'action'],
'movie_Y' : ['comedy', 'romance', 'action'],
'movie_Z' : ['horror', 'thriller', 'romance']}
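I would like to group myDF by this dictionary. Specifically, I want to be able to tell how many times a particular environment occurs in a particular genre (for example, in the genre horror, 'boat' is counted once, 'beach' is counted once, and 'home' is counted once). What is the best and most efficient way to do this? I tried mapping the dictionary onto the DataFrame and then grouping by the resulting list column: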
myDF['genres'] = myDF['movie'].map(genreDict)
which returns:
movie scene environment genres
0 movie_X 1 home [romance, action]
1 movie_X 2 car [romance, action]
2 movie_X 3 home [romance, action]
3 movie_Y 1 home [comedy, romance, action]
4 movie_Y 2 office [comedy, romance, action]
5 movie_Z 1 boat [horror, thriller, romance]
6 movie_Z 2 beach [horror, thriller, romance]
7 movie_Z 3 home [horror, thriller, romance]
However, I get an error message saying that the list is unhashable. I hope you can help :)
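Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy your data so that the next steps are easier (the main operations on tabular structures are generally defined on tidy datasets). You need a dataset in which you don't list all genres in one row, but in which each genre has its own row.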
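To see where the "unhashable" error comes from, here is a minimal sketch. It assumes only that group keys must be hashable, which lists are not; the variable names are made up for illustration.

```python
import pandas as pd

# A tiny stand-in for the mapped 'genres' column from the question.
genres = pd.Series([["romance", "action"], ["comedy", "romance", "action"]])

# Lists are mutable, so Python refuses to hash them; this is the same
# TypeError that surfaces when a list column is used as a group key.
try:
    hash(genres[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple of the same items is hashable, but it would represent the whole
# genre combination rather than one genre per row.
tuple_hash = hash(tuple(genres[0]))

print(list_is_hashable)
```

So converting to tuples would silence the error, but it would not give per-genre counts, which is why reshaping to one genre per row is the more useful fix.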
Here is one possible way to get one row per genre:
genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist())
df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True))
df
Out:
environment movie scene genre
0 home movie_X 1 romance
0 home movie_X 1 action
1 car movie_X 2 romance
1 car movie_X 2 action
2 home movie_X 3 romance
2 home movie_X 3 action
3 home movie_Y 1 comedy
3 home movie_Y 1 romance
3 home movie_Y 1 action
4 office movie_Y 2 comedy
4 office movie_Y 2 romance
4 office movie_Y 2 action
5 boat movie_Z 1 horror
5 boat movie_Z 1 thriller
5 boat movie_Z 1 romance
6 beach movie_Z 2 horror
6 beach movie_Z 2 thriller
6 beach movie_Z 2 romance
7 home movie_Z 3 horror
7 home movie_Z 3 thriller
7 home movie_Z 3 romance
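If you are on pandas 0.25 or later, DataFrame.explode should give the same tidy shape more directly. This is a sketch on a shortened copy of the data; the names scenes, genre_dict and tidy are made up for the example.

```python
import pandas as pd

scenes = pd.DataFrame({
    "movie": ["movie_X", "movie_X", "movie_Y"],
    "scene": ["1", "2", "1"],
    "environment": ["home", "car", "home"],
})
genre_dict = {
    "movie_X": ["romance", "action"],
    "movie_Y": ["comedy", "romance", "action"],
}

# Map each movie to its list of genres, then give every list element its own row.
tidy = (
    scenes.assign(genre=scenes["movie"].map(genre_dict))
          .explode("genre")
          .reset_index(drop=True)
)
print(tidy)
```

Each scene is repeated once per genre of its movie, so the two movie_X scenes produce four rows and the single movie_Y scene produces three.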
Once you have a structure like this, it is much easier to group or cross-tabulate your data:
df.groupby('genre').size()
Out:
genre
action 5
comedy 2
horror 3
romance 8
thriller 3
dtype: int64
pd.crosstab(df['genre'], df['environment'])
Out:
environment beach boat car home office
genre
action 0 0 1 3 1
comedy 0 0 0 1 1
horror 1 1 0 1 0
romance 1 1 1 4 1
thriller 1 1 0 1 0
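To answer the original question for a single genre, you can index into the cross-tabulation. A small sketch on made-up tidy data (the frame tidy and its contents are illustrative only):

```python
import pandas as pd

tidy = pd.DataFrame({
    "genre": ["horror", "horror", "horror", "romance", "romance"],
    "environment": ["boat", "beach", "home", "home", "home"],
})

counts = pd.crosstab(tidy["genre"], tidy["environment"])

# Counts of every environment within one genre (a row of the table).
horror_counts = counts.loc["horror"]

# A single genre/environment cell.
romance_home = counts.loc["romance", "home"]

print(horror_counts)
print(romance_home)
```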
Here is a great read by Hadley Wickham: Tidy Data.
For better performance on larger DataFrames, use numpy: repeat the rows according to the list lengths with numpy.repeat, numpy.concatenate and Index.values:
import numpy as np
# get the length of each list in the genres column
l = myDF['genres'].str.len()
# convert the column to a numpy array
vals = myDF['genres'].values
# repeat each index label by the length of its list
idx = np.repeat(myDF.index, l)
# expand rows by the duplicated index values
myDF = myDF.loc[idx]
# flatten the lists into a single column
myDF['genres'] = np.concatenate(vals)
# restore the default monotonic index (0, 1, 2, ...)
myDF = myDF.reset_index(drop=True)
print(myDF)
environment movie scene genres
0 home movie_X 1 romance
1 home movie_X 1 action
2 car movie_X 2 romance
3 car movie_X 2 action
4 home movie_X 3 romance
5 home movie_X 3 action
6 home movie_Y 1 comedy
7 home movie_Y 1 romance
8 home movie_Y 1 action
9 office movie_Y 2 comedy
10 office movie_Y 2 romance
11 office movie_Y 2 action
12 boat movie_Z 1 horror
13 boat movie_Z 1 thriller
14 boat movie_Z 1 romance
15 beach movie_Z 2 horror
16 beach movie_Z 2 thriller
17 beach movie_Z 2 romance
18 home movie_Z 3 horror
19 home movie_Z 3 thriller
20 home movie_Z 3 romance
Then use groupby and aggregate with size:
df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
print(df1)
genres environment count
0 action car 1
1 action home 3
2 action office 1
3 comedy home 1
4 comedy office 1
5 horror beach 1
6 horror boat 1
7 horror home 1
8 romance beach 1
9 romance boat 1
10 romance car 1
11 romance home 4
12 romance office 1
13 thriller beach 1
14 thriller boat 1
15 thriller home 1
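If you prefer a wide table over the long one, the same grouped sizes can be unstacked. A sketch on a small made-up frame (tidy, long_counts and wide_counts are illustrative names):

```python
import pandas as pd

tidy = pd.DataFrame({
    "genres": ["action", "action", "comedy", "action"],
    "environment": ["home", "car", "home", "home"],
})

# Long format: one row per (genre, environment) pair that actually occurs.
long_counts = (
    tidy.groupby(["genres", "environment"]).size().reset_index(name="count")
)

# Wide format: one row per genre, one column per environment;
# pairs that never occur are filled with 0.
wide_counts = (
    tidy.groupby(["genres", "environment"]).size().unstack(fill_value=0)
)

print(long_counts)
print(wide_counts)
```

The long form only lists pairs that exist, while the wide form makes the zero counts explicit.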