Count the number of times each item in a list occurs in a pandas DataFrame column with comma-separated values, with additional aggregation of other columns
I have a list:
citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
and a pandas DataFrame df1 with these values:
first    last       city                            email         duration
John     Travis     New York                        a@email.com   5.5
Jim      Perterson  San Francisco, Los Angeles      b@email.com   6.8
Nancy    Travis     Chicago                         b1@email.com  1.2
Jake     Templeton  Los Angeles                     b3@email.com  4.9
John     Myers      New York                        b4@email.com  1.9
Peter    Johnson    San Francisco, Chicago          b5@email.col  2.3
Aby      Peters     Los Angeles                     b6@email.com  1.8
Amy      Thomas     San Francisco                   b7@email.col  8.8
Jessica  Thompson   Los Angeles, Chicago, New York  b8@email.com  4.2
I want to count how many times each city in citylist occurs in the DataFrame column 'city' (this part works, thanks to @scott-boston in my previous question):
(df1['city'].str.split(', ')
.explode()
.value_counts(sort=False)
.reindex(citylist, fill_value=0))
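For anyone who wants to reproduce the counting step, here is a minimal self-contained sketch. The DataFrame construction is a reconstruction of the table above and is an assumption (only the city and duration columns are included), not the original data source.

```python
import pandas as pd

citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']

# Assumed reconstruction of the sample table (only the columns used here)
df1 = pd.DataFrame({
    'city': ['New York', 'San Francisco, Los Angeles', 'Chicago', 'Los Angeles',
             'New York', 'San Francisco, Chicago', 'Los Angeles', 'San Francisco',
             'Los Angeles, Chicago, New York'],
    'duration': [5.5, 6.8, 1.2, 4.9, 1.9, 2.3, 1.8, 8.8, 4.2],
})

counts = (df1['city'].str.split(', ')        # each cell becomes a list of cities
                     .explode()              # one row per (original row, city) pair
                     .value_counts(sort=False)
                     .reindex(citylist, fill_value=0))  # citylist order, 0 for absent cities
print(counts.to_dict())
```

With this data, Miami gets a count of 0 because reindex adds it with the fill value.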
In addition, I want to sum the 'duration' column grouped by city and compute the percentage (group duration sum) / (total duration):
city           list  duration  %time
New York       3     11.6      0.31
San Francisco  3     17.9      0.47
Los Angeles    4     17.7      0.47
Chicago        3     7.7       0.20
Miami          0     0         0
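- You can explode the DataFrame on the city column.
- Then group by city and use .agg to do the calculations.
- For %time, create a variable var at the start that holds the sum of the duration column; it is used later to compute each city's share of the total.
- Finally, use a list comprehension to add rows for the cities in citylist that are not in the DataFrame: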
import pandas as pd

citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
var = df['duration'].sum()  # total duration, used later for the %time column (df is the question's df1)
df['city'] = df['city'].str.split(', ')  # change from string to list in preparation for explode
df = (df.explode('city')
        .groupby('city').agg({'email': 'count', 'duration': 'sum'}).reset_index()
        .rename({'email': 'list'}, axis=1))
df['%time'] = round(df['duration'] / var, 2)
# DataFrame.append was removed in pandas 2.0, so pd.concat adds the missing cities instead
missing = pd.DataFrame({'city': [x for x in citylist if x not in df['city'].unique()]})
df = pd.concat([df, missing]).fillna(0)
df
Out[1]:
            city  list  duration  %time
0        Chicago   3.0       7.7   0.21
1    Los Angeles   4.0      17.7   0.47
2       New York   3.0      11.6   0.31
3  San Francisco   3.0      17.9   0.48
0          Miami   0.0       0.0   0.00
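On the var step: the total has to be taken before exploding, because explode repeats the duration of every multi-city row once per city. The sketch below illustrates the difference on a reconstruction of the sample data (the DataFrame construction is an assumption, limited to the two columns involved).

```python
import pandas as pd

# Assumed reconstruction of the sample data (city and duration only)
df = pd.DataFrame({
    'city': ['New York', 'San Francisco, Los Angeles', 'Chicago', 'Los Angeles',
             'New York', 'San Francisco, Chicago', 'Los Angeles', 'San Francisco',
             'Los Angeles, Chicago, New York'],
    'duration': [5.5, 6.8, 1.2, 4.9, 1.9, 2.3, 1.8, 8.8, 4.2],
})

total_before = df['duration'].sum()  # each original row counted once

# assign() returns a copy, so the original df keeps its string column
exploded = df.assign(city=df['city'].str.split(', ')).explode('city')
total_after = exploded['duration'].sum()  # multi-city rows counted once per city

per_city = exploded.groupby('city')['duration'].sum()
share = (per_city / total_before).round(2)  # same definition as the %time column
print(round(total_before, 1), round(total_after, 1))
print(share.to_dict())
```

Here the total before exploding is 37.4 and the total after is 54.9, so dividing by the exploded sum would give different percentages. Note that the %time values add up to more than 1, since a row listing several cities contributes its full duration to each of them.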
Solution #2: Per @ScottBoston's comment, using reindex is cleaner and better than the list comprehension. You can also see this in his answer.
citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']
var = df['duration'].sum() #to be used later for %time column calculation
df['city'] = df['city'].str.split(', ') # change from string to list in preparation for explode
df = (df.explode('city')
.groupby('city').agg({'email' : 'count', 'duration' : 'sum'})
.rename({'email' : 'list'}, axis=1))
df['%time'] = round(df['duration'] / var, 2)
df.reindex(citylist, fill_value=0).reset_index()
Output:
            city  list  duration  %time
0       New York     3      11.6   0.31
1  San Francisco     3      17.9   0.48
2    Los Angeles     4      17.7   0.47
3        Chicago     3       7.7   0.21
4          Miami     0       0.0   0.00
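Solution #2 could also be wrapped in a reusable function that does not modify the input DataFrame. The sketch below is one possible way to do that. The function name city_summary, its parameters, and the use of named aggregation (counting non-null duration values rather than email) are hypothetical choices for illustration and not part of the original answer; the sample DataFrame is again a reconstruction of the question's table.

```python
import pandas as pd

def city_summary(df, cities, sep=', '):
    """Hypothetical helper: per-city row count, duration sum, and share of total duration."""
    total = df['duration'].sum()  # total over the original rows, before exploding
    out = (df.assign(city=df['city'].str.split(sep))  # copy with lists instead of strings
             .explode('city')
             .groupby('city')
             .agg(**{'list': ('duration', 'count'),    # rows per city (non-null durations)
                     'duration': ('duration', 'sum')}))
    out['%time'] = (out['duration'] / total).round(2)
    # reindex orders the rows like `cities` and adds absent cities filled with 0
    return out.reindex(cities, fill_value=0).reset_index()

citylist = ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Miami']

# Assumed reconstruction of the sample data (city and duration only)
df1 = pd.DataFrame({
    'city': ['New York', 'San Francisco, Los Angeles', 'Chicago', 'Los Angeles',
             'New York', 'San Francisco, Chicago', 'Los Angeles', 'San Francisco',
             'Los Angeles, Chicago, New York'],
    'duration': [5.5, 6.8, 1.2, 4.9, 1.9, 2.3, 1.8, 8.8, 4.2],
})

result = city_summary(df1, citylist)
print(result)
```

Because the missing cities are added with fill_value=0 rather than via NaN, the list column stays integer-typed, and df1 itself is left unchanged so the function can be called repeatedly.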