Pandas: Sort a subset under certain conditions
I have two data frames, one with employee information and one with project information. They look like this:
df_employees.head(5)
id name salary
0 0469 Aurore 18100
1 4c13 Malinda 16100
2 67f2 Cephus 20500
3 7b36 Doctor 16000
4 5c65 Paloma 27100
df_projects.head(5)
employee_id project_id start_date end_date
0 8411 03245e5933eb 2017-03-16 2017-03-30
1 b67d 033af173480e 2017-02-27 2017-04-06
2 5ca8 033af173480e 2017-02-27 2017-04-06
3 bb0c 03fa66f3a3e8 2020-06-12 None
4 6d22 03fa66f3a3e8 2020-06-12 None
Now I am trying to find the 3 lowest-paid employees who have completed at least 10 projects. To do that, I first merge the two datasets as follows:
import pandas as pd
df_mer = pd.merge(df_employees, df_projects, left_on= 'id', right_on= 'employee_id')
print(df_mer.head(10))
id name salary employee_id project_id start_date end_date
0 0469 Aurore 18100 0469 0bd005702cc1 2020-06-17 2020-08-02
1 0469 Aurore 18100 0469 8b2ce1e01e29 2018-02-05 2018-02-21
2 0469 Aurore 18100 0469 d580afd6f249 2020-07-04 2020-07-28
3 4c13 Malinda 16100 4c13 05b299e8ecd9 2018-02-06 2018-03-13
4 4c13 Malinda 16100 4c13 42c7826cb8e1 2019-02-08 2019-05-21
5 4c13 Malinda 16100 4c13 64e01cf58730 2018-04-28 2018-07-10
6 4c13 Malinda 16100 4c13 7854dcf39808 2019-12-05 None
7 4c13 Malinda 16100 4c13 8505058db062 2018-04-11 2018-05-29
8 4c13 Malinda 16100 4c13 908c863bccdd 2019-02-04 2019-05-14
9 4c13 Malinda 16100 4c13 c7bb3ababa5c 2018-02-18 2018-10-12
Then I wrote this command to get the number of projects completed by each employee:
vn = df_mer['name'].value_counts()
vn.head(5)
Arielle 15
Cornell 10
Devonta 10
Phylicia 10
Abigail 9
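Now I want to find the 3 lowest-paid employees who have completed at least 10 projects, but I don't know how to do that. I would appreciate it if someone could give me a hint. Thanks.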
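You could try using nsmallest on the first df, like
df_employees.nsmallest(3,'salary')
and group by employee to count the projects in the second one
df_projects.groupby('employee_id')['project_id'].count()
Then filter for counts of at least 10 and merge the results back together?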
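The steps suggested above can be sketched roughly as follows. This is only a minimal illustration on made-up data (the ids, names, salaries and project counts are all hypothetical), and it applies the project-count filter before nsmallest, on the assumption that the three lowest salaries overall might otherwise belong to employees with too few projects.

```python
import pandas as pd

# Hypothetical sample data, shaped like the frames in the question
df_employees = pd.DataFrame({
    "id": ["e1", "e2", "e3", "e4", "e5"],
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "salary": [300, 100, 200, 50, 150],
})
# Hypothetical number of projects per employee (e4 has only 2)
counts = {"e1": 10, "e2": 10, "e3": 11, "e4": 2, "e5": 12}
df_projects = pd.DataFrame({
    "employee_id": [eid for eid, n in counts.items() for _ in range(n)],
})
df_projects["project_id"] = [f"p{i}" for i in range(len(df_projects))]

# Count the projects of each employee
n_projects = df_projects.groupby("employee_id")["project_id"].count()

# Keep the ids of employees with at least 10 projects
eligible_ids = n_projects[n_projects >= 10].index

# Among those employees, take the three lowest salaries
result = df_employees[df_employees["id"].isin(eligible_ids)].nsmallest(3, "salary")
print(result)
```

Counting on employee_id rather than on name avoids lumping together two employees who happen to share a name.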
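vn = df_mer['name'].value_counts()
# Names of employees with at least 10 projects
a = vn[vn >= 10].index.tolist()
# Keep one row per qualifying employee
b = df_mer[df_mer['name'].isin(a)].drop_duplicates(subset='name', keep='first')
# Three lowest salaries among them
c = b.nsmallest(3, 'salary')
three_lowest_paid_employees = c['name']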
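I would arrange your operations differently. You can count the number of projects each employee has worked on directly from df_projects, before merging: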
from datetime import date
# You may want to keep only completed projects and exclude future ones
mask = ~df_projects["end_date"].isnull()
mask &= df_projects["start_date"] <= pd.to_datetime(date.today())
n_projects = df_projects.loc[mask, "employee_id"].value_counts()
n_projects.name = "n_projects"
# Reindex employee dataframe by employee id
df_employees.set_index("id", drop=True, inplace=True)
# Left Join
df = df_employees.join(n_projects, how="left")
The output looks like this, although I used an outer join here because none of the IDs in the provided sample data match:
name salary n_projects
0469 Aurore 18100.0 NaN
4c13 Malinda 16100.0 NaN
5c65 Paloma 27100.0 NaN
5ca8 NaN NaN 1.0
67f2 Cephus 20500.0 NaN
7b36 Doctor 16000.0 NaN
8411 NaN NaN 1.0
b67d NaN NaN 1.0
You can then go on to sort and filter easily...
project_limit = 10
df[df["n_projects"] >= project_limit].sort_values(
by=["salary", "n_projects"],
ascending=[True, False]
)
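Putting the pieces of this answer together, a self-contained sketch might look like the one below. The data is invented for illustration, the date columns are assumed to arrive as strings (hence the pd.to_datetime conversion, which is not part of the answer above), and .head(3) is added on the assumption that only the three lowest-paid employees are wanted.

```python
import pandas as pd

# Hypothetical employees
df_employees = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "salary": [18100, 16100, 20500, 16000],
})

# Hypothetical projects: (employee, finished projects, still-open projects)
rows = []
for emp, n_finished, n_open in [("a1", 10, 0), ("b2", 9, 3), ("c3", 12, 1), ("d4", 11, 0)]:
    rows += [(emp, "2018-01-01", "2018-02-01")] * n_finished
    rows += [(emp, "2020-06-12", None)] * n_open
df_projects = pd.DataFrame(rows, columns=["employee_id", "start_date", "end_date"])

# Assumption: dates are stored as strings; missing end dates become NaT
for col in ["start_date", "end_date"]:
    df_projects[col] = pd.to_datetime(df_projects[col], errors="coerce")

# Completed projects only: an end date exists and the start is not in the future
finished = df_projects["end_date"].notna() & (df_projects["start_date"] <= pd.Timestamp.today())
n_projects = df_projects.loc[finished, "employee_id"].value_counts().rename("n_projects")

# Left join on the employee id, without modifying df_employees in place
df = df_employees.set_index("id").join(n_projects, how="left")

# Filter first, then sort by salary and keep the first three rows
project_limit = 10
top3 = (
    df[df["n_projects"] >= project_limit]
    .sort_values(by=["salary", "n_projects"], ascending=[True, False])
    .head(3)
)
print(top3)
```

In this made-up data, Bob has a low salary but only 9 completed projects, so he is filtered out before the sort.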