Pandas: Sort a subset under certain conditions

I have two data frames, one containing employee information and the other containing project information. They look like this:

df_employees.head(5)

      id    name    salary
 0  0469    Aurore  18100
 1  4c13    Malinda 16100
 2  67f2    Cephus  20500
 3  7b36    Doctor  16000
 4  5c65    Paloma  27100

df_projects.head(5)

    employee_id project_id  start_date  end_date
 0  8411    03245e5933eb    2017-03-16  2017-03-30
 1  b67d    033af173480e    2017-02-27  2017-04-06
 2  5ca8    033af173480e    2017-02-27  2017-04-06
 3  bb0c    03fa66f3a3e8    2020-06-12  None
 4  6d22    03fa66f3a3e8    2020-06-12  None

I am trying to find the 3 lowest-paid employees who have completed at least 10 projects. To do this, I first merge the two datasets as follows:

import pandas as pd

df_mer = pd.merge(df_employees, df_projects, left_on= 'id', right_on= 'employee_id')

print(df_mer.head(10))

     id     name  salary employee_id    project_id  start_date    end_date
 0  0469   Aurore   18100        0469  0bd005702cc1  2020-06-17  2020-08-02
 1  0469   Aurore   18100        0469  8b2ce1e01e29  2018-02-05  2018-02-21
 2  0469   Aurore   18100        0469  d580afd6f249  2020-07-04  2020-07-28
 3  4c13  Malinda   16100        4c13  05b299e8ecd9  2018-02-06  2018-03-13
 4  4c13  Malinda   16100        4c13  42c7826cb8e1  2019-02-08  2019-05-21
 5  4c13  Malinda   16100        4c13  64e01cf58730  2018-04-28  2018-07-10
 6  4c13  Malinda   16100        4c13  7854dcf39808  2019-12-05        None
 7  4c13  Malinda   16100        4c13  8505058db062  2018-04-11  2018-05-29
 8  4c13  Malinda   16100        4c13  908c863bccdd  2019-02-04  2019-05-14
 9  4c13  Malinda   16100        4c13  c7bb3ababa5c  2018-02-18  2018-10-12

Then I wrote this command to get the number of projects each employee has done:

vn = df_mer['name'].value_counts()
vn.head(5)
Arielle     15
Cornell     10
Devonta     10
Phylicia    10
Abigail      9

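Now I want to find the 3 lowest-paid employees among those with at least 10 projects, but I don't know how to do it. I would appreciate it if someone could give me a hint. Thanks.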

You could try using nsmallest on the first df, like

df_employees.nsmallest(3,'salary')

and group the second one by employee to count the projects

df_projects.groupby('employee_id')['project_id'].count()

Then filter the counts to those of 10 or more and merge the two results back together?
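The steps above could be put together roughly as follows. This is only a sketch on invented toy data (the names, IDs, and salaries are made up, not the asker's), and it applies the count filter before nsmallest rather than after:

```python
import pandas as pd

# Toy data, invented for illustration only (not the asker's data)
df_employees = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5"],
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "salary": [300, 100, 200, 50, 400],
})
# Dee (d4) has only 2 projects, everyone else has 10 or more
df_projects = pd.DataFrame({
    "employee_id": ["a1"] * 10 + ["b2"] * 12 + ["c3"] * 10 + ["d4"] * 2 + ["e5"] * 11,
})
df_projects["project_id"] = [f"p{i}" for i in range(len(df_projects))]

# Count projects per employee
counts = df_projects.groupby("employee_id")["project_id"].count()

# Keep the ids of employees with at least 10 projects
eligible_ids = counts[counts >= 10].index

# Among those employees, take the 3 lowest salaries
result = df_employees[df_employees["id"].isin(eligible_ids)].nsmallest(3, "salary")
print(result)
```

The order matters here: calling nsmallest(3, 'salary') on the full employee table first would pick Dee, who has the lowest salary but too few projects, so the filter on the counts has to come first.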

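vn = df_mer['name'].value_counts()
# Names of employees with at least 10 projects
a = vn[vn >= 10].index.tolist()
# One row per such employee (the salary repeats on every merged row)
b = df_mer[df_mer['name'].isin(a)].drop_duplicates(subset='name', keep='first')
c = b.nsmallest(3, 'salary')

# A Python identifier cannot start with a digit
lowest_paid_3 = c['name']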

I would arrange your operations differently. You can count the number of projects per employee from df_projects before merging:

import pandas as pd

# Keep only finished projects (end_date is set) that have already started;
# adjust this mask if open or future projects should count as well
mask = df_projects["end_date"].notna()
# Parse start_date first, in case the column holds strings
mask &= pd.to_datetime(df_projects["start_date"]) <= pd.Timestamp.today()
n_projects = df_projects.loc[mask, "employee_id"].value_counts()
n_projects.name = "n_projects"

# Reindex employee dataframe by employee id
df_employees = df_employees.set_index("id")

# Left join on the index (employee id)
df = df_employees.join(n_projects, how="left")

The output looks like this, although I used an outer join here because none of the IDs match in the sample data provided:

         name   salary  n_projects
0469   Aurore  18100.0         NaN
4c13  Malinda  16100.0         NaN
5c65   Paloma  27100.0         NaN
5ca8      NaN      NaN         1.0
67f2   Cephus  20500.0         NaN
7b36   Doctor  16000.0         NaN
8411      NaN      NaN         1.0
b67d      NaN      NaN         1.0
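For reference, a table of this shape can be reproduced from the five sample rows in the question with an outer join. This is a sketch that assumes the date columns are stored as strings and only filters on end_date; the name df_outer is introduced here for illustration:

```python
import pandas as pd

# Sample rows copied from the question; the ids in the two frames do not overlap
df_employees = pd.DataFrame({
    "id": ["0469", "4c13", "67f2", "7b36", "5c65"],
    "name": ["Aurore", "Malinda", "Cephus", "Doctor", "Paloma"],
    "salary": [18100, 16100, 20500, 16000, 27100],
}).set_index("id")

df_projects = pd.DataFrame({
    "employee_id": ["8411", "b67d", "5ca8", "bb0c", "6d22"],
    "project_id": ["03245e5933eb", "033af173480e", "033af173480e",
                   "03fa66f3a3e8", "03fa66f3a3e8"],
    "start_date": ["2017-03-16", "2017-02-27", "2017-02-27",
                   "2020-06-12", "2020-06-12"],
    "end_date": ["2017-03-30", "2017-04-06", "2017-04-06", None, None],
})

# Count only projects that have an end date
mask = df_projects["end_date"].notna()
n_projects = df_projects.loc[mask, "employee_id"].value_counts().rename("n_projects")

# The outer join keeps ids from both sides, so unmatched rows show up with NaN
df_outer = df_employees.join(n_projects, how="outer")
print(df_outer)
```

With real data where the ids do match, the left join is the better fit, since it keeps exactly one row per employee.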

From there, sorting and filtering is straightforward:

project_limit = 10
df[df["n_projects"] >= project_limit].sort_values(
    by=["salary", "n_projects"],
    ascending=[True, False]
)
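To get exactly the 3 employees the question asks for, one option is to take the first three rows of the sorted result. A minimal sketch on an invented frame shaped like df (the values and the name top3 are hypothetical):

```python
import pandas as pd

# Hypothetical joined frame, shaped like `df` above (values invented)
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
        "salary": [300, 100, 100, 50, 400],
        "n_projects": [10, 12, 15, None, 11],
    },
    index=["a1", "b2", "c3", "d4", "e5"],
)

project_limit = 10
top3 = (
    # NaN >= 10 evaluates to False, so employees without projects are dropped
    df[df["n_projects"] >= project_limit]
    # Lowest salary first; ties are broken by the higher project count
    .sort_values(by=["salary", "n_projects"], ascending=[True, False])
    .head(3)
)
print(top3)
```

Because the left join leaves NaN for employees with no counted projects, the comparison filters them out without an explicit fillna.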