How to group and get only the records with consecutive rank based on a condition in Python Pandas or SQL

Question

Below is my data, which has 3 columns:

ID - member ID
Company - company name
Year - year of joining

import pandas as pd
import numpy as np

data = {'ID':[1,1,1,2,2,3,3,3,3,3,4,4,4],
        'Company':['Google','Microsoft','LinkedIn','Youtube','Google','Google','Microsoft','Youtube','Google','Microsoft','Microsoft','Google','LinkedIn'],
        'Year':[2001,2004,2009,2001,2009,1999,2000,2003,2006,2010,2010,2012,2020]}

FullData = pd.DataFrame(data)


FullData - 
ID  Company   Year
1   Google    2001
1   Microsoft 2004
1   LinkedIn  2009
2   Youtube   2001
2   Google    2009
3   Google    1999
3   Microsoft 2000
3   Youtube   2003
3   Google    2006
3   Microsoft 2010
4   Microsoft 2010
4   Google    2012
4   LinkedIn  2020

Below, I group the data by ID and rank the rows within each ID by Year:


FullData['Rank'] = FullData.groupby('ID')['Year'].rank(method='first').astype(int)
FullData


ID  Company    Year    Rank
1   Google     2001     1
1   Microsoft  2004     2
1   LinkedIn   2009     3
2   Youtube    2001     1
2   Google     2009     2
3   Google     1999     1
3   Microsoft  2000     2
3   Youtube    2003     3
3   Google     2006     4
3   Microsoft  2010     5
4   Microsoft  2010     1
4   Google     2012     2
4   LinkedIn   2020     3

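Now I need only the member IDs who worked for Google and then joined Microsoft. That is, within each ID, I need the records where the Company is Google or Microsoft and the Google row is immediately followed by a Microsoft row, with consecutive ranks. (Accepted output --> Google - rank 1 and Microsoft - rank 2, or Google - rank 4 and Microsoft - rank 5, and so on.)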

Below is a sample of the required output:

ID  Company    Year    Rank
1   Google     2001     1
1   Microsoft  2004     2
3   Google     1999     1
3   Microsoft  2000     2
3   Google     2006     4
3   Microsoft  2010     5

Or the count of unique IDs:

Count of unique IDs/members who worked for Google prior to Microsoft = 2

Any help is appreciated. Thanks a million in advance.

Answer 1

Use boolean indexing:

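def myfunc(df):
    # m1: a Google row whose next row (within this ID) is Microsoft
    m1 = df['Company'].eq('Google') & df['Company'].shift(-1).eq('Microsoft')
    # m2: the next row's Rank is exactly one higher (consecutive ranks)
    m2 = df['Rank'].eq(df['Rank'].shift(-1) - 1)
    pair = m1 & m2
    # Keep each matching Google row and the Microsoft row right after it;
    # fill_value=False keeps the shifted mask boolean instead of introducing NaN
    return df[pair | pair.shift(fill_value=False)]

# Keep only Google/Microsoft rows, then apply the check within each ID
out = FullData[FullData['Company'].isin(['Google', 'Microsoft'])] \
          .groupby('ID').apply(myfunc).droplevel(0)
print(out)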

# Output:
   ID    Company  Year  Rank
0   1     Google  2001     1
1   1  Microsoft  2004     2
5   3     Google  1999     1
6   3  Microsoft  2000     2
8   3     Google  2006     4
9   3  Microsoft  2010     5
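
The same rows can probably also be selected without groupby().apply(), by comparing each row's Company with its neighbours inside the same ID. The block below is a minimal sketch of that idea, not a replacement for the answer above. It assumes rows sorted by ID and Year are in the same order as the Rank column (true for this data, which has no tied years within an ID), and the names prev_company, next_company and pairs are made up for illustration.

```python
import pandas as pd

data = {
    'ID': [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4],
    'Company': ['Google', 'Microsoft', 'LinkedIn', 'Youtube', 'Google',
                'Google', 'Microsoft', 'Youtube', 'Google', 'Microsoft',
                'Microsoft', 'Google', 'LinkedIn'],
    'Year': [2001, 2004, 2009, 2001, 2009, 1999, 2000, 2003, 2006, 2010,
             2010, 2012, 2020],
}
df = pd.DataFrame(data)
df['Rank'] = df.groupby('ID')['Year'].rank(method='first').astype(int)

# Assumption: sorting by ID and Year puts rows in Rank order within each ID
df = df.sort_values(['ID', 'Year'])

# Company on the previous / next row of the same ID (NaN at group edges)
by_id = df.groupby('ID')['Company']
prev_company = by_id.shift(1)
next_company = by_id.shift(-1)

# A Google row directly followed by Microsoft, or the Microsoft row of such a pair
google_then_ms = df['Company'].eq('Google') & next_company.eq('Microsoft')
ms_after_google = df['Company'].eq('Microsoft') & prev_company.eq('Google')

pairs = df[google_then_ms | ms_after_google]
print(pairs)
print(pairs['ID'].nunique())  # 2
```

Because the shift happens per ID, a Google row at the end of one member's history is never paired with a Microsoft row at the start of the next member's.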

For the unique count, use out['ID'].nunique().
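
Since the question also mentions SQL, here is a rough sketch of how the same check might look with window functions, run through Python's built-in sqlite3 module so it stays self-contained. It assumes SQLite 3.25 or newer (needed for window functions). The table name members and the column aliases Rnk, PrevCompany and NextCompany are hypothetical; other databases may need small syntax changes.

```python
import sqlite3

rows = [
    (1, 'Google', 2001), (1, 'Microsoft', 2004), (1, 'LinkedIn', 2009),
    (2, 'Youtube', 2001), (2, 'Google', 2009),
    (3, 'Google', 1999), (3, 'Microsoft', 2000), (3, 'Youtube', 2003),
    (3, 'Google', 2006), (3, 'Microsoft', 2010),
    (4, 'Microsoft', 2010), (4, 'Google', 2012), (4, 'LinkedIn', 2020),
]

con = sqlite3.connect(':memory:')
# Hypothetical table name, used only for this illustration
con.execute('CREATE TABLE members (ID INTEGER, Company TEXT, Year INTEGER)')
con.executemany('INSERT INTO members VALUES (?, ?, ?)', rows)

query = """
WITH ranked AS (
    SELECT ID, Company, Year,
           ROW_NUMBER()  OVER (PARTITION BY ID ORDER BY Year) AS Rnk,
           LAG(Company)  OVER (PARTITION BY ID ORDER BY Year) AS PrevCompany,
           LEAD(Company) OVER (PARTITION BY ID ORDER BY Year) AS NextCompany
    FROM members
)
SELECT ID, Company, Year, Rnk
FROM ranked
WHERE (Company = 'Google' AND NextCompany = 'Microsoft')
   OR (Company = 'Microsoft' AND PrevCompany = 'Google')
ORDER BY ID, Rnk
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)

# Count of distinct members with a Google -> Microsoft move
unique_ids = len({row[0] for row in result})
print(unique_ids)  # 2
con.close()
```

LAG and LEAD look at the previous and next row within each ID, ordered by Year, so "consecutive rank" falls out of the window ordering and the rank numbers never need to be compared explicitly.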

Tags: sql, dataframe, python-3.x, pandas, pandas-groupby