Filter a df using numpy/pandas with multiple conditions: three date columns and one object column

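My name is Victor and I have Asperger's syndrome. I have trouble understanding functions clearly, synthesizing them in my head and communicating them to the computer, but I can visualize and express what I need through visual examples.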

I have a DataFrame filled with many years of membership affiliations, cancellations and disaffiliations.

I need to know which members were affiliated on a given date.

A user may stop being a member for 2 reasons, cancellation or disaffiliation, and sometimes both.

I will show 2 different examples of how I need the computer to process the DataFrame.

Example data:

import pandas as pd 


df = pd.DataFrame({'political party': ['MDB', 'MDB', 'PODE', 'PDT', 'PSL',  'PV', 'PSL', 'PT', 'PL'], 
                   'affiliated': ['Bob', 'John', 'Olivia', 'James', 'Victor', 'Victor', 'Emma', 'Rose', 'Mark'],
                   'date_affiliation': ['2006-01-31', '2011-04-11', '2007-09-04', '2009-10-13', '2017-12-30', '2020-09-02', '1992-02-23', '2010-10-19', '1985-06-22'],
                   'situation': ['unaffiliated', 'affiliated',  'canceled', 'canceled', 'canceled', 'affiliated', 'affiliated', 'unaffiliated', 'canceled'],
                   'date_disaffiliation': ['2020-02-18', '', '', '2011-11-23', '', '', '', '2010-10-30', '2010-04-08'],
                   'date_cancellation': ['', '', '2019-10-15', '2011-11-10', '2020-07-02', '', '', '', '2010-04-08']})

cols_date = ['date_affiliation', 'date_disaffiliation', 'date_cancellation']
for col in cols_date:
    df[col] = pd.to_datetime(df[col], errors='coerce')

print(df)
   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation
0              MDB         Bob        2006-01-31  unaffiliated           2020-02-18                NaT
1              MDB        John        2011-04-11    affiliated                  NaT                NaT
2             PODE      Olivia        2007-09-04      canceled                  NaT         2019-10-15
3              PDT       James        2009-10-13      canceled           2011-11-23         2011-11-10
4              PSL      Victor        2017-12-30      canceled                  NaT         2020-07-02
5               PV      Victor        2020-09-02    affiliated                  NaT                NaT
6              PSL        Emma        1992-02-23    affiliated                  NaT                NaT
7               PT        Rose        2010-10-19  unaffiliated           2010-10-30                NaT
8               PL        Mark        1985-06-22      canceled           2010-04-08         2010-04-08

Output sample one

   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation  affiliat_2005_08_15  affiliat_2010_08_07  affiliat_2020_01_05  affiliat_2020_11_15
0              MDB         Bob        2006-01-31  unaffiliated           2020-02-18                NaT                False                 True                 True                False
1              MDB        John        2011-04-11    affiliated                  NaT                NaT                False                False                 True                 True
2             PODE      Olivia        2007-09-04      canceled                  NaT         2019-10-15                False                 True                False                False
3              PDT       James        2009-10-13      canceled           2011-11-23         2011-11-10                False                 True                False                False
4              PSL      Victor        2017-12-30      canceled                  NaT         2020-07-02                False                False                 True                False
5               PV      Victor        2020-09-02    affiliated                  NaT                NaT                False                False                False                 True
6              PSL        Emma        1992-02-23    affiliated                  NaT                NaT                 True                 True                 True                 True
7               PT        Rose        2010-10-19  unaffiliated           2010-10-30                NaT                False                False                False                False
8               PL        Mark        1985-06-22      canceled           2010-04-08         2010-04-08                 True                False                False                False

Output sample two

Affiliated on 2005_08_15:

   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation
0              PSL        Emma        1992-02-23    affiliated                  NaT                NaT
1               PL        Mark        1985-06-22      canceled           2010-04-08         2010-04-08

Affiliated on 2010_08_07:

   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation
0              MDB         Bob        2006-01-31  unaffiliated           2020-02-18                NaT
1             PODE      Olivia        2007-09-04      canceled                  NaT         2019-10-15
2              PDT       James        2009-10-13      canceled           2011-11-23         2011-11-10
3              PSL        Emma        1992-02-23    affiliated                  NaT                NaT

Affiliated on 2020_01_05:

   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation
0              MDB         Bob        2006-01-31  unaffiliated           2020-02-18                NaT
1              MDB        John        2011-04-11    affiliated                  NaT                NaT
2              PSL      Victor        2017-12-30      canceled                  NaT         2020-07-02
3              PSL        Emma        1992-02-23    affiliated                  NaT                NaT

Affiliated on 2020_11_15:

   political party  affiliated  date_affiliation     situation  date_disaffiliation  date_cancellation
0              MDB        John        2011-04-11    affiliated                  NaT                NaT
1               PV      Victor        2020-09-02    affiliated                  NaT                NaT
2              PSL        Emma        1992-02-23    affiliated                  NaT                NaT

I think you have to use the .where() method, like this:

df.where(df['date_affiliation'] <= '2005-08-15')

You can use any of ['<', '>', '==', '>='] in place of '<=' to find the data you want.
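One caveat worth checking before relying on this: DataFrame.where keeps every row and fills the rows that fail the condition with NaN/NaT, so on its own it does not shorten the frame. Below is a minimal sketch comparing it with boolean indexing. The three-row subset and the names mask, masked and filtered are chosen here for illustration and are not from the question.

```python
import pandas as pd

# Small illustrative subset of the question's data (not the full frame).
df = pd.DataFrame({
    'affiliated': ['Bob', 'Emma', 'Mark'],
    'date_affiliation': pd.to_datetime(['2006-01-31', '1992-02-23', '1985-06-22']),
})

mask = df['date_affiliation'] <= pd.Timestamp('2005-08-15')

# where() keeps the original shape; rows failing the mask become NaN/NaT.
masked = df.where(mask)

# Boolean indexing drops the rows that fail the mask.
filtered = df[mask]

print(masked)
print(filtered)
```

If the goal is a shorter frame, df[mask] is usually the more direct route. Chaining .dropna(how='all') after where() should leave the same rows.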

Does this help point you in the right direction?

# Date to test:
date = '2010-08-07'

# Calculate some helper columns.
# A missing date (NaT) compares as False, so "no end date" means "has not left".
affiliated_before_date = df.date_affiliation <= date
disaffiliation_before_date = df.date_disaffiliation <= date
cancellation_before_date = df.date_cancellation <= date

# Final logic. Must be affiliated, but then NOT disaffiliated or cancelled.
people_to_include = affiliated_before_date & ~(disaffiliation_before_date | cancellation_before_date)

df[people_to_include]
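To get one filtered table per date, as in output sample two, the same mask can be wrapped in a function and applied to each date. This is only a sketch under the assumption that "affiliated on a date" means joined on or before it and not yet disaffiliated or canceled. The names affiliated_on and frames are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'political party': ['MDB', 'MDB', 'PODE', 'PDT', 'PSL', 'PV', 'PSL', 'PT', 'PL'],
    'affiliated': ['Bob', 'John', 'Olivia', 'James', 'Victor', 'Victor', 'Emma', 'Rose', 'Mark'],
    'date_affiliation': ['2006-01-31', '2011-04-11', '2007-09-04', '2009-10-13', '2017-12-30',
                         '2020-09-02', '1992-02-23', '2010-10-19', '1985-06-22'],
    'situation': ['unaffiliated', 'affiliated', 'canceled', 'canceled', 'canceled',
                  'affiliated', 'affiliated', 'unaffiliated', 'canceled'],
    'date_disaffiliation': ['2020-02-18', '', '', '2011-11-23', '', '', '', '2010-10-30', '2010-04-08'],
    'date_cancellation': ['', '', '2019-10-15', '2011-11-10', '2020-07-02', '', '', '', '2010-04-08'],
})
for col in ['date_affiliation', 'date_disaffiliation', 'date_cancellation']:
    df[col] = pd.to_datetime(df[col], errors='coerce')  # '' becomes NaT


def affiliated_on(df, date):
    """Hypothetical helper: rows that joined on or before `date` and have not left yet."""
    date = pd.Timestamp(date)
    joined = df['date_affiliation'] <= date
    # NaT compares as False, so a missing end date never counts as "left".
    left = (df['date_disaffiliation'] <= date) | (df['date_cancellation'] <= date)
    return df[joined & ~left].reset_index(drop=True)


dates = ['2005-08-15', '2010-08-07', '2020-01-05', '2020-11-15']
# One filtered frame per date, keyed by the date string.
frames = {date: affiliated_on(df, date) for date in dates}

for date, frame in frames.items():
    print(f'Affiliated on {date}')
    print(frame, end='\n\n')
```

reset_index(drop=True) is there only so that each table is numbered from 0 like the samples in the question. Drop it if the original row labels are needed.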

For the other requested output, with one boolean column per date, I would do something like this:

dates_to_add = ['2005-08-15', '2010-08-07', '2020-01-05', '2020-11-15']

def calculate_new_data_column(df, date):
    """Return a boolean Series: True where the row is affiliated on `date`."""
    affiliated_before_date = df.date_affiliation <= date
    disaffiliation_before_date = df.date_disaffiliation <= date
    cancellation_before_date = df.date_cancellation <= date

    return affiliated_before_date & ~(disaffiliation_before_date | cancellation_before_date)

for date in dates_to_add:
    # Use underscores so the names match the requested output, e.g. affiliat_2005_08_15
    df[f"affiliat_{date.replace('-', '_')}"] = calculate_new_data_column(df, date)
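Once the boolean columns exist, they can be reused for head counts and as row filters. The sketch below assumes column names in the affiliat_YYYY_MM_DD style of the question's sample output. The names is_affiliated_on, flag_cols, counts and on_2010 are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'political party': ['MDB', 'MDB', 'PODE', 'PDT', 'PSL', 'PV', 'PSL', 'PT', 'PL'],
    'affiliated': ['Bob', 'John', 'Olivia', 'James', 'Victor', 'Victor', 'Emma', 'Rose', 'Mark'],
    'date_affiliation': ['2006-01-31', '2011-04-11', '2007-09-04', '2009-10-13', '2017-12-30',
                         '2020-09-02', '1992-02-23', '2010-10-19', '1985-06-22'],
    'date_disaffiliation': ['2020-02-18', '', '', '2011-11-23', '', '', '', '2010-10-30', '2010-04-08'],
    'date_cancellation': ['', '', '2019-10-15', '2011-11-10', '2020-07-02', '', '', '', '2010-04-08'],
})
for col in ['date_affiliation', 'date_disaffiliation', 'date_cancellation']:
    df[col] = pd.to_datetime(df[col], errors='coerce')  # '' becomes NaT


def is_affiliated_on(df, date):
    """Hypothetical helper: boolean Series, True where the row is affiliated on `date`."""
    date = pd.Timestamp(date)
    joined = df['date_affiliation'] <= date
    left = (df['date_disaffiliation'] <= date) | (df['date_cancellation'] <= date)
    return joined & ~left


flag_cols = []
for date in ['2005-08-15', '2010-08-07', '2020-01-05', '2020-11-15']:
    col = 'affiliat_' + date.replace('-', '_')
    df[col] = is_affiliated_on(df, date)
    flag_cols.append(col)

# True counts as 1, so each column sum is the number of members on that date.
counts = df[flag_cols].sum()
print(counts)

# A flag column doubles as a row filter for its date.
on_2010 = df.loc[df['affiliat_2010_08_07'], ['political party', 'affiliated']]
print(on_2010)
```

Keeping the flags as columns means the per-date tables never have to be stored separately. Each one is a single .loc lookup away.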