Need help getting the difference between a min and max date grouped by patient id using only Python

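This is a script I am writing for a big data class assignment. I have all of the required statistics except for the last piece: using only Python, I need to find the average, minimum, and maximum number of days between a given patient's first and last appointment. The libraries I can use are NumPy, time, and pandas, and I can also import datetime and dateutil in the environment I am working in.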

I already get patient_id, timestamp amin, and timestamp amax as output using:

alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})

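I tried simply subtracting the timestamp amin output from the timestamp amax output, but I get an error. I also tried relativedelta, which produces an error as well. This is what I have so far.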

import time
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import relativedelta as r

'''Given Data'''
events = pd.read_csv('../data/train/events.csv')
mortality = pd.read_csv('../train/mortality_events.csv')

'''Join both dataframes'''
events = events.join(mortality.set_index('patient_id'), on = 'patient_id', rsuffix = '_mortality')

'''use mortality dataframe to list all deceased patients and events dataframe to list all living patients'''
mortality = events.loc[events['label']==1]
events = events.loc[events['label']!=1]

'''changing data type from object to datetime'''
mortality['timestamp'] = pd.to_datetime(mortality['timestamp'], infer_datetime_format = True)
events['timestamp'] = pd.to_datetime(events['timestamp'], infer_datetime_format = True)
mortality['timestamp_mortality'] = pd.to_datetime(mortality['timestamp_mortality'], infer_datetime_format = True)
events['timestamp_mortality'] = pd.to_datetime(events['timestamp_mortality'], infer_datetime_format = True)

'''group by patient ids and find minimum and maximum event dates'''
alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})

If it helps, I can get what I need in SQL with the code below, but the assignment requires me to do it in Python.

SELECT e.patient_id, 
   MIN(e.event_timestamp) as 'min date', 
   MAX(e.event_timestamp)as 'max date', 
   DATEDIFF(day,min(e.event_timestamp),max(e.event_timestamp)) as Delta
FROM Big_Data_Health_HW1.dbo.events e
LEFT JOIN Big_Data_Health_HW1.dbo.mortality_events m on m.patient_id = e.patient_id
WHERE m.label is not null
GROUP BY e.patient_id

I get an error saying the DataFrame object has no attribute 'relativedelta' when I use:

alvRl['RecLen'] = alvRl.relativedelta(alvRl['(timestamp, amin)'],alvRl['(timestamp, amin)']) 

I get the same kind of error with date_range when I use:

alvRl['RecLen'] = alvRl.date_range(alvRl['(timestamp, amin'],alvRl['(timestamp, amin']) 

I get a KeyError when I use:

alvRl['RecLen'] = alvRl['(timestamp, amin)'] - alvRl['(timestamp, amin)'] 

I am just not sure whether there is a better way to get this value.

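The relativedelta error occurs because alvRl.relativedelta(...) looks the name up on the DataFrame, which has no such attribute. Note also that relativedelta is imported under the alias r on this line, so it has to be reached through r rather than by its original name: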

from dateutil import relativedelta as r
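As a minimal sketch of what that alias implies: with this import, r is the dateutil.relativedelta module, so the class is r.relativedelta, and it operates on two scalar datetimes rather than on whole columns. The dates below are made up for illustration.

```python
from datetime import datetime
from dateutil import relativedelta as r

# Hypothetical first and last appointment for a single patient.
first = datetime(2020, 1, 1)
last = datetime(2020, 3, 15)

# r is the module, so the class is reached as r.relativedelta.
# relativedelta(dt1, dt2) describes dt1 - dt2 in calendar units.
delta = r.relativedelta(last, first)
print(delta.months, delta.days)

# For a plain day count, ordinary datetime subtraction is enough.
days = (last - first).days
print(days)
```

Because relativedelta splits the gap into months and days, plain subtraction is usually the simpler route when only a total number of days is needed.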

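You can subtract amin from amax, but the columns of alvRl are a MultiIndex, so a string such as '(timestamp, amin)' raises a KeyError. Each column has to be addressed by its tuple of labels, and dividing by pd.Timedelta(days=1) turns the result into a number of days. Note too that the attempts in the question subtract amin from itself rather than from amax: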

alvRl[('timestamp', 'RecLen')] = (alvRl[('timestamp', 'amax')] - alvRl[('timestamp', 'amin')]) / pd.Timedelta(days=1)
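To see which tuples are valid keys, it can help to print the column labels first. The sketch below uses a small made-up events frame and the string aggregators "min" and "max", so the second-level labels are "min" and "max" rather than "amin" and "amax"; with np.min and np.max the labels may differ between pandas versions, so checking alvRl.columns is the safest guide.

```python
import pandas as pd

# Toy data standing in for events.csv (values are invented).
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2020-01-11", "2020-02-01", "2020-02-03", "2020-03-02"]
    ),
})

alvRl = events.groupby("patient_id").agg({"timestamp": ["min", "max"]})

# Each column key is a (first level, second level) tuple.
cols = alvRl.columns.tolist()
print(cols)

# Index with the tuples, then convert the Timedelta result to whole days.
rec_len = (alvRl[("timestamp", "max")] - alvRl[("timestamp", "min")]).dt.days
print(rec_len)
```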

Or simply drop the first level of the MultiIndex:

alvRl = alvRl.droplevel(0, axis=1)
alvRl['RecLen'] = (alvRl['amax'] - alvRl['amin']) / pd.Timedelta(days=1)
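Putting the pieces together, here is one possible end-to-end sketch of the statistics the question asks for (average, minimum, and maximum record length in days). It is not the only way to do it: it assumes a toy events frame with invented values, and it uses named aggregation (the column names first and last are arbitrary choices) so that no MultiIndex is created in the first place.

```python
import pandas as pd

# Toy data standing in for the living-patient events (values are invented).
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2020-01-11", "2020-02-01", "2020-02-03", "2020-03-02"]
    ),
})

# Named aggregation yields flat column names instead of a MultiIndex.
span = events.groupby("patient_id")["timestamp"].agg(first="min", last="max")

# Record length per patient in whole days.
span["RecLen"] = (span["last"] - span["first"]).dt.days

# Summary statistics across patients.
stats = span["RecLen"].agg(["mean", "min", "max"])
print(span)
print(stats)
```

The same steps would apply unchanged to the deceased-patient frame, since only the input rows differ.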