Need help getting the difference between a min and max date grouped by patient id using only Python
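This is a script I'm writing for a big data class assignment. I have all the required statistics except for the last piece: using only Python, I need to find the average, minimum, and maximum number of days between a given patient's first and last appointment. The libraries I can use are NumPy, time, and pandas, and I can import datetime and dateutil in the environment I'm working in.
I already have the output of patient_id, timestamp amin, and timestamp amax using:
alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})
I tried simply subtracting the timestamp amin output from the timestamp amax output, but I got an error. I also tried relativedelta, which produces an error as well. Here is what I have so far.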
import time
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import relativedelta as r
# Given data
events = pd.read_csv('../data/train/events.csv')
mortality = pd.read_csv('../train/mortality_events.csv')
# Join both dataframes
events = events.join(mortality.set_index('patient_id'), on='patient_id', rsuffix='_mortality')
# Use the mortality dataframe to list all deceased patients and the events dataframe to list all living patients
mortality = events.loc[events['label'] == 1]
events = events.loc[events['label'] != 1]
# Change the data type from object to datetime
mortality['timestamp'] = pd.to_datetime(mortality['timestamp'], infer_datetime_format=True)
events['timestamp'] = pd.to_datetime(events['timestamp'], infer_datetime_format=True)
mortality['timestamp_mortality'] = pd.to_datetime(mortality['timestamp_mortality'], infer_datetime_format=True)
events['timestamp_mortality'] = pd.to_datetime(events['timestamp_mortality'], infer_datetime_format=True)
# Group by patient id and find the minimum and maximum event dates
alvRl = events.groupby(['patient_id']).agg({'timestamp' : [np.min, np.max]})
If it helps, I can get what I need in SQL with the code below, but this assignment requires me to do it in Python.
SELECT e.patient_id,
       MIN(e.event_timestamp) AS 'min date',
       MAX(e.event_timestamp) AS 'max date',
       DATEDIFF(day, MIN(e.event_timestamp), MAX(e.event_timestamp)) AS Delta
FROM Big_Data_Health_HW1.dbo.events e
LEFT JOIN Big_Data_Health_HW1.dbo.mortality_events m ON m.patient_id = e.patient_id
WHERE m.label IS NOT NULL
GROUP BY e.patient_id
When I use the following, I get an error saying that the 'DataFrame' object has no attribute 'relativedelta':
alvRl['RecLen'] = alvRl.relativedelta(alvRl['(timestamp, amin)'],alvRl['(timestamp, amin)'])
I get the same kind of error for date_range when I use:
alvRl['RecLen'] = alvRl.date_range(alvRl['(timestamp, amin'],alvRl['(timestamp, amin'])
I get a KeyError when I use:
alvRl['RecLen'] = alvRl['(timestamp, amin)'] - alvRl['(timestamp, amin)']
I'm just not sure whether there is a better way to get this value.
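The error you are getting has two causes: relativedelta is not a method of a DataFrame, so alvRl.relativedelta does not exist, and you renamed relativedelta to r in this line: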
from dateutil import relativedelta as r
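As a rough illustration of how relativedelta is normally called, here is a minimal sketch. It assumes the class is imported directly from the dateutil.relativedelta module, and the two dates are made-up values, not taken from your data. Note that relativedelta is meant for individual datetime values rather than whole DataFrame columns.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Hypothetical first and last appointment for a single patient.
first_visit = datetime(2014, 1, 10)
last_visit = datetime(2015, 3, 25)

# relativedelta(later, earlier) returns a calendar-style difference
# split into years, months, and days.
gap = relativedelta(last_visit, first_visit)
print(gap.years, gap.months, gap.days)

# Plain subtraction returns a timedelta, whose .days attribute is the
# total number of days between the two dates.
total_days = (last_visit - first_visit).days
print(total_days)
```

For a count of days, as in your DATEDIFF query, plain subtraction is usually the simpler choice.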
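You can subtract amin from amax, but the columns of alvRl are a MultiIndex, so each column is addressed by a tuple rather than by a single string. You have to access them like this: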
alvRl[('timestamp', 'RecLen')] = (alvRl[('timestamp', 'amax')] - alvRl[('timestamp', 'amin')]) / pd.Timedelta(days=1)
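To see why the string key raised a KeyError, here is a small sketch on made-up data (the patient ids and dates are assumptions for illustration only). It uses the string names 'min' and 'max' for the aggregation, so the second-level labels are min and max; the amin and amax labels in your output come from the NumPy function names and may differ between library versions.

```python
import pandas as pd

# Toy stand-in for the events table; values are hypothetical.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(
        ["2014-01-01", "2014-01-11", "2015-03-01", "2015-03-02", "2015-03-31"]
    ),
})

span = events.groupby("patient_id").agg({"timestamp": ["min", "max"]})

# Each column label is a tuple, not a string.
print(span.columns.tolist())  # [('timestamp', 'min'), ('timestamp', 'max')]

# A string that merely looks like the tuple is not a valid key.
string_key_found = "(timestamp, min)" in span.columns
tuple_key_found = ("timestamp", "min") in span.columns
print(string_key_found, tuple_key_found)

# Dividing a timedelta column by one day yields the length in days as floats.
days = (span[("timestamp", "max")] - span[("timestamp", "min")]) / pd.Timedelta(days=1)
print(days)
```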
Or simply drop the first level of the MultiIndex:
alvRl = alvRl.droplevel(0, axis=1)
alvRl['RecLen'] = (alvRl['amax'] - alvRl['amin']) / pd.Timedelta(days=1)
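Putting it together, one possible way to get the average, minimum, and maximum record length is sketched below on made-up data. It swaps in pandas named aggregation so the result has flat columns from the start; the column names first_event and last_event are hypothetical, chosen only for this example, and the toy rows stand in for your events table.

```python
import pandas as pd

# Toy stand-in for the events table; values are hypothetical.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3],
    "timestamp": pd.to_datetime([
        "2014-01-01", "2014-01-11",   # patient 1: 10 days
        "2015-03-01", "2015-03-31",   # patient 2: 30 days
        "2016-06-01", "2016-06-21",   # patient 3: 20 days
    ]),
})

# Named aggregation produces flat column names, so no MultiIndex is involved.
span = events.groupby("patient_id").agg(
    first_event=("timestamp", "min"),
    last_event=("timestamp", "max"),
)

# .dt.days gives the whole number of days in each timedelta.
span["RecLen"] = (span["last_event"] - span["first_event"]).dt.days

# Summary statistics across patients.
avg_days = span["RecLen"].mean()
min_days = span["RecLen"].min()
max_days = span["RecLen"].max()
print(avg_days, min_days, max_days)
```

Using .dt.days truncates to whole days, which is close to what DATEDIFF(day, ...) reports; dividing by pd.Timedelta(days=1) instead keeps fractional days if the timestamps carry a time of day.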