How to create a loop that counts the repetitions of a column value within a date range?
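I am working with a large medical dataset and have run into a problem.
I want to add a new column, "Readmission", that gives the number of surgeries the patient had in the 6 months before each surgery (admission) date. I have this: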
Patient_ID Surgery_Date
1838 2017-01-05
1838 2018-04-26
87 2017-01-11
1838 2017-07-06
87 2017-03-17
1838 2018-08-02
87 2017-11-15
1838 2018-11-22
87 2017-02-01
87 2017-06-21
1838 2018-06-14
I want this:
Patient_ID Surgery_Date Readmission
1838 2017-01-05 0
1838 2018-04-26 0
087 2017-01-11 0
1838 2017-07-06 0
087 2017-03-17 2
1838 2018-08-02 2
087 2017-11-15 1
1838 2018-11-22 2
087 2017-02-01 1
087 2017-06-21 3
1838 2018-06-14 1
I have already asked a similar question here, which helped me arrive at this code:
import pandas as pd
import datetime as dt
import numpy as np

# Your data plus a new patient that comes often
data = {'Patient_ID': [12, 1352, 55, 1352, 12, 6, 1352, 100, 100, 100, 100],
        'Surgery_Date': ['25/01/2009', '28/01/2009', '29/01/2009', '12/12/2008', '23/02/2008', '2/02/2009', '12/01/2009', '01/01/2009', '01/02/2009', '01/01/2010', '01/02/2010']}
df = pd.DataFrame(data, columns=['Patient_ID', 'Surgery_Date'])
readmissions = pd.Series(np.zeros(len(df), dtype=int), index=df.index)

# Loop through all unique ids
all_id = df['Patient_ID'].unique()
id_admissions = {}
for pid in all_id:
    # These are all the times a patient with a given ID has had surgery
    patient = df.loc[df['Patient_ID'] == pid]
    admissions_sorted = pd.to_datetime(patient['Surgery_Date'], format='%d/%m/%Y').sort_values()
    # True where the previous surgery was less than 180 days earlier
    frequency = admissions_sorted.diff() < dt.timedelta(days=180)
    # Compute the readmission
    n_admissions = [0]
    for v in frequency.values[1:]:
        n_admissions.append((n_admissions[-1] + 1) * v)
    # Add these values to the time series
    readmissions.loc[admissions_sorted.index] = n_admissions
df['Readmission'] = readmissions
However, the result is not correct for every patient and every date. This is what I get:
Patient_ID Surgery_Date Readmission
1838 2017-01-05 0
1838 2018-04-26 0
087 2017-01-11 0
1838 2017-07-06 0
087 2017-03-17 2
1838 2018-08-02 2
087 2017-11-15 4 (It's wrong because in the last 6 months there was 1 surgery for this ID)
1838 2018-11-22 3 (It's wrong because in the last 6 months there were 2 surgeries for this ID)
087 2017-02-01 1
087 2017-06-21 3
1838 2018-06-14 1
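Can anyone help me?
admissions_sorted.diff() only computes the difference between consecutive surgeries, so it never looks back more than one index. The code below accumulates the diff() values and compares the running total against 180 days: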
import pandas as pd
import numpy as np

# The data from the question
data = {'Patient_ID': [1838, 1838, 87, 1838, 87, 1838, 87, 1838, 87, 87, 1838],
        'Surgery_Date': ['2017-01-05', '2018-04-26', '2017-01-11', '2017-07-06', '2017-03-17', '2018-08-02', '2017-11-15', '2018-11-22', '2017-02-01', '2017-06-21', '2018-06-14']}
df = pd.DataFrame(data, columns=['Patient_ID', 'Surgery_Date'])
readmissions = pd.Series(np.zeros(len(df), dtype=int), index=df.index)

# Loop through all unique ids
all_id = df['Patient_ID'].unique()
for pid in all_id:
    # These are all the times a patient with a given ID has had surgery
    patient = df.loc[df['Patient_ID'] == pid]
    admissions_sorted = pd.to_datetime(patient['Surgery_Date'], format='%Y-%m-%d').sort_values()
    # Days between consecutive surgeries (NaN for the first one)
    frequency = admissions_sorted.diff() / np.timedelta64(1, 'D')
    n_admissions = []
    for i in range(frequency.shape[0] - 1):
        # Walk backwards from the i-th latest surgery, accumulating the gaps;
        # the leading NaN is excluded by the -1 stop
        cumulative_diff = frequency.iloc[::-1].iloc[i:-1].cumsum().astype(int)
        # Compute the readmission: earlier surgeries less than 180 days back
        n_admissions.append(cumulative_diff[cumulative_diff < 180].count())
    # The earliest surgery has nothing before it
    n_admissions.append(0)
    n_admissions.reverse()
    # Add these values to the time series
    readmissions.loc[admissions_sorted.index] = n_admissions
df['Readmission'] = readmissions
The output is:
Patient_ID Surgery_Date Readmission
0 1838 2017-01-05 0
1 1838 2018-04-26 0
2 87 2017-01-11 0
3 1838 2017-07-06 0
4 87 2017-03-17 2
5 1838 2018-08-02 2
6 87 2017-11-15 1
7 1838 2018-11-22 2
8 87 2017-02-01 1
9 87 2017-06-21 3
10 1838 2018-06-14 1
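Since the dataset is described as large, the inner loop over cumulative sums may become slow for patients with many surgeries. As a rough sketch of an alternative (not part of the answer above), the same count can be obtained per patient with a binary search on the sorted dates. The function name count_prior_surgeries and the window_days parameter are made up for illustration, and the sketch assumes the DataFrame index is unique and that "6 months" means strictly less than 180 days, as in the code above.

```python
import numpy as np
import pandas as pd


def count_prior_surgeries(df, window_days=180):
    """Count, for each row, the same patient's earlier surgeries that
    happened less than `window_days` days before that row's surgery.

    Hypothetical helper; assumes df.index is unique.
    """
    dates = pd.to_datetime(df["Surgery_Date"])
    counts = pd.Series(0, index=df.index, dtype=int)
    for _, group in dates.groupby(df["Patient_ID"]):
        ordered = group.sort_values()
        values = ordered.to_numpy()
        # For each surgery, position of the first surgery that falls
        # strictly inside the look-back window
        window_start = values - np.timedelta64(window_days, "D")
        first_in_window = np.searchsorted(values, window_start, side="right")
        # Surgeries between that position and the current one are the prior ones
        counts.loc[ordered.index] = np.arange(len(values)) - first_in_window
    return counts


data = {
    "Patient_ID": [1838, 1838, 87, 1838, 87, 1838, 87, 1838, 87, 87, 1838],
    "Surgery_Date": ["2017-01-05", "2018-04-26", "2017-01-11", "2017-07-06",
                     "2017-03-17", "2018-08-02", "2017-11-15", "2018-11-22",
                     "2017-02-01", "2017-06-21", "2018-06-14"],
}
df = pd.DataFrame(data)
df["Readmission"] = count_prior_surgeries(df)
print(df)
```

On the sample data this gives the same Readmission column as the output shown above. The loop now runs once per patient rather than once per surgery, which should scale better, though that has not been measured here.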