Filtering and creating a column based on the date column
I have sample data like this:
date Deadline
2018-08-01
2018-08-11
2018-09-18
2018-12-08
2018-12-18
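I want to fill the Deadline column with "1 DL", "2 DL", "3 DL", and so on, according to the conditions in the code below. In other words, I am creating a new column in Python based on the date column.
Error:
('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0')
This is what I tried: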
df['date'] = pd.to_datetime(df['date'], format = "%y-%m-%d").dt.date

def dead_line(df5):
    if((df5['date'] >= datetime.date(2018, 8, 1)) & (df['date'] <= datetime.date(2018, 9, 14))):
        return "1 DL"
    elif ((df5['date'] >= datetime.date(2018, 9, 15)) & (df5['date'] <= datetime.date(2018, 10, 17))):
        return "2 DL"
    elif ((df5['date'] >= datetime.date(2018, 10, 18)) & (df5['date'] <= datetime.date(2018, 12, 5))):
        return "3 DL"
    elif ((df5['date'] >= datetime.date(2018, 12, 6)) & (df5['date'] <= datetime.date(2019, 2, 1))):
        return "4 DL & EDL 2"

df['Deadline'] = df.apply(dead_line, axis = 1)
Expected output:
date Deadline
2018-08-01 1 DL
2018-09-16 2 DL
2018-12-07 3 DL
and so on.
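Categorical binning with pd.cut
The core issue is a mix of row-wise and column-wise operations. With axis=1, apply passes one row at a time to dead_line, yet the first condition compares df['date'], the whole column, instead of df5['date']. That comparison produces a boolean Series, and using a Series as an if condition raises the "truth value of a Series is ambiguous" error.
That said, with pandas you are better off using vectorized column-wise operations. So instead of apply, use the vectorized pd.cut. Note also that there is no need to resort to Python's datetime module.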
import pandas as pd

# convert series to datetime
df['date'] = pd.to_datetime(df['date'])

# Bin edges: with right=False each bin is [edge, next edge), so every edge is
# the first day of a period. The arbitrary lower and upper boundaries catch
# dates that fall outside all periods.
L = ['2000-01-01', '2018-08-01', '2018-09-15', '2018-10-18',
     '2018-12-06', '2019-02-02', '2100-01-01']

# convert boundaries to datetime
dates = pd.to_datetime(L).values

# define labels for boundary ranges
labels = ['Error Lower', '1 DL', '2 DL', '3 DL', '4 DL & EDL 2', 'Error Upper']

# apply categorical binning
df['Deadline'] = pd.cut(df['date'], bins=dates, labels=labels, right=False)

print(df)
# date Deadline
# 0 2018-08-01 1 DL
# 1 2018-08-11 1 DL
# 2 2018-09-18 2 DL
# 3 2018-12-08 4 DL & EDL 2
# 4 2018-12-18 4 DL & EDL 2
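To make the cause of the error concrete, here is a minimal sketch. The two-row frame and the helper name truth_value_error are made up for illustration; only the pandas calls are real. It suggests that using a whole-column comparison as an if condition raises the ValueError, while the same comparison on a single row inside apply(..., axis=1) gives a plain boolean.

```python
import pandas as pd

# Hypothetical two-row frame, just large enough to show the problem.
df = pd.DataFrame({"date": pd.to_datetime(["2018-08-01", "2018-09-18"])})
start = pd.Timestamp("2018-08-01")

# Comparing the whole column yields a boolean Series, one value per row.
mask = df["date"] >= start


def truth_value_error(series):
    """Return the message raised when a Series is used as an `if` condition."""
    try:
        if series:  # pandas refuses to reduce several booleans to one
            pass
    except ValueError as exc:
        return str(exc)
    return None


message = truth_value_error(mask)
print(message)

# With axis=1 each row is passed in; indexing the row gives a scalar,
# so the comparison evaluates to a single True/False.
row_flags = df.apply(lambda row: bool(row["date"] >= start), axis=1)
print(row_flags.tolist())
```

So the smallest change to the original function would be to use df5['date'] in every condition, although the vectorized approach is still preferable for anything beyond small frames.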
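An alternative to the solution above: instead of converting your dates to Python datetime.date objects for comparison, leave the column as datetime64 and have your filter function compare each value against other datetime64 values: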
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d")  # leaves as datetime64[ns]
print(df['date'].dtype)  # datetime64[ns]

def dead_line(x):
    # x is a single Timestamp, so plain `and` works; dates outside every range return None
    if x >= pd.to_datetime('2018-08-01') and x <= pd.to_datetime('2018-09-14'):
        return "1 DL"
    elif x >= pd.to_datetime('2018-09-15') and x <= pd.to_datetime('2018-10-17'):
        return "2 DL"
    elif x >= pd.to_datetime('2018-10-18') and x <= pd.to_datetime('2018-12-05'):
        return "3 DL"
    elif x >= pd.to_datetime('2018-12-06') and x <= pd.to_datetime('2019-02-01'):
        return "4 DL & EDL 2"

df['Deadline'] = df['date'].apply(dead_line)  # apply your function to the column, not the whole df
print(df)
Output:
date Deadline
0 2018-08-01 1 DL
1 2018-08-11 1 DL
2 2018-09-18 2 DL
3 2018-12-08 4 DL & EDL 2
4 2018-12-18 4 DL & EDL 2
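If you would rather keep the inclusive start and end dates exactly as the question states them, one more vectorized option is numpy.select combined with Series.between. This is a sketch of a different technique, not what either answer above uses; the periods list and the "no deadline" fallback label are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample dates from the question.
df = pd.DataFrame({"date": pd.to_datetime(
    ["2018-08-01", "2018-08-11", "2018-09-18", "2018-12-08", "2018-12-18"])})

# (first day, last day, label) for each period; both ends are inclusive.
periods = [
    ("2018-08-01", "2018-09-14", "1 DL"),
    ("2018-09-15", "2018-10-17", "2 DL"),
    ("2018-10-18", "2018-12-05", "3 DL"),
    ("2018-12-06", "2019-02-01", "4 DL & EDL 2"),
]

# One boolean mask per period; between() includes both endpoints by default.
conditions = [
    df["date"].between(pd.Timestamp(first), pd.Timestamp(last))
    for first, last, _ in periods
]
labels = [label for _, _, label in periods]

# Pick the label of the first matching mask; "no deadline" is a made-up fallback.
df["Deadline"] = np.select(conditions, labels, default="no deadline")
print(df)
```

Because the ranges are written with both endpoints, there is no need to shift any boundary by a day, at the cost of evaluating one mask per period.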