How can I keep dates and merge non-dates into another column of a df?
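I want to keep the date-formatted items in 'Dirty Dates' and merge any non-date values, as str, with the existing data in 'Comments & Junk', separating the values with ' | ' when both have a value. Please spare me the lecture about data sources: these are Excel files that many vendors email to us, and they don't all follow the rules for date columns.
Here is the original df: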
df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
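I tried adding a third, temporary column 'Merge' to conditionally combine the two columns with some formatting (a pipe separator with a space before and after), hoping I could use it to get my result and drop it later. The column isn't needed in the final result beyond processing, so feel free to avoid or delete it as needed.
Here is the code I tried: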
import datetime
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
df_t1 = df_raw.copy(deep=True)
df_t1['Dirty Dates'] = pd.to_datetime(df_t1['Dirty Dates'], errors='coerce')
df_t1['Dirty Dates'] = df_t1['Dirty Dates'].apply(lambda x: x if isinstance(x,datetime.datetime) else np.nan)
if df_t1['Comments & Junk'].isnull:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str)
else:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str) + ' | ' + df_t1['Comments & Junk']
print(df_raw)
print(df_t1)
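The desired final output should look like this:
I actually modified your df_raw a little to add two edge cases, since these could also turn up in your data (see the second code block).
Here is what I would do: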
def handle(dirty_dates, comm_junk):
    dates = pd.to_datetime(dirty_dates, errors="coerce")
    isna = dates.isna()  # errors="coerce" turns every non-date into NaT
    nondates = dirty_dates.where(isna, np.nan)
    # you want " | " between any two strings, but the comment could be NaN or an empty string
    toadd = comm_junk.ne("", fill_value="")
    comm_junk = comm_junk.where(~(isna & toadd), comm_junk + " | " + nondates)
    # you don't want a pointless " | ", so handle that case separately
    comm_junk.loc[isna & ~toadd] = nondates
    return dates, comm_junk
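To make the rule that handle() applies easier to follow, here is a minimal per-row sketch in plain Python. The helper name merge_one and the convention of passing None when the date parsed fine are assumptions made for this illustration; they are not part of the answer's code.

```python
def merge_one(non_date, comment):
    """Hypothetical per-row helper mirroring the three cases in handle()."""
    # None, NaN (a float) and "" all count as "no existing comment"
    has_comment = isinstance(comment, str) and comment != ""
    if non_date is None:
        # the value parsed as a date, so the comment is left untouched
        return comment
    if has_comment:
        # both sides have a value: join them with the pipe separator
        return comment + " | " + non_date
    # only the non-date has a value: no pointless separator
    return non_date


print(merge_one("junk", "things"))               # things | junk
print(merge_one("more junk", float("nan")))      # more junk
print(merge_one("other", ""))                    # other
```

The vectorized version does the same thing with boolean masks instead of an if chain, which avoids a Python-level loop over the rows.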
From the console:
>> df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk', 'other'],'Comments & Junk':['','stuff',np.nan,'things',np.nan, ""]})
>> df_raw
  Dirty Dates Comments & Junk
0     1/21/22
1      3-1-22           stuff
2      22-4-7             NaN  # combination of date and NaN
3        junk          things
4   more junk             NaN
5       other                  # combination of nondate and ""
>> dates, comm_junk = handle(df_raw["Dirty Dates"], df_raw["Comments & Junk"])
>> dates
0   2022-01-21
1   2022-03-01
2   2007-04-22
3          NaT
4          NaT
5          NaT
Name: Dirty Dates, dtype: datetime64[ns]
>> comm_junk
0
1            stuff
2              NaN
3    things | junk
4        more junk
5            other
Name: Comments & Junk, dtype: object
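Since the question asks for a result without a temporary 'Merge' column, the two returned Series can be written straight back over the original columns. The following is a sketch under a few assumptions: the sample frame and the name df_clean are made up for illustration, handle() is repeated only so the snippet runs on its own, and the sample dates all use the same m/d/yy layout so the parsing does not depend on how a given pandas version treats mixed formats.

```python
import numpy as np
import pandas as pd


def handle(dirty_dates, comm_junk):
    # same logic as the handle() function shown earlier
    dates = pd.to_datetime(dirty_dates, errors="coerce")
    isna = dates.isna()
    nondates = dirty_dates.where(isna, np.nan)
    toadd = comm_junk.ne("", fill_value="")
    comm_junk = comm_junk.where(~(isna & toadd), comm_junk + " | " + nondates)
    comm_junk.loc[isna & ~toadd] = nondates
    return dates, comm_junk


# Hypothetical sample data, not the vendor files from the question
df = pd.DataFrame({
    "Dirty Dates": ["1/21/22", "3/1/22", "junk", "more junk"],
    "Comments & Junk": ["", "stuff", "things", np.nan],
})

# Work on a copy so the raw frame stays available for comparison
df_clean = df.copy()
df_clean["Dirty Dates"], df_clean["Comments & Junk"] = handle(
    df["Dirty Dates"], df["Comments & Junk"]
)
print(df_clean)
```

Assigning both columns in one statement keeps the dates and the merged comments in sync, and no helper column ever has to be created or dropped.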
If you run into any problems, let me know in the comments.