How can I keep dates and merge non-dates into another column of a df?

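I want to keep the date-formatted items in 'Dirty Dates' and merge any non-date values, as str, into the existing data in 'Comments & Junk', separating the values with ' | ' when both have a value. Please, no lectures about data sources and the like: these are Excel files that many vendors email to us, and they don't all follow the rules for the date column.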

Here is the original df:
df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})

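I tried adding a third, temporary column 'Merge' to conditionally combine the two columns and add some formatting (a pipe separator with a space before and after), hoping I could use it to get my result and then drop it later. This column isn't needed in the final result, only for processing, so feel free to avoid it or drop it as needed.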

Here is the code I tried:

import datetime
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk'],'Comments & Junk':['','stuff','','things',np.nan]})
df_t1 = df_raw.copy(deep=True)
df_t1['Dirty Dates'] = pd.to_datetime(df_t1['Dirty Dates'], errors='coerce')
df_t1['Dirty Dates'] = df_t1['Dirty Dates'].apply(lambda x: x if isinstance(x,datetime.datetime) else np.nan)
if df_t1['Comments & Junk'].isnull:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str)
else:
    df_t1['Merge'] = df_t1['Dirty Dates'].astype(str) + ' | ' + df_t1['Comments & Junk']

print(df_raw)
print(df_t1)

The desired final output should look like this:

I actually modified your df_raw a bit to add two edge cases, since they could show up in your data as well (see the second code block).

Here is what I would do:

import numpy as np
import pandas as pd

def handle(dirty_dates, comm_junk):
    dates = pd.to_datetime(dirty_dates, errors="coerce")
    isna = dates.isna()  # errors="coerce" turns every non-date into NaT
    nondates = dirty_dates.where(isna, np.nan)  # keep only the values that failed to parse

    # A comment counts as present only if it is neither NaN nor an empty string.
    toadd = comm_junk.ne("", fill_value="")
    # Non-date plus an existing comment: join the two with " | ".
    comm_junk = comm_junk.where(~(isna & toadd), comm_junk + " | " + nondates)
    # Non-date with no comment: use the non-date alone, so there is no pointless " | ".
    comm_junk.loc[isna & ~toadd] = nondates

    return dates, comm_junk
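
If the masks above are hard to follow, the same joining rule can be written value by value. This is only a sketch, not part of the original answer: `join_nonempty` is a hypothetical helper name, and the two input Series are typed in by hand (comments and already-extracted non-dates) so that the string logic can be seen without any date parsing. It will be slower than the vectorized version on large frames.

```python
import numpy as np
import pandas as pd

def join_nonempty(comment, junk, sep=" | "):
    """Join whichever of the two values are real, non-empty strings.

    If neither qualifies, the original comment (NaN or "") is returned unchanged.
    """
    parts = [p for p in (comment, junk) if isinstance(p, str) and p != ""]
    return sep.join(parts) if parts else comment

# Hand-written inputs: existing comments, and the non-date values per row.
comments = pd.Series(["", "stuff", np.nan, "things", np.nan, ""])
nondates = pd.Series([np.nan, np.nan, np.nan, "junk", "more junk", "other"])

merged = pd.Series(
    [join_nonempty(c, j) for c, j in zip(comments, nondates)],
    index=comments.index,
)
print(merged.tolist())
# ['', 'stuff', nan, 'things | junk', 'more junk', 'other']
```

For these inputs the result matches what `handle` returns for the comments column.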

From the console:

>> df_raw = pd.DataFrame({'Dirty Dates':['1/21/22','3-1-22','22-4-7','junk','more junk', 'other'],'Comments & Junk':['','stuff',np.nan,'things',np.nan, ""]})

>> df_raw
  Dirty Dates Comments & Junk
0     1/21/22                
1      3-1-22           stuff
2      22-4-7             NaN   # combination of date and NaN
3        junk          things
4   more junk             NaN
5       other                   # combination of nondate and ""

>> dates, comm_junk = handle(df_raw["Dirty Dates"], df_raw["Comments & Junk"])

>> dates
0   2022-01-21
1   2022-03-01
2   2007-04-22
3          NaT
4          NaT
5          NaT
Name: Dirty Dates, dtype: datetime64[ns]

>> comm_junk
0                 
1            stuff
2              NaN
3    things | junk
4        more junk
5            other
Name: Comments & Junk, dtype: object
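
One caveat about the dates column: on pandas 2.0 and later, `pd.to_datetime` infers a single format from the first value and applies it to the whole column, so with `errors="coerce"` a column that mixes `1/21/22` and `3-1-22` may come back with more NaT values than shown above. Passing `format="mixed"` is one option on those versions. Another, sketched below on a reduced sample, is to parse each value on its own; this is an assumption about what you may need, not something the output above depends on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Dirty Dates": ["1/21/22", "3-1-22", "junk", "more junk"],
    "Comments & Junk": ["", "stuff", "things", np.nan],
})

# Parse every cell separately so one cell's format cannot affect another.
# This is slower than a single vectorized call, but each value is judged alone.
parsed = df["Dirty Dates"].apply(lambda s: pd.to_datetime(s, errors="coerce"))

print(parsed.isna().tolist())
# [False, False, True, True]
```

The resulting `parsed.isna()` mask can stand in for `isna` inside `handle` if the vectorized call rejects too much.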

Let me know in the comments if you run into any problems.