Python datefinder's find_dates method is not returning the expected result

I have a column, 'Common Text', in a pandas DataFrame that contains dates in this format (only the first observation is shown here):

7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56

Adding a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'Common Text': [
    '7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56',
    'Adams Gill Christ  4 Oct 2017    02:52 PM',
    '4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34',
    '5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31',
    'the call was made on 20 Jun 2017\nbut call not found on system',
    'Call made on 7/03/2018 8:22:25 AM',
    'Review is during 30 May to 1 March 2018']})

But when I do something like this:

import datefinder
FD = datefinder.find_dates(df['Common Text'][0])

for dates in FD:
    print(dates)

I get the following result:

2018-07-09 11:59:37
2019-06-20 12:01:33
2019-06-20 00:01:56

This is incorrect: I expect only 2018-07-09 to be returned.

If I understand you correctly, and your data always has one of the two date structures you showed as examples, you can use a regular expression.

import pandas as pd

# Make example data
df = pd.DataFrame({'Common Text':['7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56', 
                                  'Adams Gill Christ  4 Oct 2017    02:52 PM', 
                                  '4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34', 
                                  '5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31', 
                                  'the call was made on 20 Jun 2017\nbut call not found on system']})

                                         Common Text
0         7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56
1          Adams Gill Christ  4 Oct 2017    02:52 PM
2          4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34
3        5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31
4  the call was made on 20 Jun 2017\nbut call not...

Use str.extract:

s = df['Common Text'].str.extract(r'(.+?(?=\s\d{1,2}:\d{2}:\d{2}))|(\d{1,2}\s[A-Za-z]{3}\s\d{4})')
df['Date'] = s[0].fillna(s[1])



                                         Common Text         Date
0         7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56    7/09/2018
1          Adams Gill Christ  4 Oct 2017    02:52 PM   4 Oct 2017
2          4/08/2017 4:30:49 PM ;4:37:23 PM;00:06:34    4/08/2017
3        5/07/2018 10:14:03 AM ;10:21:35 AM;00:07:31    5/07/2018
4  the call was made on 20 Jun 2017\nbut call not...  20 Jun 2017
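
The Date column above still holds plain strings. If real date values are needed, one possible follow-up step is sketched below using only the standard library. The helper name to_date and the two format strings are assumptions made for illustration: the slash-separated form is read as month/day/year because the question expects 7/09/2018 to become 2018-07-09, and the month abbreviations are assumed to be English.

```python
from datetime import datetime

# Date strings as produced by the extraction step above.
extracted = ['7/09/2018', '4 Oct 2017', '4/08/2017', '5/07/2018', '20 Jun 2017']

# Assumed formats (not confirmed by the original post):
#   month/day/year, matching the 2018-07-09 the question expects,
#   and day, abbreviated English month name, year.
FORMATS = ('%m/%d/%Y', '%d %b %Y')

def to_date(text):
    """Return a datetime.date for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

parsed = [to_date(s) for s in extracted]
for raw, value in zip(extracted, parsed):
    print(raw, '->', value)
```

If the slash-separated dates are actually day/month/year, swapping the first format to '%d/%m/%Y' would be the corresponding change.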

Explanation:

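  • (.+?(?=\s\d{1,2}:\d{2}:\d{2})): extracts everything that comes before the first time pattern, i.e. whitespace followed by a time of the form 99:99:99
  • (\d{1,2}\s[A-Za-z]{3}\s\d{4}): extracts the pattern: one or two digits, a space, three letters, a space, four digits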
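
The two alternatives described above can be tried outside pandas with the standard re module, which makes it easier to see what each one captures. This is only a sketch: the helper names and the STRICT pattern are hypothetical additions, not part of the original answer. Because the first alternative captures everything before the time, it also keeps any leading words, and a full month name such as "March" does not fit the three-letter alternative. STRICT is one possible tighter variant that matches the date token itself.

```python
import re

# The pattern from the answer, as a raw string so backslashes reach the regex engine unchanged.
PATTERN = re.compile(r'(.+?(?=\s\d{1,2}:\d{2}:\d{2}))|(\d{1,2}\s[A-Za-z]{3}\s\d{4})')

# Hypothetical stricter variant: match d/m/yyyy or "d Month yyyy" directly.
STRICT = re.compile(r'\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{1,2}\s[A-Za-z]{3,9}\s\d{4}\b')

samples = [
    '7/09/2018 11:59:37 AM;12:01:33 PM;00:01:56',
    'Adams Gill Christ  4 Oct 2017    02:52 PM',
    'Call made on 7/03/2018 8:22:25 AM',
    'Review is during 30 May to 1 March 2018',
]

def first_date_text(text):
    """Return whichever capture group of PATTERN matched first, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group(1) or m.group(2)

def strict_date_text(text):
    """Return the first date-shaped token found by STRICT, or None."""
    m = STRICT.search(text)
    return m.group(0) if m else None

results = [first_date_text(s) for s in samples]
strict_results = [strict_date_text(s) for s in samples]

for s, a, b in zip(samples, results, strict_results):
    print(repr(s), '| original:', repr(a), '| strict:', repr(b))
```

With pandas, the stricter pattern would need to be wrapped in one capturing group before being passed to str.extract, since that method requires at least one group.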