pandas 数据框中的多个日期格式到单个日期模式

Multiple Date Formate to a single date pattern in pandas dataframe

我有一个 pandas 日期列,格式如下

                                    Date
0                         March 13, 2020, March 13, 2020
1      3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2                                                    NaN
3                             May 20, 2022, May 21, 2022

我尝试将模式转换为单个模式以存储到新列。

import pandas as pd
import dateutil.parser

# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
 
# Create DataFrame
df = pd.DataFrame(data)
 
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))

但是我收到一个错误 AttributeError: 'str' 对象没有属性 'strftime'

期望输出

                                    Date                           DateFormatted
0                         March 13, 2020, March 13, 2020       2020-03-13, 2020-03-13   
1      3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021       2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2                                                    NaN       NaN                     
3                             May 20, 2022, May 21, 2022       2022-05-20, 2022-05-21

我是以前的解决方案的作者,所以可能的解决方案也是更改它以避免 , 像分隔符和日期字符串中的值一样使用 Series.str.extractall,转换为日期时间,最后是聚合 join:

format_list = ["[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
              "[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
              "(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
              "[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]

# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}

# Create DataFrame
df = pd.DataFrame(data)

import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
                                
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
                                                Name          DateFormatted
0                         Today is 09 September 2021             2021-09-09
1                                                NaN                    NaN
2                      25 December 2021 is christmas             2021-12-25
3  01/01/2022 is newyear and will be holiday on 0...  2022-01-01,2022-02-01

另一种方法是在生成理解中删除缺失值后处理列表 join:

import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
                                
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
                                                Name                   Date
0                         Today is 09 September 2021             2021-09-09
1                                                NaN                    NaN
2                      25 December 2021 is christmas             2021-12-25
3  01/01/2022 is newyear and will be holiday on 0...  2022-01-01,2022-02-01