Convert a string date column with ordinal day, abbreviated month name, and four-digit year to %Y-%m-%d

Given the following df, whose string date column holds an ordinal day, an abbreviated month name, and a four-digit year:

             date       oil       gas
0    1st Oct 2021       428        99
1   10th Sep 2021       401       101
2    2nd Oct 2020       189        74
3   10th Jan 2020       659       119
4    1st Nov 2019       691       130
5   30th Aug 2019       742       162
6   10th May 2019       805       183
7   24th Aug 2018       860       182
8    1st Sep 2017       759       183
9   10th Mar 2017       617       151
10  10th Feb 2017       591       149
11  22nd Apr 2016       343        88
12  10th Apr 2015       760       225
13  23rd Jan 2015      1317       316

How can I parse the date column into the standard %Y-%m-%d format?

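My thinking so far:

1. Strip the ordinal indicators ('st', 'nd', 'rd', 'th') from the date strings with re, keeping the day number.
2. Convert the abbreviated month name to a number (%b does not seem to do it).
3. Finally, convert the result to %Y-%m-%d.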

Code that might be useful for the first step:

# re.sub expects a single string, so use the vectorized Series.str.replace instead
df['date'].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)
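
For illustration, the three steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive solution: the sample rows are copied from the table above, the column name `date_iso` is a hypothetical name chosen here, and it assumes an English locale so that `%b` matches abbreviations such as `Oct` and `Sep`.

```python
import pandas as pd

# A few sample rows taken from the question's data.
df = pd.DataFrame(
    {"date": ["1st Oct 2021", "10th Sep 2021", "2nd Oct 2020", "23rd Jan 2015"]}
)

# Step 1: drop the ordinal suffix that directly follows a digit.
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Step 2: parse with an explicit format; %b matches abbreviated month names
# (assumption: English locale).
parsed = pd.to_datetime(cleaned, format="%d %b %Y")

# Step 3: render the datetimes as %Y-%m-%d strings.
# "date_iso" is a hypothetical column name used only for this example.
df["date_iso"] = parsed.dt.strftime("%Y-%m-%d")
print(df)
```

Passing an explicit `format` makes the parse strict, so a malformed row raises an error instead of being guessed at.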

Reference:

https://metacpan.org/release/DROLSKY/DateTime-Locale-0.46/view/lib/DateTime/Locale/en_US.pm#Months

pd.to_datetime already handles this case if you don't specify the format parameter:

>>> pd.to_datetime(df['date'])
0    2021-10-01
1    2021-09-10
2    2020-10-02
3    2020-01-10
4    2019-11-01
5    2019-08-30
6    2019-05-10
7    2018-08-24
8    2017-09-01
9    2017-03-10
10   2017-02-10
11   2016-04-22
12   2015-04-10
13   2015-01-23
Name: date, dtype: datetime64[ns]