Parse error when trying to remove time and change date format?

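I'm trying to remove the time from my social media dataset and change the date format so that it is compatible with my stock data when I merge the two datasets.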

Here is a sample of my social media dataset:

0       id      created_at
1       1       7:51 PM ET Fri, 17 July 2020
2       2       7:33 PM ET Fri, 17 July 2020
4       4       7:25 PM ET Fri, 17 July 2020
5       5       4:24 PM ET Fri, 17 July 2020
…       …       …
3076    3076    10:15 AM ET Tue, 26 Dec 2017
3077    3077    11:12 AM ET Thu, 20 Sept 2018
3078    3078    7:07 PM ET Fri, 22 Dec 2017
3079    3079    7:07 PM ET Fri, 22 Dec 2017
3080    3080    6:52 PM ET Fri, 22 Dec 2017

I want the dates to look like this:

Date        Open    High
2017-12-22  2684.22 2685.35
2017-12-26  2679.09 2682.74
2017-12-27  2682.10 2685.64
2017-12-28  2686.10 2687.66
2017-12-29  2689.15 2692.12

This is what I tried, but it didn't work:

pd.to_datetime(data['created_at'])

But I get this error:

 ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   2053         try:
-> 2054             values, tz_parsed = conversion.datetime_to_datetime64(data)
   2055             # If tzaware, these values represent unix timestamps, so we

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
<ipython-input-13-34e0ddb54ab0> in <module>
----> 1 pd.to_datetime(data['created_at'])

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    801             result = arg.map(cache_array)
    802         else:
--> 803             values = convert_listlike(arg._values, format)
    804             result = arg._constructor(values, index=arg.index, name=arg.name)
    805     elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    457         assert format is None or infer_datetime_format
    458         utc = tz == "utc"
--> 459         result, tz_parsed = objects_to_datetime64ns(
    460             arg,
    461             dayfirst=dayfirst,

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   2057             return values.view("i8"), tz_parsed
   2058         except (ValueError, TypeError):
-> 2059             raise e
   2060 
   2061     if tz_parsed is not None:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   2042 
   2043     try:
-> 2044         result, tz_parsed = tslib.array_to_datetime(
   2045             data,
   2046             errors=errors,

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime_object()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime_object()

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

~/opt/anaconda3/lib/python3.8/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1366         return parser(parserinfo).parse(timestr, **kwargs)
   1367     else:
-> 1368         return DEFAULTPARSER.parse(timestr, **kwargs)
   1369 
   1370 

~/opt/anaconda3/lib/python3.8/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    641 
    642         if res is None:
--> 643             raise ParserError("Unknown string format: %s", timestr)
    644 
    645         if len(res) == 0:

ParserError: Unknown string format: created_at 

Thanks for your help :)

Edit: Sample of dataset

Split on ', ', keep the second part (the date), and use pd.to_datetime to convert it to datetime:

>>> pd.to_datetime(df['created_at'].str.split(', ').str[1])
1      2020-07-17
2      2020-07-17
4      2020-07-17
5      2020-07-17
3076   2017-12-26
3077   2018-09-20
3078   2017-12-22
3079   2017-12-22
3080   2017-12-22
Name: created_at, dtype: datetime64[ns]
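
For the merge mentioned in the question, here is a minimal sketch. The frame names social and stocks and their few rows are made up from the samples above, not taken from your real data. It parses each date with dateutil's parser.parse instead of passing the strings straight to pd.to_datetime, on the assumption that you may be on pandas 2.x, which infers a single format from the first value while the sample mixes 'July', 'Dec' and 'Sept'.

```python
import pandas as pd
from dateutil import parser

# Hypothetical miniature versions of the two datasets from the question.
social = pd.DataFrame({
    "id": [1, 3076, 3077],
    "created_at": [
        "7:51 PM ET Fri, 17 July 2020",
        "10:15 AM ET Tue, 26 Dec 2017",
        "11:12 AM ET Thu, 20 Sept 2018",
    ],
})
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-22", "2017-12-26"]),
    "Open": [2684.22, 2679.09],
    "High": [2685.35, 2682.74],
})

# Keep only the text after ', ' (for example '17 July 2020'), dropping the time.
date_text = social["created_at"].str.split(", ").str[1]

# Parse value by value so mixed month spellings do not matter.
social["Date"] = pd.to_datetime(date_text.map(parser.parse))

# Left merge: posts made on non-trading days keep NaN in Open and High.
merged = social.merge(stocks, on="Date", how="left")
print(merged[["id", "Date", "Open", "High"]])
```

A left merge keeps every post; switch to how="inner" if you only want posts that fall on a trading day.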

Old answer

You can use the dateutil package (already installed with pandas):

from dateutil import parser

>>> df['created_at'].apply(parser.parse, tzinfos={'ET': -4*3600})

1      2020-07-17 19:51:00-04:00
2      2020-07-17 19:33:00-04:00
4      2020-07-17 19:25:00-04:00
5      2020-07-17 16:24:00-04:00
3076   2017-12-26 10:15:00-04:00
3077   2018-09-20 11:12:00-04:00
3078   2017-12-22 19:07:00-04:00
3079   2017-12-22 19:07:00-04:00
3080   2017-12-22 18:52:00-04:00
Name: created_at, dtype: datetime64[ns, tzoffset('ET', -14400)]

If needed, you can add other time zones to the tzinfos dictionary.
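
One caveat with a fixed offset: -4*3600 is only right while daylight saving time is in effect, so the December rows above are shown as -04:00 although New York is at -05:00 then. A possible variation, assuming 'ET' really means US Eastern time, is to map the abbreviation to a full zone from dateutil.tz. The 'PT' entry is a hypothetical extra, just to show a second key.

```python
from dateutil import parser, tz

# Assumption: "ET" is US Eastern time. "PT" is a hypothetical second entry.
TZINFOS = {
    "ET": tz.gettz("America/New_York"),
    "PT": tz.gettz("America/Los_Angeles"),
}

summer = parser.parse("7:51 PM ET Fri, 17 July 2020", tzinfos=TZINFOS)
winter = parser.parse("7:07 PM ET Fri, 22 Dec 2017", tzinfos=TZINFOS)

print(summer.isoformat())  # 2020-07-17T19:51:00-04:00
print(winter.isoformat())  # 2017-12-22T19:07:00-05:00
```

With the column from the question this would be df['created_at'].apply(parser.parse, tzinfos=TZINFOS). Because the offsets then differ between rows, the result may come back as object dtype rather than a single datetime64 dtype.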

Update

ParserError: Unknown string format: created_at.

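This exception is raised because the column df['created_at'] contains the literal string 'created_at' as a value (row 0 of your sample appears to hold the header text). The example below reproduces the same error with 'hello' as a stand-in: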

>>> df
   id                    created_at
0   0                         hello  # <- it's not a valid datetime
1   1  7:51 PM ET Fri, 17 July 2020
2   2  7:33 PM ET Fri, 17 July 2020

>>> df['created_at'].apply(parser.parse, tzinfos={'ET': -4*3600})

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)

...

ParserError: Unknown string format: hello  # 'hello' is not a valid datetime

To find the invalid entries, select every row whose value contains neither 'AM' nor 'PM':

>>> df.loc[~df['created_at'].str.contains(r'(?:AM|PM)'), 'created_at']

0    hello
Name: created_at, dtype: object
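
Once the bad rows are identified, one way to proceed is to drop them before parsing. This is only a sketch: the three-row frame is made up, and it assumes that a stray header row like the one in your sample is the only problem and that such rows are safe to discard.

```python
import pandas as pd
from dateutil import parser

# Hypothetical frame in which the header text ended up as the first data row.
df = pd.DataFrame({
    "id": ["id", "1", "2"],
    "created_at": [
        "created_at",
        "7:51 PM ET Fri, 17 July 2020",
        "7:33 PM ET Fri, 17 July 2020",
    ],
})

# True for rows that look like timestamps; na=False treats missing values as bad.
has_time = df["created_at"].str.contains(r"(?:AM|PM)", na=False)

# Inspect what is about to be dropped before discarding it.
print(df.loc[~has_time, "created_at"])

# Keep the valid rows and parse only the date part after ', '.
clean = df.loc[has_time].copy()
clean["Date"] = pd.to_datetime(
    clean["created_at"].str.split(", ").str[1].map(parser.parse)
)
print(clean[["id", "Date"]])
```

If the stray row comes from reading the file, it may be simpler to fix it at load time (for example by checking the header argument of the read call) than to filter afterwards.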