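Parse error when trying to remove time and change date format?
I am trying to remove the time from my social media dataset and change the date format so that it is compatible with my stock data when I merge the two datasets.
Here is a sample of my social media dataset: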
0 id created_at
1 1 7:51 PM ET Fri, 17 July 2020
2 2 7:33 PM ET Fri, 17 July 2020
4 4 7:25 PM ET Fri, 17 July 2020
5 5 4:24 PM ET Fri, 17 July 2020
… … …
3076 3076 10:15 AM ET Tue, 26 Dec 2017
3077 3077 11:12 AM ET Thu, 20 Sept 2018
3078 3078 7:07 PM ET Fri, 22 Dec 2017
3079 3079 7:07 PM ET Fri, 22 Dec 2017
3080 3080 6:52 PM ET Fri, 22 Dec 2017
I want the dates to look like the ones in my stock data:
Date Open High
2017-12-22 2684.22 2685.35
2017-12-26 2679.09 2682.74
2017-12-27 2682.10 2685.64
2017-12-28 2686.10 2687.66
2017-12-29 2689.15 2692.12
Here is what I tried, which did not work:
pd.to_datetime(data['created_at'])
But I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
2053 try:
-> 2054 values, tz_parsed = conversion.datetime_to_datetime64(data)
2055 # If tzaware, these values represent unix timestamps, so we
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
ParserError Traceback (most recent call last)
<ipython-input-13-34e0ddb54ab0> in <module>
----> 1 pd.to_datetime(data['created_at'])
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
801 result = arg.map(cache_array)
802 else:
--> 803 values = convert_listlike(arg._values, format)
804 result = arg._constructor(values, index=arg.index, name=arg.name)
805 elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
457 assert format is None or infer_datetime_format
458 utc = tz == "utc"
--> 459 result, tz_parsed = objects_to_datetime64ns(
460 arg,
461 dayfirst=dayfirst,
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
2057 return values.view("i8"), tz_parsed
2058 except (ValueError, TypeError):
-> 2059 raise e
2060
2061 if tz_parsed is not None:
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
2042
2043 try:
-> 2044 result, tz_parsed = tslib.array_to_datetime(
2045 data,
2046 errors=errors,
pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()
pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime_object()
pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime_object()
pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()
~/opt/anaconda3/lib/python3.8/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
1366 return parser(parserinfo).parse(timestr, **kwargs)
1367 else:
-> 1368 return DEFAULTPARSER.parse(timestr, **kwargs)
1369
1370
~/opt/anaconda3/lib/python3.8/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
641
642 if res is None:
--> 643 raise ParserError("Unknown string format: %s", timestr)
644
645 if len(res) == 0:
ParserError: Unknown string format: created_at
Thanks for your help :)
Edit: Sample of dataset
Split each value on ', ', keep the second part (the date), and convert it to datetime with pd.to_datetime:
>>> pd.to_datetime(df['created_at'].str.split(', ').str[1])
1 2020-07-17
2 2020-07-17
4 2020-07-17
5 2020-07-17
3076 2017-12-26
3077 2018-09-20
3078 2017-12-22
3079 2017-12-22
3080 2017-12-22
Name: created_at, dtype: datetime64[ns]
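Since the goal is to merge with the stock data, here is a minimal sketch of the whole path from raw strings to a merged frame. The frame names social and stocks, the new Date column, and the inner join are my assumptions, not something from the question. Each value is parsed on its own with dateutil, so the mix of 'July', 'Dec' and 'Sept' does not rely on pandas inferring one format for the whole column.

```python
import pandas as pd
from dateutil import parser

# A few rows copied from the question's sample
social = pd.DataFrame({
    "id": [1, 3076, 3077, 3078],
    "created_at": [
        "7:51 PM ET Fri, 17 July 2020",
        "10:15 AM ET Tue, 26 Dec 2017",
        "11:12 AM ET Thu, 20 Sept 2018",
        "7:07 PM ET Fri, 22 Dec 2017",
    ],
})

# Hypothetical stock frame shaped like the one in the question
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-22", "2017-12-26"]),
    "Open": [2684.22, 2679.09],
    "High": [2685.35, 2682.74],
})

# Keep only the text after ", " (for example "17 July 2020")
date_text = social["created_at"].str.split(", ").str[1]

# Parse value by value, write each date as YYYY-MM-DD, then let pandas
# convert the now uniform strings to datetime64
iso_text = date_text.map(lambda s: parser.parse(s).date().isoformat())
social["Date"] = pd.to_datetime(iso_text)

# Inner join keeps only the posts that fall on a day present in the stock data
merged = social.merge(stocks, on="Date", how="inner")
print(merged[["id", "Date", "Open", "High"]])
```

With how="inner", posts from days without a stock row (weekends, holidays) are dropped; how="left" would keep them with NaN prices instead.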
Old answer
You can use the dateutil package (already installed alongside pandas):
>>> from dateutil import parser
>>> df['created_at'].apply(parser.parse, tzinfos={'ET': -4*3600})
1 2020-07-17 19:51:00-04:00
2 2020-07-17 19:33:00-04:00
4 2020-07-17 19:25:00-04:00
5 2020-07-17 16:24:00-04:00
3076 2017-12-26 10:15:00-04:00
3077 2018-09-20 11:12:00-04:00
3078 2017-12-22 19:07:00-04:00
3079 2017-12-22 19:07:00-04:00
3080 2017-12-22 18:52:00-04:00
Name: created_at, dtype: datetime64[ns, tzoffset('ET', -14400)]
If needed, you can add other time zones to the tzinfos dictionary.
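One caveat with a fixed offset: -4*3600 is Eastern Daylight Time, so winter rows such as 26 Dec 2017 come out one hour off (Eastern Standard Time is UTC-5). Below is a minimal sketch that maps the abbreviation to a full time zone instead. It assumes 'ET' in your data means US Eastern time. The 'CT' entry is only there to show how to add a second zone and does not come from your dataset.

```python
from dateutil import parser, tz

# Assumption: "ET" is US Eastern time. "CT" is an illustrative extra entry.
tzinfos = {
    "ET": tz.gettz("America/New_York"),
    "CT": tz.gettz("America/Chicago"),
}

summer = parser.parse("7:51 PM ET Fri, 17 July 2020", tzinfos=tzinfos)
winter = parser.parse("10:15 AM ET Tue, 26 Dec 2017", tzinfos=tzinfos)

# The offset now follows daylight saving time instead of being fixed
print(summer.isoformat())
print(winter.isoformat())
```

The July timestamp should carry a -04:00 offset and the December one -05:00. If you only need the calendar date, the offset does not matter and the split approach above is simpler.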
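Update
ParserError: Unknown string format: created_at.
This exception is raised because the column df['created_at'] contains the literal string 'created_at' as one of its values, most likely a header row that was read in as data (row 0 of your sample holds id and created_at). Any value that is not a datetime triggers the same error. For example: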
>>> df
id created_at
0 0 hello # <- it's not a valid datetime
1 1 7:51 PM ET Fri, 17 July 2020
2 2 7:33 PM ET Fri, 17 July 2020
>>> df['created_at'].apply(parser.parse, tzinfos={'ET': -4*3600})
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
...
ParserError: Unknown string format: hello # 'hello' is not a valid datetime
To find the invalid entries, select all rows whose value contains neither 'AM' nor 'PM':
>>> df.loc[~df['created_at'].str.contains(r'(?:AM|PM)'), 'created_at']
0 hello
Name: created_at, dtype: object
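The AM/PM check catches a stray header row, but it would miss a value that contains 'PM' and is still not a date. Another way to find the bad rows is to attempt the parse and collect whatever fails. This is only a sketch: try_parse is a hypothetical helper name, and the small frame imitates a header row that ended up in the data.

```python
import pandas as pd
from dateutil import parser

# Small frame imitating the problem: the header was read in as the first data row
df = pd.DataFrame({
    "id": ["id", "1", "3076"],
    "created_at": [
        "created_at",
        "7:51 PM ET Fri, 17 July 2020",
        "10:15 AM ET Tue, 26 Dec 2017",
    ],
})

def try_parse(value):
    """Return the parsed datetime, or None if dateutil cannot read the value."""
    try:
        return parser.parse(value, tzinfos={"ET": -4 * 3600})
    except (ValueError, OverflowError, TypeError):
        # dateutil's ParserError is a subclass of ValueError
        return None

parsed = df["created_at"].map(try_parse)

bad_rows = df[parsed.isna()]   # rows that could not be parsed
clean = df[parsed.notna()]     # rows that are safe to convert

print(bad_rows)
```

Once you know which rows are bad, you can drop them (here clean) before converting, or fix how the file is read so the header is not treated as data.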