How do I find the rows that cause an error in a DataFrame in Python?

 df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"],format='%d-%m-%y')

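I am trying to convert a date column. The dataset contains more than 1 million rows, and I need to find the rows whose dates fail to convert.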

TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-124-d701d963ff8c> in <module>
 ----> 1 df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"],format='%d-%m-%y')

c:\users\dell\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py 
in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, 
origin, cache)
803             result = arg.map(cache_array)
804         else:
--> 805             values = convert_listlike(arg._values, format)
806             result = arg._constructor(values, index=arg.index, name=arg.name)
807     elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

c:\users\dell\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py 
in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, 
yearfirst, exact)
458                 return DatetimeIndex._simple_new(dta, name=name)
459             except (ValueError, TypeError):
--> 460                 raise e
461 
462     if result is None:

c:\users\dell\appdata\local\programs\python\python39\lib\site-packages\pandas\core\tools\datetimes.py 
in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, 
yearfirst, exact)
421             if result is None:
422                 try:
--> 423                     result, timezones = array_strptime(
424                         arg, format, exact=exact, errors=errors
425                     )

pandas\_libs\tslibs\strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()

ValueError: unconverted data remains: 12

You can loop over the values with try and except:

causing_error_list = []
for x in df["Dt_Customer"].values:
    try:
        pd.to_datetime(x, format='%d-%m-%y')
    except (ValueError, TypeError):
        # keep every value that cannot be parsed with this format
        causing_error_list.append(x)
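
If the row position matters as well as the value, the same loop can record the index label and the error message. This is only a minimal sketch: the three sample values and the name `bad_rows` are made up for illustration and are not taken from the question's dataset.

```python
import pandas as pd

# Hypothetical sample data standing in for the real column
df = pd.DataFrame({"Dt_Customer": ["28-12-20", "28-12-2020", "not a date"]})

bad_rows = {}  # index label -> (original value, error message)
for idx, value in df["Dt_Customer"].items():
    try:
        pd.to_datetime(value, format="%d-%m-%y")
    except (ValueError, TypeError) as exc:
        # remember where the failure happened and why
        bad_rows[idx] = (value, str(exc))

for idx, (value, message) in bad_rows.items():
    print(idx, repr(value), message)
```

In this sketch the second value fails with an "unconverted data remains" error, which is what a four-digit year produces under `%y`. If the real data looks like that, `%Y` may be the directive the column needs. Note that a Python-level loop over a million rows will be slow.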

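An efficient solution is to parse the date strings to datetime with the errors keyword set to 'coerce'. Invalid strings then become NaT (not a time). Calling .isnull() on the result gives a boolean mask, which you can use to extract the corresponding values.

For example: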

import pandas as pd

df = pd.DataFrame({"Dt_Customer": ["28-12-2020", "not a date"]})

invalid = df.loc[pd.to_datetime(df["Dt_Customer"],
                                format='%d-%m-%Y',
                                errors='coerce').isnull(), "Dt_Customer"]

print(invalid)
1    not a date
Name: Dt_Customer, dtype: object

Note that you can also omit the format keyword to make the check less specific, i.e. to accept any date/time format the parser can handle.
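
One thing to watch with errors='coerce' is that values which were already missing also come back as NaT. A possible way to separate those from real parse failures is to combine the mask with .notna() on the original column. The sketch below assumes made-up sample data; the names `parsed`, `failed`, and `Dt_Customer_parsed` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: two valid dates, one bad string, one missing value
df = pd.DataFrame(
    {"Dt_Customer": ["28-12-2020", "not a date", None, "04-09-2012"]}
)

# Parse once; anything that does not match the format becomes NaT
parsed = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y", errors="coerce")

# A row counts as a parse failure only if it had a value to begin with
failed = parsed.isna() & df["Dt_Customer"].notna()

print(df.loc[failed, "Dt_Customer"])

# Store the parsed result in a new column so the raw strings stay available
df["Dt_Customer_parsed"] = parsed
```

Because the parsing is vectorized, this should scale to a column with over a million rows far better than a Python-level loop.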