pd.read_csv not correctly parsing date/month fields when parse_dates=['column name'] is set
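I ran into this error while trying to parse a few dates through the parse_dates option of pandas.read_csv(). In the code snippet below, I am trying to parse dates in dd/mm/yy format, and the conversion comes out wrong. In some cases the day field is treated as the month and vice versa.
To keep it simple: in some cases dd/mm/yy is converted to yyyy-dd-mm instead of yyyy-mm-dd.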
Case 1:
04/10/96 is parsed as 1996-04-10, which is wrong.
Case 2:
15/07/97 is parsed as 1997-07-15, which is correct.
Case 3:
10/12/97 is parsed as 1997-10-12, which is wrong.
Sample code:
import pandas as pd
# Read the column as plain strings first
df = pd.read_csv('date_time.csv')
print('Data in csv:')
print(df)
print(df['start_date'].dtypes)
print('----------------------------------------------')
# Read again, letting pandas parse the column as dates
df = pd.read_csv('date_time.csv', parse_dates=['start_date'])
print('Data after parsing:')
print(df)
print(df['start_date'].dtypes)
Current output:
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]
Expected output:
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]
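More comments:
I could use date_parser or pandas.to_datetime() to specify the proper date format. But in my case, I have a few date fields like ['//1997', '/02/1967'] that I need to convert to ['01/01/1997', '01/02/1967']. parse_dates converts those kinds of date fields to the expected format without my having to write extra lines of code.
Is there a solution for this?
Bug link on GitHub: https://github.com/pydata/pandas/issues/13063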
In pandas version 0.18.0 you can add the parameter dayfirst=True, and then it works:
import pandas as pd
import io
temp=u"""start_date
04/10/96
15/07/97
10/12/97
06/03/99
//1994
/02/1967
"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=['start_date'], dayfirst=True)
print(df)
start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
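To see what dayfirst changes in isolation, here is a minimal sketch using pd.to_datetime on a single ambiguous string from the question. The variable names are made up for illustration, and depending on the pandas version a warning about format inference may be printed.

```python
import pandas as pd

# "04/10/96" is ambiguous: it could be April 10 or October 4.
ambiguous = "04/10/96"

# Default behaviour: the first field is read as the month.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first field is read as the day.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date())
print(day_first.date())
```

A string such as 15/07/97 is not ambiguous, because 15 cannot be a month, which would explain why Case 2 in the question came out right even without dayfirst.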
Another solution:
You can use to_datetime with different format parameters and errors='coerce', and then merge the results with combine_first:
date1 = pd.to_datetime(df['start_date'], format='%d/%m/%y', errors='coerce')
print(date1)
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 NaT
5 NaT
Name: start_date, dtype: datetime64[ns]
date2 = pd.to_datetime(df['start_date'], format='/%m/%Y', errors='coerce')
print(date2)
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
5 1967-02-01
Name: start_date, dtype: datetime64[ns]
date3 = pd.to_datetime(df['start_date'], format='//%Y', errors='coerce')
print(date3)
0 NaT
1 NaT
2 NaT
3 NaT
4 1994-01-01
5 NaT
Name: start_date, dtype: datetime64[ns]
print(date1.combine_first(date2).combine_first(date3))
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
Name: start_date, dtype: datetime64[ns]
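If there are more than a handful of formats, the same idea can be wrapped in a loop. This is only a sketch: parse_with_formats is a hypothetical helper name, not a pandas function, and it assumes that each value matches at most one of the given formats.

```python
import io

import pandas as pd


def parse_with_formats(series, formats):
    """Try each format in turn and keep the first successful parse per row.

    Hypothetical helper for illustration; not part of pandas.
    """
    result = None
    for fmt in formats:
        # Rows that do not match this format become NaT
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        # Fill the gaps left by earlier formats with this attempt
        result = parsed if result is None else result.combine_first(parsed)
    return result


temp = u"""start_date
04/10/96
15/07/97
10/12/97
06/03/99
//1994
/02/1967
"""

# Read the column as plain strings, then parse it explicitly
df = pd.read_csv(io.StringIO(temp))
dates = parse_with_formats(df["start_date"], ["%d/%m/%y", "/%m/%Y", "//%Y"])
print(dates)
```

Rows that match none of the formats stay NaT, so dates.isna() can be used to check for values that still need handling.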