How do I prevent NULLs from causing the wrong datatype in a DataFrame?
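I'm using Pandas "read_sql" to pull a dataset from SQL Server into a dataframe, using Pypyodbc. However, it looks like sometimes (not always), when there are NULLs in a field, the dtype is float64 instead of int64.
I have two fields that are both declared as INT in SQL Server. One sometimes has NULLs, and the other appears to always have NULLs.
Here's the schema in SQL Server: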
[PLAN_ID] [int] NULL,
[DESTINATION_ID] [int] NULL,
[STORE_ID] [int] NULL,
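If I inspect these fields using __dict__, I see the following (there are other fields, but I'm not sure how to read the dict output, so I've included the preceding line):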
Name: plan_id, Length: 13193, dtype: int64, 'destination_id': 0 None
1 None
2 None
3 None
4 None
...
13188 None
13189 None
13190 None
13191 None
13192 None
Name: dest_id, Length: 13193, dtype: object, 'store_id': 0 175635.0
1 180942.0
2 NaN
3 NaN
4 NaN
...
13188 59794.0
13189 180015.0
13190 94819.0
13191 184716.0
13192 182301.0
Name: store_id, Length: 13193, dtype: float64, 'version': 0
Here's the code I'm using:
import pandas as pd
import pypyodbc
from datetime import timedelta, date

start_date = date(2019, 5, 1)
end_date = date(2019, 5, 2)
daterange = pd.date_range(start_date, end_date)

con_string = ('Driver={SQL Server};'
              'Server=mysqlservername;'
              'Database=mydbname;'
              'App=PythonPull;'  # It's not "application name"!
              'Trusted_Connection=yes')
cnxn = pypyodbc.connect(con_string)

for single_date in daterange:
    datestr = single_date.strftime("%Y-%m-%d")
    print(datestr)
    tablelist = ["mytablenamehere"]
    for item in tablelist:
        query = f"""
        declare @start_date datetime = '{datestr}'
        declare @end_date datetime = dateadd(day,1,'{datestr}')
        SELECT id, customerid FROM mydbname.dbo.{item} with (nolock)
        where submitted >= @start_date and submitted < @end_date
        order by submitted
        """
        result_list = pd.read_sql(query, cnxn)
        # at this point, running result_list.__dict__ shows that id is an int64, but customerid is a float64
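Here is a neat trick, available in pandas 0.24.0+:
Use astype with pd.Int64Dtype, one of the nullable integer data types. A plain int64 column cannot hold a missing value, so pandas falls back to float64 (with NaN) as soon as a NULL shows up; Int64 keeps the values as integers and marks the missing entries instead.
MVCE: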
import numpy as np
import pandas as pd
l = [1, 2, 3, np.nan]
s = pd.Series(l)
Output:
0 1.0
1 2.0
2 3.0
3 NaN
dtype: float64
s.dtype
dtype('float64')
Now, let's use astype with pd.Int64Dtype:
s = s.astype(pd.Int64Dtype())
Output of s:
0 1
1 2
2 3
3 NaN
dtype: Int64
s.dtype
Int64Dtype()
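The same conversion can be applied to a whole DataFrame, which is closer to the read_sql case in the question. The following is only a sketch: the frame below is a small hand-built stand-in for the query result (its values and the choice of columns are assumptions, not real data), and it relies on DataFrame.astype accepting a column-to-dtype mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_sql:
# store_id came back as float64 because of NULLs, and
# destination_id came back as object because every value was NULL.
df = pd.DataFrame({
    "plan_id": [1, 2, 3],
    "destination_id": [None, None, None],
    "store_id": [175635.0, 180942.0, np.nan],
})

# Convert only the columns that are INT NULL in the database schema.
df = df.astype({
    "destination_id": pd.Int64Dtype(),
    "store_id": pd.Int64Dtype(),
})

print(df.dtypes)
```

This has to be done after each read_sql call, since the driver hands back the NULLs before pandas knows the intended type. Listing the columns explicitly keeps genuine float columns from being converted by accident.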