How do I prevent NULLs from causing the wrong datatype in a DataFrame?

Question

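I'm pulling a dataset from SQL Server into a dataframe with pandas "read_sql", using pypyodbc. However, it looks like sometimes (not always), when a field contains NULLs, the dtype is float64 instead of int64.

I have two fields that are both declared as INT in SQL Server. One sometimes has NULLs; the other appears to always have NULLs.

Here's the schema in SQL Server: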

[PLAN_ID] [int] NULL,
[DESTINATION_ID] [int] NULL,
[STORE_ID] [int] NULL,

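If I look at these fields using __dict__, I see the following (there are other fields too, but I'm not sure how to read the dict, so I've included the preceding line):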

Name: plan_id, Length: 13193, dtype: int64, 'destination_id': 0        None
1        None
2        None
3        None
4        None
         ...
13188    None
13189    None
13190    None
13191    None
13192    None
Name: dest_id, Length: 13193, dtype: object, 'store_id': 0        175635.0
1        180942.0
2             NaN
3             NaN
4             NaN
           ...
13188     59794.0
13189    180015.0
13190     94819.0
13191    184716.0
13192    182301.0
Name: store_id, Length: 13193, dtype: float64, 'version': 0

Here's the code I'm using:

import pandas as pd
import pypyodbc
from datetime import timedelta, date

start_date = date(2019, 5, 1)
end_date = date(2019, 5, 2)
daterange = pd.date_range(start_date, end_date)

con_string = ('Driver={SQL Server};'
'Server=mysqlservername;'
'Database=mydbname;'
'App=PythonPull;'  #It's not "application name"!
'Trusted_Connection=yes')
cnxn = pypyodbc.connect(con_string)


for single_date in daterange:
    datestr = single_date.strftime("%Y-%m-%d")
    print(datestr)
    tablelist = ["mytablenamehere"]
    for item in tablelist:
        query = f"""
        declare @start_date datetime = '{datestr}'
        declare @end_date   datetime  = dateadd(day,1,'{datestr}')
        SELECT id, customerid FROM mydbname.dbo.{item} with (nolock)
        where submitted >= @start_date and submitted < @end_date
        order by submitted
        """
        result_list = pd.read_sql(query, cnxn)
        # At this point, result_list.__dict__ shows that id is int64, but customerid is float64

Answer 1

Here's a neat trick, available in pandas 0.24.0+:

Use astype with pd.Int64Dtype, the nullable integer data type.

MCVE:

import numpy as np
import pandas as pd
l = [1, 2, 3, np.nan]
s = pd.Series(l)

Output:

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

s.dtype

dtype('float64')
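
For context, here is a minimal sketch of why the three columns in the question end up with three different dtypes. NumPy's int64 has no way to represent a missing value, so pandas falls back to float64 (with NaN) when some values are missing, and to object when every value is None. The Series names below are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

# No missing values: pandas can keep a plain NumPy integer dtype.
no_nulls = pd.Series([1, 2, 3])

# Some missing values: the column is upcast to float64 so NaN can mark the gaps.
some_nulls = pd.Series([1, 2, np.nan])

# Only missing values: nothing numeric to infer from, so the dtype is object.
all_nulls = pd.Series([None, None, None])

print(no_nulls.dtype, some_nulls.dtype, all_nulls.dtype)
```

This matches the pattern in the question: plan_id (no NULLs) is int64, store_id (some NULLs) is float64, and destination_id (all NULLs) is object.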

Now, let's use astype with pd.Int64Dtype:

s = s.astype(pd.Int64Dtype())

Output of s:

0      1
1      2
2      3
3    NaN
dtype: Int64

s.dtype

Int64Dtype()
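
Applied to a read_sql result, the same cast might look like the sketch below. It assumes an in-memory SQLite database as a stand-in for SQL Server so that it can run anywhere. The table name plans and the sample rows are made up for illustration; the column names are borrowed from the schema in the question.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite stands in for the SQL Server connection.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plans (plan_id INT, destination_id INT, store_id INT)")
con.executemany(
    "INSERT INTO plans VALUES (?, ?, ?)",
    [(1, None, 175635), (2, None, None), (3, None, 59794)],
)

df = pd.read_sql(
    "SELECT plan_id, destination_id, store_id FROM plans ORDER BY plan_id", con
)
print(df.dtypes)  # dtypes as inferred by pandas, before any cast

# Cast every column that is INT in the database to the nullable Int64 dtype.
# The string alias "Int64" (capital I) is equivalent to pd.Int64Dtype().
int_columns = ["plan_id", "destination_id", "store_id"]
df = df.astype({col: "Int64" for col in int_columns})
print(df.dtypes)  # every listed column is now Int64, with missing values kept
```

The cast from float64 only succeeds when the non-missing values are whole numbers, which should hold for a column declared INT. On pandas 1.0 and later, df.convert_dtypes() is another option that picks nullable dtypes automatically instead of listing the columns by hand.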
