Sending a Pandas DataFrame with Int64 type to a GCP Spanner INT64 column

Question

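I'm working with a Pandas DataFrame. One of its columns comes from a CSV and contains integers mixed with nulls.

I'm trying to convert it and insert it into Spanner in as generic a way as possible (so that I can reuse the same code in future jobs), which limits my ability to use sentinel values. However, a DataFrame can't hold NaN in a plain int column, so you have to use Int64. When I try to insert that into Spanner, I get an error saying the value is not of type int64, whereas plain Python ints do work. Is there a way to automatically convert Pandas Int64 values to int values during the insert? Because of the nulls, converting the column before inserting doesn't work either. Is there another workaround?

Attempting the conversion from Series looks like this: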

>>> s2 = pd.Series([3.0, 5.0])
>>> s2
0    3.0
1    5.0
dtype: float64
>>> s1 = pd.Series([3.0, None])
>>> s1
0    3.0
1    NaN
dtype: float64
>>> df = pd.DataFrame(data=[s1, s2], dtype=np.int64)
>>> df
   0    1
0  3  NaN
1  3  5.0
>>> df = pd.DataFrame(data={"nullable": s1, "nonnullable": s2}, dtype=np.int64)

The last command raises: ValueError: Cannot convert non-finite values (NA or inf) to integer
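For reference, here is a minimal sketch of the difference between the two dtypes involved, assuming a pandas version that ships the nullable "Int64" extension dtype. The variable names (numpy_cast_failed, nullable) are only illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([3.0, None])

# NumPy's int64 has no representation for a missing value,
# so casting a Series that contains NaN raises a ValueError.
try:
    s1.astype(np.int64)
    numpy_cast_failed = False
except ValueError:
    numpy_cast_failed = True

# The pandas extension dtype "Int64" (capital I) keeps the integers
# and stores the gap as pd.NA instead of NaN.
nullable = s1.astype("Int64")
print(numpy_cast_failed)
print(nullable)
```

The values in an "Int64" column are still not plain Python ints, which is presumably why the Spanner client rejects them.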

Answer 1

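I can't reproduce your problem; everything seems to work as expected for me.

Do you perhaps have a non-nullable column that you are writing null values to?

Retrieving the schema of the Spanner table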

from google.cloud import spanner

client = spanner.Client()
database = client.instance('testinstance').database('testdatabase')
table_name='inttable'

query = f'''
SELECT
t.column_name,
t.spanner_type,
t.is_nullable
FROM
information_schema.columns AS t
WHERE
t.table_name = '{table_name}'
'''

with database.snapshot() as snapshot:
    print(list(snapshot.execute_sql(query)))
    # [['nonnullable', 'INT64', 'NO'], ['nullable', 'INT64', 'YES']]

Inserting from a Pandas DataFrame into Spanner

from google.cloud import spanner

import numpy as np
import pandas as pd

client = spanner.Client()
instance = client.instance('testinstance')
database = instance.database('testdatabase')


def insert(df):
    with database.batch() as batch:
        batch.insert(
            table='inttable',
            columns=(
                'nonnullable', 'nullable'),
            values=df.values.tolist()
        )

print("Succeeds in inserting int rows.")
d = {'nonnullable': [1, 2], 'nullable': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Succeeds in inserting rows with None in nullable columns.")
d = {'nonnullable': [3, 4], 'nullable': [None, 6]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Fails (as expected) when inserting a row with None in a non-nullable column.")
d = {'nonnullable': [5, None], 'nullable': [6, 0]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)
# Fails with "google.api_core.exceptions.FailedPrecondition: 400 nonnullable must not be NULL in table inttable."
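If it helps, the check suggested above can be done before calling insert. This is only a sketch: find_null_violations is a hypothetical helper, and schema_rows is assumed to have the same [column_name, spanner_type, is_nullable] shape that the schema query above prints.

```python
import pandas as pd


def find_null_violations(df, schema_rows):
    """Return the names of NOT NULL columns that contain missing values in df."""
    # Columns whose is_nullable flag is 'NO' must not receive NULLs.
    not_null_columns = [
        name for name, _spanner_type, is_nullable in schema_rows
        if is_nullable == "NO"
    ]
    return [
        name for name in not_null_columns
        if name in df.columns and df[name].isna().any()
    ]


# Same shape as the output of the information_schema query.
schema_rows = [["nonnullable", "INT64", "NO"], ["nullable", "INT64", "YES"]]
df = pd.DataFrame({"nonnullable": [5, None], "nullable": [6, 0]})

violations = find_null_violations(df, schema_rows)
print(violations)
```

An empty list means no NOT NULL column would receive a NULL from this DataFrame.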

Answer 2

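My solution was to leave the values as NaN (as it turns out, NaN ends up as 'nan'). Then, at the very end, when inserting into the Spanner DB, I replace every NaN in the DF with None. I used code from another SO answer: df.replace({pd.np.nan: None}). (pd.np was removed in pandas 2.0; with NumPy imported separately, the equivalent is df.replace({np.nan: None}).) Spanner treats NaN as the string 'nan' and refuses to insert it into an INT64 column. None is treated as NULL and inserts into Spanner without any problem.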
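A variation on the same idea, as a rough sketch rather than a tested Spanner recipe: build the row list yourself, turning every missing value (NaN or pd.NA) into None and every NumPy integer into a plain Python int. to_spanner_rows is a hypothetical helper name, and the resulting list is meant to be passed as the values argument of batch.insert.

```python
import numpy as np
import pandas as pd


def to_spanner_rows(df):
    """Convert a DataFrame into lists of plain Python values, with None for missing."""
    rows = []
    for record in df.itertuples(index=False, name=None):
        row = []
        for value in record:
            if pd.isna(value):
                # NaN and pd.NA both become None, which Spanner stores as NULL.
                row.append(None)
            elif isinstance(value, np.integer):
                # NumPy integers become built-in Python ints.
                row.append(int(value))
            else:
                row.append(value)
        rows.append(row)
    return rows


df = pd.DataFrame({
    "nonnullable": pd.array([3, 4], dtype="Int64"),
    "nullable": pd.array([None, 6], dtype="Int64"),
})
rows = to_spanner_rows(df)
print(rows)
```

This keeps the DataFrame itself in the nullable Int64 dtype and only converts at the boundary, so the same helper can be reused for other tables.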


