How to use the new Int64 pandas object when saving to a parquet file

Question

I'm using Python (Pandas) to convert data from CSV to Parquet and later load it into Google BigQuery. Some of my integer columns contain missing values, and since Pandas 0.24.0 I can store them with the Int64 dtype.

Is there a way to use the Int64 dtype in a parquet file as well? I can't find a clean solution for integers with missing values (so that they stay INTEGER in BigQuery).

I tried importing the data directly into BigQuery, but got the same error as when converting to parquet with Pandas (shown below).

Importing a CSV with an int column that contains a missing value:

import pandas as pd
df = pd.read_csv("docs/test_file.csv")
df[["id"]].info()

id 8 non-null float64

The column is imported as float64. I changed the type to Int64:

df["id"] = df["id"].astype('Int64')
df[["id"]].info()

id 8 non-null Int64
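
For reference, the same dtype can also be requested at read time rather than by casting afterwards. This is a minimal sketch, assuming a small in-memory CSV (the `csv_text` string and its `name` column are made up here) in place of docs/test_file.csv. It only changes how the column is read; it does not by itself avoid the parquet error shown below.

```python
import io

import pandas as pd

# Hypothetical stand-in for docs/test_file.csv: the second row has no id.
csv_text = "id,name\n1,a\n,b\n3,c\n"

# Ask for the nullable integer dtype while parsing, so the column never
# passes through float64.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})

print(df["id"].dtype)
print(df["id"].isna().sum())
```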

Then I tried to save to parquet:

df.to_parquet("output/test.parquet")

Error:

pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column id with type Int64')

Answer 1

There is currently an open issue at https://github.com/googleapis/google-cloud-python/issues/7702 to support the new Int64 columns in google-cloud-bigquery.

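In the meantime, I suggest using the object dtype. As of version 1.13.0 of google-cloud-bigquery, you can specify the desired BigQuery schema, and the library will use the desired types in the parquet file.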

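    import pandas
    from google.cloud import bigquery

    # Config.CLIENT and dataset_id are not defined in this snippet; they are
    # assumed to be a bigquery.Client instance and an existing dataset ID.

    # Schema with a single INTEGER column.
    table_schema = (
        bigquery.SchemaField("int_col", "INTEGER"),
    )

    num_rows = 100
    nulls = [None] * num_rows
    # A list of None values gives the column the object dtype.
    dataframe = pandas.DataFrame(
        {
            "int_col": nulls,
        }
    )

    table_id = "{}.{}.load_table_from_dataframe_w_nulls".format(
        Config.CLIENT.project, dataset_id
    )

    # Passing the schema tells the library which BigQuery type to use.
    job_config = bigquery.LoadJobConfig(schema=table_schema)
    load_job = Config.CLIENT.load_table_from_dataframe(
        dataframe, table_id, job_config=job_config
    )
    load_job.result()  # Wait for the load job to finish.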
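
To connect this to the DataFrame from the question, here is a minimal sketch of turning a float64 column with gaps into an object-dtype column of Python ints and None. The helper name `to_nullable_object_ints` and the in-memory CSV are made up for illustration and are not part of pandas or google-cloud-bigquery. The sketch assumes every non-missing value is a whole number.

```python
import io

import pandas as pd

# Hypothetical stand-in for docs/test_file.csv: the second row has no id.
csv_text = "id,name\n1,a\n,b\n3,c\n"
df = pd.read_csv(io.StringIO(csv_text))  # "id" arrives as float64 because of the gap


def to_nullable_object_ints(series):
    """Return an object-dtype Series of Python ints, with None for missing values."""
    values = [None if pd.isna(v) else int(v) for v in series]
    return pd.Series(values, index=series.index, dtype=object, name=series.name)


df["id"] = to_nullable_object_ints(df["id"])
print(df["id"].tolist())
```

The resulting DataFrame could then be passed to load_table_from_dataframe together with a schema that declares the column as INTEGER, as in the snippet above.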

python

google-bigquery

parquet

pyarrow