String becomes float in BigQuery table after import from pandas dataframe

Question

I have a pandas DataFrame with the following dtypes:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 579585 entries, 0 to 579613
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   itemName     579585 non-null  object        
 1   itemId       579585 non-null  string        
 2   Count        579585 non-null  int32         
 3   Sales        579585 non-null  float64       
 4   Date         579585 non-null  datetime64[ns]
 5   Unit_margin  579585 non-null  float64       
 6   GrossProfit  579585 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int32(1), object(1), string(1)
memory usage: 33.2+ MB

I upload it to a BigQuery table with:

df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', chunksize=None, if_exists='append', auth_local_webserver=False, location=None, progress_bar=True)

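This works, except that the itemId column, which is a string, becomes a float, so all of its leading zeros (which I need) are removed wherever they occur.

I could of course define a schema for my table, but I would like to avoid that. What am I missing?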

Answer 1

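The problem is in the "to_gbq" step. For some reason, its output omits the quotes around the values in the data fields. Without the quotes, BigQuery changes the data type to a number.

BigQuery needs this format: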

{"itemId": "12345", "mappingId":"abc123"}

You sent this format:

{"itemId": 12345, "mappingId":abc123}
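To see why the quotes matter, here is a small sketch that uses only Python's standard json module. It does not reproduce what to_gbq does internally; it only illustrates the effect, and the sample value "00123" is made up for the illustration.

```python
import json

# Hypothetical identifier with leading zeros.
item_id = "00123"

# Serialized as a quoted string, the value survives a round trip unchanged.
quoted = json.dumps({"itemId": item_id})
restored = json.loads(quoted)["itemId"]

# Interpreted as a number, the leading zeros are gone for good.
as_number = float(item_id)

print(quoted)
print(restored)
print(as_number)
```

Once the value has been read as a number, there is no way to recover how many leading zeros it had.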

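One solution in this case is to cast the "itemId" field in pandas with the "astype" method. The pandas documentation has more information on this method.

Here is an example, applied to the column from the question.

df['itemId'] = df['itemId'].astype('str')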
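A slightly fuller sketch of the cast, with a tiny made-up DataFrame standing in for the real one. The sample values and the two-column layout are assumptions for illustration only; the point is that the leading zeros are still present after the cast.

```python
import pandas as pd

# Hypothetical sample data; itemId uses the pandas "string" extension dtype,
# like the column in the question.
df = pd.DataFrame({
    "itemId": pd.array(["00123", "04567"], dtype="string"),
    "Count": [1, 2],
})

# Cast the column to plain strings before uploading.
df["itemId"] = df["itemId"].astype("str")

# The values are ordinary Python strings and keep their leading zeros.
values = df["itemId"].tolist()
print(values)
```

Whether this alone is enough may depend on the pandas and pandas-gbq versions in use, so it is worth checking the resulting column type in BigQuery after the upload.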

Another option is to pass the table_schema parameter to the to_gbq method, listing the BigQuery table fields that the DataFrame columns conform to.

[{'name': 'col1', 'type': 'STRING'},...]
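A minimal sketch of that option, assuming the pandas-gbq package is installed and credentials are already configured. The helper name upload and its arguments are hypothetical. Only itemId is pinned to STRING here; as far as I know, pandas-gbq infers any columns left out of table_schema from their dtypes, but if your version expects a full schema, list every column.

```python
# Pin only the column whose type must not be guessed.
TABLE_SCHEMA = [{"name": "itemId", "type": "STRING"}]


def upload(df, destination_table, project_id):
    """Hypothetical helper: append df to destination_table with itemId as STRING."""
    # Imported lazily so the schema above can be inspected without the package.
    import pandas_gbq

    pandas_gbq.to_gbq(
        df,
        destination_table,
        project_id=project_id,
        if_exists="append",
        table_schema=TABLE_SCHEMA,
    )
```

Usage would look like upload(df_extended_full, '<MY DATASET>.profit', '<MY PROJECT>'). Note that with if_exists='append' the type of the existing column still applies, so a table that was already created with a FLOAT itemId would probably have to be recreated first.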

As a last option, you can switch from pandas-gbq to google-cloud-bigquery. See the comparison between the two libraries.
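A rough sketch of what that could look like, assuming google-cloud-bigquery and pyarrow are installed and default credentials are available. The helper name load_with_explicit_schema and the STRING_COLUMNS list are hypothetical; the schema is partial, so the remaining columns are left to the library to derive from the DataFrame.

```python
# Columns that must arrive in BigQuery as STRING (assumed from the question).
STRING_COLUMNS = ["itemName", "itemId"]


def load_with_explicit_schema(df, table_id):
    """Hypothetical helper: append df to table_id, forcing STRING columns."""
    # Imported lazily; requires google-cloud-bigquery and pyarrow.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        # Partial schema: only the listed columns are pinned.
        schema=[bigquery.SchemaField(name, "STRING") for name in STRING_COLUMNS],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # Wait for the load job to finish.
```

Here table_id is the fully qualified name, for example '<MY PROJECT>.<MY DATASET>.profit'.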

