Writing to GCS using AVRO within AI Notebook

Question

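Summary:

1) How to write a Pandas DataFrame to GCS (Google Cloud Storage) from a Jupyter notebook (such as AI Notebook)

2) In the same notebook, how to call that object so it can be uploaded to a new dataset in BigQuery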

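Problem

I have an object that is too big to download locally and then write to GCS -> BQ. However, the object is not big enough to warrant processing with Apache Beam. I brought it into the notebook using the BQ magic. After some transformations, I want to send the object back to my data repository. So I am trying to copy it using AVRO, but I cannot figure out how to make it work. I have tried following this guide (https://github.com/ynqa/pandavro), but I have not figured out how the function should be called.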

I am doing this:

    OUTPUT_PATH='{}/resumen2008a2019.avro'.format('gcs://xxxx')
    pdx.to_avro(OUTPUT_PATH,df4)

That returns the following error: FileNotFoundError: [Errno 2] No such file or directory: 'gcs://xxxx'

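Why not Parquet? It cannot convert the data correctly to JSON: ArrowInvalid: ('Could not convert with type str: tried to convert to double', 'Conversion failed for column salario with type object')

Why not directly? I tried using this post as a guide, but it is three years old and many things no longer work that way.

Should I give up and just write a classic ol' CSV?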

Answer 1

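Writing a DataFrame directly to BigQuery is very well supported and works smoothly.

Assuming you are using a Google Cloud AI Platform notebook (so that we do not need to set up service accounts or install the bq packages), you can do the following to write from a DataFrame to a BQ table: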

    import pandas
    from google.cloud import bigquery

    client = bigquery.Client(location="US")
    dataset_id = 'your_new_dataset'
    dataset = client.create_dataset(dataset_id)

    records = [
        {"title": "The Meaning of Life", "release_year": 1983},
        {"title": "Monty Python and the Holy Grail", "release_year": 1975},
        {"title": "Life of Brian", "release_year": 1979},
        {"title": "And Now for Something Completely Different", "release_year": 1971},
    ]

    # Optionally set explicit indices.
    # If indices are not specified, a column will be created for the default
    # indices created by pandas.
    index = ["Q24980", "Q25043", "Q24953", "Q16403"]
    df = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))

    table_ref = dataset.table("monty_python")
    job = client.load_table_from_dataframe(df, table_ref, location="US")

    job.result()  # Waits for table load to complete.
    print("Loaded dataframe to {}".format(table_ref.path))
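
One caveat for the data in the question: as far as I know, load_table_from_dataframe serializes the DataFrame through pyarrow by default, so an object column that mixes strings and numbers (like salario in the question) may raise the same ArrowInvalid error here. A minimal sketch of coercing such a column before loading, assuming the column is meant to be numeric; the helper name and the sample values are hypothetical:

```python
import pandas as pd


def coerce_numeric_column(df, column):
    """Return a copy of df with `column` converted to a numeric dtype.

    Values that cannot be parsed as numbers become NaN.
    """
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Hypothetical sample: a column holding a numeric string, an int and junk text.
raw = pd.DataFrame({"salario": ["1000.5", 2000, "n/a"]})
clean = coerce_numeric_column(raw, "salario")
print(clean.dtypes)
```

Whether turning unparseable values into NaN is acceptable depends on the data; inspecting the rows that fail to parse first is usually worthwhile.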

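If you do want to use Pandavro, you will need to change the output path, because a "gs://" path is not a local path and is not understood by tools that can only write to the file system. You basically have to split it into the following steps:

1) Write the file to a local directory
2) Run a job that loads the resulting avro file into BigQuery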
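
The two steps above can be sketched roughly as follows. This is a sketch under assumptions, not a tested recipe for your project: the function names, the local path and the table id are hypothetical, it assumes pandavro, google-cloud-storage and google-cloud-bigquery are installed, and the copy to GCS is an optional extra for the case where the Avro file should also live in the bucket.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/file' into (bucket_name, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("expected a gs:// URI, got {!r}".format(uri))
    bucket_name, _, object_name = uri[len(prefix):].partition("/")
    return bucket_name, object_name


def dataframe_to_bigquery_via_avro(df, local_path, gcs_uri, table_id):
    """Write df as Avro locally, copy it to GCS, then load it into BigQuery."""
    # Imported here so the helper above stays usable without these packages.
    import pandavro as pdx
    from google.cloud import bigquery, storage

    # Step 1: pandavro writes to a local path, not to a gs:// location.
    pdx.to_avro(local_path, df)

    # Optional: keep a copy of the Avro file in the bucket.
    bucket_name, object_name = split_gcs_uri(gcs_uri)
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)

    # Step 2: run a load job that reads the local Avro file.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO
    )
    with open(local_path, "rb") as avro_file:
        job = client.load_table_from_file(
            avro_file, table_id, job_config=job_config
        )
    job.result()  # Waits for the load to complete.
    return job.output_rows
```

With the names from the question, a call could look like dataframe_to_bigquery_via_avro(df4, "/tmp/resumen2008a2019.avro", "gs://xxxx/resumen2008a2019.avro", "your_new_dataset.resumen"), where the /tmp path and the table name are placeholders.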

