Migrate data from Google Analytics to AWS Athena

Question

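I have built a data lake in AWS on top of Athena, and I want to query the data I currently have stored in Google Analytics. As far as I understand, I don't have access to the raw Analytics data, but I can export it to BigQuery and from there export it again to GCS (Google Cloud Storage). I know I can set up an automated process to export the data from Analytics to BigQuery.

How can I (easily) set up the same kind of export from BigQuery to GCS?

Also, what is the easiest way to export all the historical data? I see that I can run an export from the BigQuery console, but it only exports one day of data, and this service has been running for quite a while.

Once all the data is in GCS, I think I can run an AWS Lambda to copy it into my AWS account so that I can query it.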

Answer 1

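According to the documentation, you cannot export data from multiple tables in a single job, so each table needs its own export job.

If you need to automate the export, I suggest using the Python script below.

Before using this script, remember that you need to install the BigQuery SDK for Python. You can do that by running pip install google-cloud-bigquery in your terminal. Also keep in mind that this code assumes you want to export every table in the given dataset. If the dataset contains tables other than the ones you want to export, you will need to filter for the right tables.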

from google.cloud import bigquery as bq

# Defining the variables below to make the code cleaner
project_id = "your-project-id"
dataset_id = "your-dataset-id"

# Starting client
client = bq.Client(project=project_id)

# Getting list of tables in your dataset
t = client.list_tables(dataset=dataset_id)
# Creating a reference to your dataset
dataset_ref = bq.DatasetReference(project_id, dataset_id)


# The loop below will repeat for all the tables listed in the dataset
# The destination is in the format gs://<your-bucket>/<some-folder>/filename
# The filename is in the format export_<table_name>_<hash>.csv
# This hash is created by the wildcard (*). The wildcard is needed when 
# your export is likely to generate a file bigger than 1 GB


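for i in t:
    table_id = i.table_id
    table_ref = dataset_ref.table(table_id)
    destination = "gs://your-bucket/your-folder/" + table_id + "/export_" + table_id + "_*.csv"
    # The location must match the location of your dataset
    extract_job = client.extract_table(table_ref, destination, location="US")
    # Wait for the export job to finish before starting the next one
    extract_job.result()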
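The filtering step mentioned above could look roughly like the sketch below. It assumes the tables are date-sharded with names such as ga_sessions_YYYYMMDD (the naming used by the Universal Analytics export; adjust the prefix if yours differs). The helper names select_daily_tables and destination_uri, the dt=<date> folder layout, and the json.gz extension are illustrative choices, not part of the BigQuery SDK.

```python
from datetime import date, datetime


def select_daily_tables(table_ids, prefix="ga_sessions_", start=None, end=None):
    """Return (table_id, day) pairs for date-sharded tables, oldest first.

    Tables whose name does not look like <prefix>YYYYMMDD are skipped,
    which also drops names such as ga_sessions_intraday_20200101.
    """
    selected = []
    for table_id in table_ids:
        if not table_id.startswith(prefix):
            continue
        suffix = table_id[len(prefix):]
        if len(suffix) != 8 or not suffix.isdigit():
            continue
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a valid calendar date
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        selected.append((table_id, day))
    return sorted(selected, key=lambda pair: pair[1])


def destination_uri(bucket, folder, table_id, day, extension="json.gz"):
    """Build a wildcard GCS URI with one dt=<date> folder per daily table."""
    return (
        f"gs://{bucket}/{folder}/dt={day.isoformat()}/"
        f"export_{table_id}_*.{extension}"
    )


if __name__ == "__main__":
    # Hypothetical table names, standing in for [i.table_id for i in t]
    names = ["ga_sessions_20200102", "ga_sessions_20200101", "some_other_table"]
    for table_id, day in select_daily_tables(names, start=date(2020, 1, 1)):
        print(destination_uri("your-bucket", "your-folder", table_id, day))
```

In the loop above you would pass the table IDs returned by client.list_tables through select_daily_tables and use destination_uri in place of the hard-coded string. Leaving start and end unset selects every daily table, which covers the historical backfill. Note that Google Analytics export tables contain nested and repeated fields, and BigQuery cannot export those as CSV, so you will probably need to pass a bq.ExtractJobConfig with a JSON, Avro, or Parquet destination_format to extract_table.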

google-analytics

google-bigquery