Is it possible to write a BigQuery query to retrieve binned counts of PyPI downloads over time?
The following code is an SQL query for Google's BigQuery that counts the number of times my PyPI package has been downloaded in the last 30 days.
#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pycotools'
-- Only query the last 30 days of history
AND _TABLE_SUFFIX
BETWEEN FORMAT_DATE(
'%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
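Is it possible to modify this query so that I get the number of downloads for each 30-day period since the package was uploaded? The output would be a .csv that looks like this: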
date count
01-01-2016 10
01-02-2016 20
.. ..
01-05-2018 100
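I suggest using EXTRACT (standard SQL) or MONTH() (legacy SQL) and counting only the file.project field, since that should make the query run faster. The query you can use is: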
#standardSQL
SELECT
EXTRACT(MONTH FROM _PARTITIONDATE) AS month_,
EXTRACT(YEAR FROM _PARTITIONDATE) AS year_,
count(file.project) as count
FROM
`the-psf.pypi.downloads*`
WHERE
file.project= 'pycotools'
GROUP BY 1, 2
-- Sort by year first, then month, so rows from different years do not interleave
ORDER BY 2, 1
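The query above bins by calendar month, while the question asks for 30-day bins. A minimal sketch of fixed 30-day bins follows. It assumes the downloads tables expose a `timestamp` column for each download, and it anchors the bins at the first recorded download rather than at the upload date, since the upload date is not part of the query. The names `downloads`, `first_day`, `start_day` and `bin_start` are illustrative only.

```sql
#standardSQL
WITH downloads AS (
  -- One row per download, reduced to its calendar day
  -- (assumes a `timestamp` column exists in these tables)
  SELECT DATE(timestamp) AS day
  FROM `the-psf.pypi.downloads*`
  WHERE file.project = 'pycotools'
),
first_day AS (
  -- Anchor for the bins: the earliest recorded download
  SELECT MIN(day) AS start_day
  FROM downloads
)
SELECT
  -- Integer-divide the day offset by 30 to get the bin index,
  -- then convert the index back to the first date of that bin
  DATE_ADD(start_day,
           INTERVAL 30 * DIV(DATE_DIFF(day, start_day, DAY), 30) DAY) AS bin_start,
  COUNT(*) AS num_downloads
FROM downloads
CROSS JOIN first_day
GROUP BY bin_start
ORDER BY bin_start
```

Bins with no downloads do not appear in the output. Because there is no `_TABLE_SUFFIX` filter, this scans every daily table, so it may process a lot of data. The result can then be saved as a .csv from the BigQuery console.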
I tried it with a public dataset:
#standardSQL
SELECT
EXTRACT(MONTH FROM pickup_datetime) AS month_,
EXTRACT(YEAR FROM pickup_datetime) AS year_,
count(rate_code) as count
FROM
`nyc-tlc.green.trips_2015`
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC
Or using legacy SQL:
SELECT
MONTH(pickup_datetime) AS month_,
YEAR(pickup_datetime) AS year_,
count(rate_code) as count
FROM
[nyc-tlc:green.trips_2015]
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC
The result is:
month_ year_ count
1 2015 34228
2 2015 36366
3 2015 42221
4 2015 41159
5 2015 41934
6 2015 39506
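The result above has separate month and year columns, whereas the output requested in the question has a single date column. One possible variant, sketched here against the same public table, truncates each pickup date to the first day of its month. The column names `month_start` and `num_trips` are illustrative only.

```sql
#standardSQL
SELECT
  -- First day of the month each trip falls in, as a single DATE column
  DATE_TRUNC(DATE(pickup_datetime), MONTH) AS month_start,
  COUNT(rate_code) AS num_trips
FROM
  `nyc-tlc.green.trips_2015`
WHERE
  rate_code = 5
GROUP BY month_start
ORDER BY month_start
```

Sorting on a single DATE column also keeps the rows in chronological order when the data spans more than one year.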
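I see you are using _TABLE_SUFFIX. If you are querying a partitioned table, you can filter on the _PARTITIONDATE pseudo-column instead of formatting dates and using DATE_SUB. This will also use less compute time.
For example, to query by partition date: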
SELECT
[COLUMN]
FROM
[DATASET].[TABLE]
WHERE
_PARTITIONDATE BETWEEN '2016-01-01'
AND '2016-01-02'