How to optimize a google-bigquery query for finding the most frequent categories in a big data table?

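I am using google-bigquery on the Chicago crime dataset. I want to find the most frequent crime type in the primary_type column for each distinct block. To do this, I have been working with Standard SQL.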

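Data:

Since the Chicago crime data is rather big, there is an official website where the dataset can be previewed: crime data on Google cloud

My current Standard SQL: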

SELECT primary_type,block, COUNT(*) as count
FROM `bigquery-public-data.chicago_crime.crime` 
HAVING COUNT(*) = (SELECT MAX(count)
  FROM (SELECT primary_type, COUNT(*) as count FROM `bigquery-public-data.chicago_crime.crime` GROUP BY primary_type, block) `bigquery-public-data.chicago_crime.crime`)

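The problem with my query above is that it currently fails with an error, and I suspect that even once the error is fixed, the query will be inefficient. How can I fix and optimize it?

How to use regular expressions in Standard SQL:

To count the most frequent type for each block, including both north and south, I have to deal with regex. For example, from 033XX S WOOD ST I should get only S WOOD ST, and from 033XX N WOOD ST I need to get N WOOD ST, and then count those values. How can I do that?

Desired output:

In my desired output, for each block, for example WOOD ST (both north (N WOOD ST) and south (S WOOD ST)), I want to find the most frequent crime type. In the final output I expect three columns: block, primary_type, count. Is there any way to do this with google-bigquery?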

The following is for BigQuery Standard SQL:

#standardSQL
SELECT
  block,
  ARRAY_AGG(STRUCT(primary_type, cnt) ORDER BY cnt DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    block,
    primary_type, 
    COUNT(*) cnt
  FROM `bigquery-public-data.chicago_crime.crime` 
  GROUP BY block, primary_type
)
GROUP BY block   

How can I get the overall most frequent crime type on block WOOD ST? Is there any hack to do this?

I am not familiar with the specifics of this data, but from a quick look, I think you can try the following:

#standardSQL
SELECT
  block,
  ARRAY_AGG(STRUCT(primary_type, cnt) ORDER BY cnt DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    SUBSTR(block, 8) block,
    primary_type, 
    COUNT(*) cnt
  FROM `bigquery-public-data.chicago_crime.crime` 
  GROUP BY block, primary_type
)
GROUP BY block
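
As a hedged alternative to the fixed-offset SUBSTR above, and to address the regex part of the question, REGEXP_EXTRACT can strip the masked house-number prefix (and optionally the direction letter) by pattern rather than by position. This is only a sketch: the sample_blocks CTE and the output column names are illustrative and not part of the dataset, and the patterns assume every block looks like "number, N/S/E/W, street name". Values that do not match yield NULL.

```sql
#standardSQL
-- sample_blocks is a hypothetical inline table holding the two values from the question
WITH sample_blocks AS (
  SELECT '033XX S WOOD ST' AS block UNION ALL
  SELECT '033XX N WOOD ST'
)
SELECT
  block,
  -- drop the first token (the masked house number): 'S WOOD ST', 'N WOOD ST'
  REGEXP_EXTRACT(block, r'^\S+\s+(.+)$') AS street_with_direction,
  -- also drop a single-letter direction: 'WOOD ST' for both rows
  REGEXP_EXTRACT(block, r'^\S+\s+[NSEW]\s+(.+)$') AS street_only
FROM sample_blocks
```

Either expression could take the place of SUBSTR(block, 8) in the inner query, depending on whether north and south should be counted separately (first pattern) or together (second pattern).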

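This should give you the most frequent crime in each block.

The inner query uses COUNT to compute how often each crime type occurs, and the window function, partitioned by block, ranks the crime types within each block by that frequency in descending order. The outer query's WHERE clause, rank = 1, returns only the most frequent crime. You can change the outer WHERE clause to rank <= 5 to get the top 5 most frequent crimes.
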
SELECT *
FROM (
  SELECT
    block,
    primary_type,
    COUNT(primary_type) AS crime_frequency,
    -- rank crime types within each block, most frequent first
    ROW_NUMBER() OVER (PARTITION BY block ORDER BY COUNT(primary_type) DESC) AS rank
  FROM `bigquery-public-data.chicago_crime.crime`
  GROUP BY block, primary_type
)
WHERE rank = 1
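
The top-5 variant mentioned above could look like the sketch below. It is the same approach with only the filter changed; the aliases crime_count and rn are illustrative names chosen here, not taken from the original answer.

```sql
#standardSQL
SELECT block, primary_type, crime_count, rn
FROM (
  SELECT
    block,
    primary_type,
    COUNT(primary_type) AS crime_count,
    -- number crime types within each block, most frequent first
    ROW_NUMBER() OVER (PARTITION BY block ORDER BY COUNT(primary_type) DESC) AS rn
  FROM `bigquery-public-data.chicago_crime.crime`
  GROUP BY block, primary_type
)
-- keep up to five crime types per block
WHERE rn <= 5
```

Note that ROW_NUMBER breaks ties arbitrarily, so when two crime types share the same count in a block, which one lands at position 5 is not deterministic. Using RANK() instead would keep all tied rows, at the cost of sometimes returning more than five rows per block.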