如何将 IP 地址转换为 BigQuery 标准中的地理位置 SQL?
How to transform IP addresses into geolocation in BigQuery standard SQL?
所以我读了https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html
但我想知道是否有 #standardSQL
方法可以做到这一点。到目前为止,我在转换 PARSE_IP 和 NTH() 时遇到了很多挑战,因为迁移文档中建议的更改有局限性。
从 PARSE_IP(contributor_ip)
到 NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip))
不适用于 IPv6 IP 地址。
从 NTH(1, latitude) lat
到 latitude[SAFE_ORDINAL(1)]
不起作用,因为纬度被视为字符串。
并且可能还有更多我尚未遇到的迁移问题。有谁知道如何将 IP 地址转换为 BigQuery 标准中的地理位置 SQL?
P.S。我将如何从地理定位到确定时区?
编辑:那么这之间有什么区别呢
#legacySQL
SELECT
COUNT(*) c,
city,
countryLabel,
NTH(1, latitude) lat,
NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN EACH
[fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
和
SELECT
COUNT(*) c,
city,
countryLabel,
ANY_VALUE(latitude) lat,
ANY_VALUE(longitude) lng
FROM (
SELECT
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
ELSE NULL
END AS clientIpNum,
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64)
ELSE NULL
END AS classB
FROM
`publicdata.samples.wikipedia`
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
edit2:似乎我设法通过不正确地投射浮点数来解决问题。现在,标准 SQL returns 41815 行而不是遗留 SQL 中的 56347 行,这可能是由于标准 SQL 缺少从 IPv6 到 int 的转换,但这可能是由于其他原因。此外,遗留 SQL 查询的性能要好得多,运行 大约需要 10 秒,而不是标准 SQL.
的整分钟
根据https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7
#standardSQL
SELECT
id,
IFNULL(city, 'Other') AS city,
IFNULL(countryLabel, 'Other') AS countryLabel,
latitude,
longitude
FROM (
SELECT
id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
FROM
`<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
id ASC
此问题的答案对 ipv6 地址无效。
按照此处描述的方法 https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 我想到了这个解决方案:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
为了获得 geoip_city_v4
和 geoip_city_v4
您需要从 https://maxmind.com/
下载 geoip 数据库
您可以按照本教程更新和准备数据集 https://hodo.dev/posts/post37-gcp-bigquery-geoip/。
所以我读了https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html
但我想知道是否有 #standardSQL
方法可以做到这一点。到目前为止,我在转换 PARSE_IP 和 NTH() 时遇到了很多挑战,因为迁移文档中建议的更改有局限性。
从 PARSE_IP(contributor_ip)
到 NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip))
不适用于 IPv6 IP 地址。
从 NTH(1, latitude) lat
到 latitude[SAFE_ORDINAL(1)]
不起作用,因为纬度被视为字符串。
并且可能还有更多我尚未遇到的迁移问题。有谁知道如何将 IP 地址转换为 BigQuery 标准中的地理位置 SQL?
P.S。我将如何从地理定位到确定时区?
编辑:那么这之间有什么区别呢
#legacySQL
SELECT
COUNT(*) c,
city,
countryLabel,
NTH(1, latitude) lat,
NTH(1, longitude) lng
FROM (
SELECT
INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
FROM
[publicdata:samples.wikipedia]
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN EACH
[fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
和
SELECT
COUNT(*) c,
city,
countryLabel,
ANY_VALUE(latitude) lat,
ANY_VALUE(longitude) lng
FROM (
SELECT
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
ELSE NULL
END AS clientIpNum,
CASE
WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64)
ELSE NULL
END AS classB
FROM
`publicdata.samples.wikipedia`
WHERE
contributor_ip IS NOT NULL ) AS a
JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
WHERE
a.clientIpNum BETWEEN b.startIpNum
AND b.endIpNum
AND city != ''
GROUP BY
city,
countryLabel
ORDER BY
1 DESC
edit2:似乎我设法通过不正确地投射浮点数来解决问题。现在,标准 SQL returns 41815 行而不是遗留 SQL 中的 56347 行,这可能是由于标准 SQL 缺少从 IPv6 到 int 的转换,但这可能是由于其他原因。此外,遗留 SQL 查询的性能要好得多,运行 大约需要 10 秒,而不是标准 SQL.
的整分钟根据https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7
#standardSQL
SELECT
id,
IFNULL(city, 'Other') AS city,
IFNULL(countryLabel, 'Other') AS countryLabel,
latitude,
longitude
FROM (
SELECT
id,
NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
FROM
`<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
`fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
a.classB = b.classB
AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
id ASC
此问题的答案对 ipv6 地址无效。
按照此处描述的方法 https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 我想到了这个解决方案:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
为了获得 geoip_city_v4
和 geoip_city_v4
您需要从 https://maxmind.com/
您可以按照本教程更新和准备数据集 https://hodo.dev/posts/post37-gcp-bigquery-geoip/。