How to apply Python function to BigQuery columns without loading them?

Question

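I have a BigQuery dataset with roughly one million rows.

I would like to apply a Python function to two of its columns without loading them. Is that possible?

Ideally, the result would end up in a new column. The function is not easy to translate to SQL; see the concrete example below.

Why do I want this?

I want to know which country the coordinate pair in each row (latsE7 and lonsE7) lies in. This is how I currently do it: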

import geopandas as gpd
from shapely.geometry import Point
from tqdm.notebook import tqdm

Load the GeoPandas map (low resolution, but good enough):

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

The function that finds the country for a given coordinate, which is the function I would like to apply in BigQuery:

def country_finder(lat, lon):
    try:
        # Keep the countries whose polygon contains the point; take the first match.
        res = world[world.geometry.apply(lambda row: row.contains(Point(lon, lat)))].name.iloc[0]
    except IndexError:
        res = "UNCLEAR"  # point isn't in any country (e.g. over an ocean)
    return res

After loading the latsE7 and lonsE7 columns from BigQuery into lists, I apply this function and get a list:

countrylist = [country_finder(latE7/1e7, lonE7/1e7)
               for latE7, lonE7 in tqdm(zip(latsE7, lonsE7), total=len(latsE7))]

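The problem is that this takes a very long time, as I can see from the tqdm progress bar. I could wait for it to finish and then upload the result to BigQuery, but I hope there is a better way to do this.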

Answer 1

I'd poke at the assertion that this isn't easily translatable to SQL. Unless I'm missing something, you're describing a geospatial JOIN between your data table and a table holding the country geometries.

In particular, see https://cloud.google.com/bigquery/docs/geospatial-data for more details about working with geospatial data in BigQuery. Given your use of contains() from geopandas, I'd point you towards ST_CONTAINS.
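As a rough sketch of what that JOIN could look like when driven from Python: the code below only builds the SQL and hands it to the BigQuery client, so the coordinate columns never leave BigQuery. The table names, the countries table with its `name` and `geom` (GEOGRAPHY) columns, and the helper names `build_country_query` and `run_country_query` are assumptions for illustration, not something taken from your setup. It also assumes latsE7 and lonsE7 are the actual column names.

```python
def build_country_query(points_table: str, countries_table: str, dest_table: str) -> str:
    """Build a query that writes the points plus a `country` column to dest_table.

    Assumptions (hypothetical schema): points_table has integer columns latsE7
    and lonsE7; countries_table has a STRING column `name` and a GEOGRAPHY
    column `geom`.
    """
    return f"""
CREATE OR REPLACE TABLE `{dest_table}` AS
SELECT
  p.*,
  c.name AS country
FROM `{points_table}` AS p
JOIN `{countries_table}` AS c
  -- ST_GEOGPOINT takes (longitude, latitude), in that order
  ON ST_CONTAINS(c.geom, ST_GEOGPOINT(p.lonsE7 / 1e7, p.latsE7 / 1e7))
"""


def run_country_query(sql: str) -> None:
    """Run the query inside BigQuery; no rows are downloaded to the client."""
    # Imported here so the query builder works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    client.query(sql).result()  # block until the job has finished


if __name__ == "__main__":
    # Placeholder table names; replace with your own before running.
    print(build_country_query(
        "my_project.my_dataset.points",
        "my_project.my_dataset.countries",
        "my_project.my_dataset.points_with_country",
    ))
```

Because this is an inner join, rows whose point falls in no country (the "UNCLEAR" case in your function) are dropped from the result and would need separate handling. You would also have to get the country polygons into BigQuery first, either by uploading them or by using a table that already has them.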

python

google-bigquery