Only get street and country from Libpostal (Pypostal) - PySpark

I'm using libpostal (via pypostal) to parse addresses, but I only need the road and country, returned as arrays like ["franklin ave","usa"], ["leonard st","united kingdom"].

How can I do this?

The return type is net.razorvine.pickle.objects.classdictconstructor

from pyspark.sql.functions import udf

LIBPOSTAL_LOADED = False
@udf("string")
def parse(address):
   from postal.parser import parse_address

   address_parsed = parse_address(address)

   return str(address_parsed)

spark.createDataFrame(['781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA','The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom'], "string").toDF("address").select(parse("address")).show(truncate=False)

@MCK Updated as requested:

@udf("array<string>")
def parse(address):
   from postal.parser import parse_address

   address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

   return address_parsed

+------------------+
|[franklin ave,usa]|
+------------------+

This works as expected.

@udf("array<string>")
def parse(address):
   from postal.parser import parse_address

   address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

   return address_parsed[0]

+-----+
|null |
+-----+

This is not what I expected. I want the first element of address_parsed, which should be franklin ave.

Maybe you can try a list comprehension before returning the parsed address:

@udf("array<string>")
def parse(address):
   from postal.parser import parse_address

   address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]

   return address_parsed
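
As for the null in the updated question: as far as I can tell, it comes from the declared return type rather than from the comprehension. The UDF is declared as "array<string>", but address_parsed[0] is a plain string, and Spark typically yields null when the returned value does not match the declared type. Below is a minimal sketch of the filtering logic as plain Python so it can be checked without Spark or libpostal. The names pick_components and first_component are hypothetical helpers, and SAMPLE_PARSED is made-up data shaped like the (value, label) pairs that parse_address returns.

```python
# Hypothetical sample shaped like parse_address() output: (value, label) pairs.
# Only the "road" and "country" entries are taken from the question's output;
# the other entries are illustrative assumptions.
SAMPLE_PARSED = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("brooklyn", "city_district"),
    ("11216", "postcode"),
    ("usa", "country"),
]

WANTED_LABELS = ("road", "country")


def pick_components(parsed, wanted=WANTED_LABELS):
    """Keep only the values whose label is in `wanted`, in parse order."""
    return [value for value, label in parsed if label in wanted]


def first_component(parsed, wanted=WANTED_LABELS):
    """Return the first matching value as a string, or None if nothing matches."""
    picked = pick_components(parsed, wanted)
    return picked[0] if picked else None


print(pick_components(SAMPLE_PARSED))  # ['franklin ave', 'usa']
print(first_component(SAMPLE_PARSED))  # franklin ave

# Inside Spark (not executed here), the single-value variant would be
# declared with a string return type instead of array<string>:
#   @udf("string")
#   def parse_first(address):
#       from postal.parser import parse_address
#       return first_component(parse_address(address))
```

The guard in first_component also avoids an IndexError for addresses where libpostal finds neither a road nor a country.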