Only get street and country from Libpostal (Pypostal) - PySpark
I am using libpostal (via pypostal) to parse addresses, but I only need the road and country components, returned as arrays like ["franklin ave","usa"] and ["leonard st","united kingdom"].
How can I do this?
The return type is net.razorvine.pickle.objects.classdictconstructor
from pyspark.sql.functions import udf
LIBPOSTAL_LOADED = False

@udf("string")
def parse(address):
    from postal.parser import parse_address
    address_parsed = parse_address(address)
    return str(address_parsed)

spark.createDataFrame(
    ['781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA',
     'The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom'],
    "string",
).toDF("address").select(parse("address")).show(truncate=False)
@MCK Updated as requested:
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]
    return address_parsed
+------------------+
|[franklin ave,usa]|
+------------------+
This is as expected.
################################################## ##########################
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]
    return address_parsed[0]
+-----+
|null |
+-----+
This is not as expected. I expected the first element of address_parsed to be franklin ave.
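A likely explanation for the null, offered as an assumption and not something confirmed in this thread: the UDF is still declared as array<string>, but address_parsed[0] is a plain string, and as far as I know a regular (non-Arrow) Python UDF yields null when the returned value does not match the declared type. Below is a minimal pure-Python sketch of the "first element" logic. The function name first_road_or_country and the sample list are hypothetical; the sample is hand-written in the (value, label) shape that parse_address returns and is not real libpostal output.

```python
from typing import List, Optional, Tuple


def first_road_or_country(parsed: List[Tuple[str, str]]) -> Optional[str]:
    """Return the first value labelled 'road' or 'country', or None if neither exists."""
    matches = [value for value, label in parsed if label in ("road", "country")]
    # Guard against an empty list so a missing component gives None, not an IndexError.
    return matches[0] if matches else None


# Hand-written sample shaped like parse_address output; not real libpostal output.
sample = [("781", "house_number"), ("franklin ave", "road"), ("usa", "country")]
print(first_road_or_country(sample))  # franklin ave
```

Inside the UDF this would be called as first_road_or_country(parse_address(address)), with the decorator changed to @udf("string") so the declared type matches the single string being returned.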
Maybe you can try a list comprehension before returning the parsed address:
@udf("array<string>")
def parse(address):
    from postal.parser import parse_address
    address_parsed = [a[0] for a in parse_address(address) if a[1] in ['road', 'country']]
    return address_parsed
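One caveat with the list comprehension is that the result follows whatever order the parser emits, and the array gets shorter if a component is missing. If a fixed [road, country] layout is wanted, something like the following sketch might help. The name pick_components and the sample data are hypothetical; the sample is hand-written in the (value, label) shape that parse_address returns and is not real libpostal output.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def pick_components(
    parsed: List[Tuple[str, str]],
    wanted: Sequence[str] = ("road", "country"),
) -> List[Optional[str]]:
    """Return one value per wanted label, in the order of `wanted`.

    Missing labels become None; if a label repeats, the first value wins.
    """
    first_seen: Dict[str, str] = {}
    for value, label in parsed:
        first_seen.setdefault(label, value)
    return [first_seen.get(label) for label in wanted]


# Hand-written sample shaped like parse_address output; not real libpostal output.
sample = [
    ("100-106", "house_number"),
    ("leonard st", "road"),
    ("london", "city"),
    ("united kingdom", "country"),
]
print(pick_components(sample))  # ['leonard st', 'united kingdom']
```

In the UDF above this would replace the comprehension, i.e. return pick_components(parse_address(address)) while keeping @udf("array<string>"), so position 0 is always the road and position 1 the country (or null when absent).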