Multithreading issues with Pandas

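I have a very large Excel file containing more than 1,000 street intersections. I need to find the longitude and latitude of each one and then write that information to a file/list for use by another program.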

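What I'm stuck on is how to use multithreading/multiprocessing to build a more efficient script. I have gone through other questions/posts, but I found them a bit confusing. The code below takes 10 minutes or more to run. Any help would be great.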

from datetime import datetime

import pandas
from geopy.geocoders import ArcGIS

start_time = datetime.now()


def my_LatLong(address):
    # One network round trip to the online ArcGIS geocoder per call.
    # geocode() returns a Location, or None when nothing matches.
    return ArcGIS().geocode(address)


df = pandas.read_excel("street_sample.xlsx", sheet_name=0)
street_list = df.loc[:, "Description"]

new_list = []
for count, i in enumerate(street_list, start=1):
    # Keep only the text before the first '-' (the intersection name).
    location = my_LatLong(f"{i.split('-')[0]}, Vancouver, Canada")
    if location is not None:
        new_list.append([f"{len(street_list)}, {list(location)}"])
        print(f"{count}/{len(street_list)} - {i} = Completed\t\t\t\t", end='\r')
    else:
        print(f"{count}/{len(street_list)} - {i} == None value \t\t\t", end='\r')

# doing something with new_list

end_time = datetime.now()

print(f"Program ran for:  {end_time - start_time}")

street_sample.xlsx

ID Description
12501x 1900 W Georgia - ONAT_STREET:
12501x 4th/6th Diversion & 6th Ped - ONAT_STREET:
12501x 4th/6th Diversion & 6th Semi - ONAT_STREET:
12501x Abbott & Cordova - ONAT_STREET:
12501x Abbott & Expo - ONAT_STREET:
12501x Abbott & Hastings - ONAT_STREET:
12501x Abbott & Keefer - ONAT_STREET:
12501x Aberdeen & Kingsway - ONAT_STREET:
12501x Alberta & 49th - ONAT_STREET: 70175
12501x Alder & 12th - ONAT_STREET:
12501x Alder & 6th - ONAT_STREET:
12501x Alder & Broadway - ONAT_STREET:
12501x Alexandra & King Edward - ONAT_STREET:
12501x Alma & 10th - ONAT_STREET:
12501x Alma & 4th - ONAT_STREET:
12501x Alma & 6th - ONAT_STREET:
12501x Alma & Broadway - ONAT_STREET:
12501x Alma & Point Grey Road - ONAT_STREET:
12501x Anderson & 2nd / Lamey's Mill - ONAT_STREET:
12501x Anderson (Granville) & 4th - ONAT_STREET:
12501x Angus & 41st - ONAT_STREET:
12501x Angus & Marine - ONAT_STREET:
12501x Arbutus & 10th - ONAT_STREET:
12501x Arbutus & 11th - ONAT_STREET:
12501x Arbutus & 12th - ONAT_STREET:
12501x Arbutus & 16th - ONAT_STREET:
12501x Arbutus & 20th - ONAT_STREET:
12501x Arbutus & 33rd - ONAT_STREET:
12501x Arbutus & 4th - ONAT_STREET:
12501x Arbutus & 8th - ONAT_STREET:
12501x Arbutus & Broadway - ONAT_STREET:
12501x Arbutus & Cornwall - ONAT_STREET:
12501x Arbutus & King Edward - ONAT_STREET:
12501x Arbutus & Lahb - ONAT_STREET:

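The problem does not come from Pandas but from ArcGIS().geocode(address), which is very slow. In fact, on my machine this line takes about 400 ms per request. Each request sends a slow network query to the online ArcGIS API. Using multiprocessing will not help much, because you will quickly hit other limits (the API rate-limits requests, and the website becomes saturated). You need to send batch requests. Unfortunately, this does not appear to be supported by the geopy package. If you are bound to ArcGIS, you will need to use their own API. You can find more information on how to do this in the ArcGIS documentation.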
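
To make the batching idea concrete, here is a minimal sketch of how the addresses could be grouped so that each network call carries many of them. It is not ArcGIS's API: `send_batch` is a hypothetical callable that you would implement against whatever batch endpoint you end up using, `fake_send_batch` is only a stand-in so the sketch runs offline, and the batch size of 100 is an arbitrary assumption rather than a documented limit. The 0.4 s figure is the per-request time measured above.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

SECONDS_PER_REQUEST = 0.4  # about 400 ms per geocode call, as measured above


def to_address(description: str) -> str:
    """Turn '1900 W Georgia - ONAT_STREET:' into a geocodable address string."""
    return f"{description.split('-')[0].strip()}, Vancouver, Canada"


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def geocode_in_batches(
    addresses: Sequence[str],
    send_batch: Callable[[List[str]], List[Tuple[str, object]]],
    batch_size: int = 100,  # assumption: pick a value your service allows
) -> List[Tuple[str, object]]:
    """Call `send_batch` once per chunk and concatenate the results."""
    results: List[Tuple[str, object]] = []
    for batch in chunked(addresses, batch_size):
        results.extend(send_batch(batch))
    return results


def fake_send_batch(batch: List[str]) -> List[Tuple[str, object]]:
    """Hypothetical stand-in for a real batch geocoding call (no network)."""
    return [(address, None) for address in batch]


if __name__ == "__main__":
    descriptions = ["Abbott & Cordova - ONAT_STREET:"] * 1000
    addresses = [to_address(d) for d in descriptions]

    # Cost of the one-request-per-address approach at the measured latency.
    sequential_seconds = len(addresses) * SECONDS_PER_REQUEST
    print(f"one request per address: about {sequential_seconds:.0f} s")

    results = geocode_in_batches(addresses, fake_send_batch)
    n_batches = len(list(chunked(addresses, 100)))
    print(f"batched: {n_batches} requests for {len(results)} addresses")
```

With 1,000 addresses at roughly 0.4 s each, the one-at-a-time loop spends several minutes just waiting on the network, which matches the run time reported in the question. Batching reduces the number of round trips, though how long each batch request takes depends on the service.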