Multithreading issues with Pandas
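I have a very large Excel file containing 1000+ street intersections. I need to find the longitude and latitude of each one, then write that information to a file/list for another program to consume.
What I'm stuck on is how to build a more efficient script using multithreading/multiprocessing. I've looked through other questions/posts, but I find them a bit confusing. The code below takes roughly 10 minutes or more. Any help would be great.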
from geopy.geocoders import ArcGIS
import pandas
from datetime import datetime

start_time = datetime.now()


def my_LatLong(address):
    # One network round trip to the online ArcGIS geocoder per call.
    n = ArcGIS().geocode(address)
    if n is not None:
        return n
    else:
        # Placeholder for an address that could not be geocoded.
        return [address, "None"]


df = pandas.read_excel("street_sample.xlsx", sheet_name=0)
count = 0
street_list = df.loc[:, "Description"]
new_list = []
for i in list(street_list):
    # Keep only the text before the first '-' and append the city.
    location = my_LatLong(f"{i.split('-')[0]}, Vancouver, Canada")
    if location is not None:
        new_list.append([f"{len(street_list)}, {list(location)}"])
        print(f"{count}/{len(street_list)} - {i} = Completed\t\t\t\t", end='\r')
    else:
        print(f"{count}/{len(street_list)} - {i} == None value \t\t\t", end='\r')
    count += 1

# doing something with new_list
endtime_time = datetime.now()
print(f"Program ran for: {endtime_time - start_time}")
street_sample.xlsx
ID | Description |
---|---|
12501x | 1900 W Georgia - ONAT_STREET: |
12501x | 4th/6th Diversion & 6th Ped - ONAT_STREET: |
12501x | 4th/6th Diversion & 6th Semi - ONAT_STREET: |
12501x | Abbott & Cordova - ONAT_STREET: |
12501x | Abbott & Expo - ONAT_STREET: |
12501x | Abbott & Hastings - ONAT_STREET: |
12501x | Abbott & Keefer - ONAT_STREET: |
12501x | Aberdeen & Kingsway - ONAT_STREET: |
12501x | Alberta & 49th - ONAT_STREET: 70175 |
12501x | Alder & 12th - ONAT_STREET: |
12501x | Alder & 6th - ONAT_STREET: |
12501x | Alder & Broadway - ONAT_STREET: |
12501x | Alexandra & King Edward - ONAT_STREET: |
12501x | Alma & 10th - ONAT_STREET: |
12501x | Alma & 4th - ONAT_STREET: |
12501x | Alma & 6th - ONAT_STREET: |
12501x | Alma & Broadway - ONAT_STREET: |
12501x | Alma & Point Grey Road - ONAT_STREET: |
12501x | Anderson & 2nd / Lamey's Mill - ONAT_STREET: |
12501x | Anderson (Granville) & 4th - ONAT_STREET: |
12501x | Angus & 41st - ONAT_STREET: |
12501x | Angus & Marine - ONAT_STREET: |
12501x | Arbutus & 10th - ONAT_STREET: |
12501x | Arbutus & 11th - ONAT_STREET: |
12501x | Arbutus & 12th - ONAT_STREET: |
12501x | Arbutus & 16th - ONAT_STREET: |
12501x | Arbutus & 20th - ONAT_STREET: |
12501x | Arbutus & 33rd - ONAT_STREET: |
12501x | Arbutus & 4th - ONAT_STREET: |
12501x | Arbutus & 8th - ONAT_STREET: |
12501x | Arbutus & Broadway - ONAT_STREET: |
12501x | Arbutus & Cornwall - ONAT_STREET: |
12501x | Arbutus & King Edward - ONAT_STREET: |
12501x | Arbutus & Lahb - ONAT_STREET: |
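The problem does not come from Pandas but from ArcGIS().geocode(address), which is very slow. In fact, on my machine, this line takes about 400 ms per request, so 1000+ sequential requests add up to several minutes. Each call sends a slow network query to the online ArcGIS API. Using multiprocessing will not help much, because you will quickly hit other limits (API rate limiting, saturation of the website). You need to send batch requests. Unfortunately, this does not seem to be supported by the geopy package. If you are bound to ArcGIS, you need to use their own API. You can find more information about how to do that in the ArcGIS documentation.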
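To make the batching idea concrete, here is a minimal sketch using only the standard library. Treat it as an illustration under assumptions, not a tested client: the endpoint URL, the `records`/`attributes`/`OBJECTID`/`SingleLine` payload layout, and the `token` parameter reflect my understanding of the ArcGIS `geocodeAddresses` REST operation and should be checked against the ArcGIS documentation. The helper names (`clean_address`, `chunked`, `build_batch_payload`, `geocode_batch`) are made up for this example.

```python
import json
from urllib import parse, request

# Assumed endpoint of the ArcGIS batch geocoding operation; verify it in the
# ArcGIS documentation before relying on it.
BATCH_URL = ("https://geocode.arcgis.com/arcgis/rest/services/"
             "World/GeocodeServer/geocodeAddresses")


def clean_address(description, city="Vancouver, Canada"):
    """Keep the text before the first '-' and append the city, as the question does."""
    return f"{description.split('-')[0].strip()}, {city}"


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payload(addresses, first_id=1):
    """Build the JSON string for the `addresses` form field of one batch request."""
    records = [
        {"attributes": {"OBJECTID": first_id + offset, "SingleLine": address}}
        for offset, address in enumerate(addresses)
    ]
    return json.dumps({"records": records})


def geocode_batch(addresses, token, timeout=60):
    """Send one batch request and return the decoded JSON reply.

    Needs network access and a valid ArcGIS token (assumption: batch
    geocoding is an authenticated operation).
    """
    form = parse.urlencode({
        "addresses": build_batch_payload(addresses),
        "f": "json",
        "token": token,
    }).encode("utf-8")
    with request.urlopen(request.Request(BATCH_URL, data=form), timeout=timeout) as reply:
        return json.load(reply)
```

With something like `for batch in chunked(addresses, 100): geocode_batch(batch, token)`, 1000 intersections would need about 10 round trips instead of 1000. The batch size of 100 is an arbitrary placeholder; the service defines its own maximum, so check that before picking a value.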