concurrent.futures ThreadPoolExecutor not finishing pd.DataFrame.append
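While using Python's ThreadPoolExecutor to iterate over a list and perform some network requests, I ran into a problem: my workers do not seem to finish before the tasks are marked as done.
If I run the same task with a for loop and with ThreadPoolExecutor, the length of my DataFrame varies from run to run with ThreadPoolExecutor. The for loop always executes all tasks.
Is something wrong, or is there something I need to add to the ThreadPoolExecutor to make it work properly?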
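import pandas as pd
import time
import concurrent.futures

columns = ['name']
data = pd.DataFrame(columns=columns)
persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']

def update(person):
    global data
    time.sleep(0.2)  # simulates slow work such as a network request
    # Note: DataFrame.append was removed in pandas 2.0, so this needs an older pandas.
    data = data.append(pd.DataFrame({'name': person}, index=[person]))

# Plain for loop: always adds all five rows.
for x in persons:
    update(x)
print(len(data))

# ThreadPoolExecutor: the resulting length varies between runs.
data = pd.DataFrame(columns=columns)
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(update, persons)
print(len(data))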
From the documentation:
As of pandas 0.11, pandas is not 100% thread safe. The known issues
relate to the copy() method. If you are doing a lot of copying of
DataFrame objects shared among threads, we recommend holding locks
inside the threads where the data copying occurs.
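Besides holding a lock around the `data = data.append(...)` line, one way to sidestep the problem is to not share the DataFrame between threads at all. The sketch below is only an illustration, not the original poster's code: each worker returns its row, `executor.map` collects the rows in input order, and the DataFrame is built once afterwards. The names `fetch`, `collect_rows` and `to_frame` are hypothetical, and `time.sleep` again stands in for the real network request.

```python
import concurrent.futures
import time

persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']

def fetch(person):
    """Hypothetical worker: do the slow work and return a row
    instead of rebinding a shared global DataFrame."""
    time.sleep(0.2)  # stands in for the network request
    return {'name': person}

def collect_rows(people):
    # executor.map yields results in the order of the inputs;
    # list() consumes the iterator, so every task has finished on return.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        return list(executor.map(fetch, people))

def to_frame(rows):
    # Imported here so the threading part above runs without pandas installed.
    import pandas as pd
    # Build the DataFrame once, in the main thread, from the collected rows.
    return pd.DataFrame(rows, index=[row['name'] for row in rows])

rows = collect_rows(persons)
print(len(rows))
```

Calling `to_frame(rows)` then gives a frame with one row per person. Because no thread ever reads and rebinds a shared `data` object, there is no update to lose, and building the frame once should also be cheaper than appending row by row.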