concurrent.futures ThreadPoolExecutor not finishing pd.DataFrame.append

Question

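While using Python's ThreadPoolExecutor to iterate over a list and perform some network requests, I ran into a problem: my workers do not seem to finish before the tasks are marked as done.

If I run the same task with a for loop and with ThreadPoolExecutor, the length of my DataFrame varies from run to run with ThreadPoolExecutor. The for loop always executes all tasks.

Is something wrong, or is there something I need to add to ThreadPoolExecutor to make it work properly?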

import pandas as pd
import time
import concurrent.futures


columns = ['name']
data = pd.DataFrame(columns = columns)
persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']

def update(person):
    global data
    time.sleep(0.2)
    data = data.append(pd.DataFrame({'name': person}, index=[person]))


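# Sequential baseline: always ends with 5 rows.
for x in persons:
    update(x)
print(len(data))
data = pd.DataFrame(columns = columns)  # reset before the threaded run

# Threaded version: the printed length varies between runs.
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(update, persons)
print(len(data))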

Answer 1

From the pandas documentation:

As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
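To make the locking advice concrete, here is a minimal sketch under a few assumptions. The `with` block in the question already waits for every submitted task, so the missing rows most likely come from the unsynchronized read-modify-write in `data = data.append(...)`: two threads can read the same old `data` and one result overwrites the other. The names `data_lock`, `fetch`, and `result` are introduced here for illustration and are not from the question, and `pd.concat` stands in for `DataFrame.append`, which was removed in pandas 2.0.

```python
import concurrent.futures
import threading
import time

import pandas as pd

persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']

# Option 1: keep the shared DataFrame from the question, but guard it with a lock.
# `data_lock` is an illustrative name, not part of the original code.
data = pd.DataFrame(columns=['name'])
data_lock = threading.Lock()


def update(person):
    global data
    time.sleep(0.2)  # stands in for the network request; runs outside the lock
    row = pd.DataFrame({'name': person}, index=[person])
    # Only one thread at a time may read `data`, extend it, and rebind the name.
    with data_lock:
        data = pd.concat([data, row])


with concurrent.futures.ThreadPoolExecutor() as executor:
    # list(...) consumes the results, so an exception raised in a worker surfaces here.
    list(executor.map(update, persons))

print(len(data))


# Option 2: avoid shared state. Workers return values and the DataFrame is
# built once in the main thread. `fetch` is a hypothetical helper.
def fetch(person):
    time.sleep(0.2)  # stands in for the network request
    return {'name': person}


with concurrent.futures.ThreadPoolExecutor() as executor:
    rows = list(executor.map(fetch, persons))  # results come back in input order

result = pd.DataFrame(rows, index=persons)
print(len(result))
```

The second variant is often the simpler choice when each task only produces a row: no lock is needed, and the frame is constructed once instead of being copied on every append.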

Tags: python, iteration, for-loop, threadpool, python-requests