Pandas - How can I Improve execution time of a function in a pandas dataframe?

I'm running a task over a pandas dataframe (50k+ rows), but it is slow. It actually takes around 7 seconds...

def check_uno(number, area):
    if number == 'adm' and area == 1:
        return 'uno-' + str(area)
    return area
    
%%timeit
df['area_uno'] = df.apply(lambda row: check_uno(row['number'], row['area']), axis=1)
df
>>7.16 s ± 1.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there any way I can improve this time? Any help would be appreciated! Thanks in advance!

Try this with np.where:

df['area'] = df['area'].astype(str)
df['area_uno'] = np.where(df['number'].eq('adm') & df['area'].eq("1"), 'uno-' + df['area'], df['area'])

np.where is much faster than df.apply because NumPy is implemented in C. Comparing the speed of C with Python is like comparing night and day...
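As a quick sanity check on toy data (the tiny frame below is illustrative; column names come from the question), np.where selects elementwise between the two branches:

```python
import numpy as np
import pandas as pd

# small illustrative frame with the question's columns
df = pd.DataFrame({'number': ['adm', 'usr', 'adm'],
                   'area': ['1', '1', '2']})

# row matches only when number is 'adm' AND area is "1"
out = np.where(df['number'].eq('adm') & df['area'].eq('1'),
               'uno-' + df['area'],   # value where the condition holds
               df['area'])            # value everywhere else
print(out.tolist())  # ['uno-1', '1', '2']
```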

You can use Series.mask, which lets you do a vectorized comparison:

df['area_uno'] = df['area'].mask(df['number'].eq('adm') & df['area'].eq(1), # if both conditions
                                 'uno-' + df['area'].astype(str) # replace with concatenation of "uno-" and area
                                 )
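On a small illustrative frame (toy data, not from the question), mask keeps the original value and substitutes the replacement only where the condition is True:

```python
import pandas as pd

# toy frame mirroring the question's columns
df = pd.DataFrame({'number': ['adm', 'adm', 'usr'],
                   'area': [1, 2, 1]})

# only the first row satisfies both conditions, so only it is replaced
df['area_uno'] = df['area'].mask(
    df['number'].eq('adm') & df['area'].eq(1),
    'uno-' + df['area'].astype(str)
)
print(df['area_uno'].tolist())  # ['uno-1', 2, 1]
```

Note the result column gets `object` dtype, since it mixes strings and integers.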

You can use multiprocessing to get a faster runtime:

import pandas as pd
import concurrent.futures
import time

start_time = time.time()

def split_df_into_groups_of_fix_size(df, group_size):
    # range(0, len(df), group_size) keeps the final partial chunk as well
    lst = [df.iloc[i:i + group_size] for i in range(0, len(df), group_size)]
    return lst


def check_uno_group(group):
    # per-chunk worker: the executor needs a function that takes a whole
    # chunk, not the per-row check_uno(number, area)
    group = group.copy()
    group['area_uno'] = group.apply(
        lambda row: check_uno(row['number'], row['area']), axis=1)
    return group


# number of rows to pass each process (like "batch_size of rows");
# one-row chunks make the inter-process overhead dominate, so batch them
group_size = 1000

# split df into groups of dataframes with "group_size" rows in each.
lst = split_df_into_groups_of_fix_size(df=df, group_size=group_size)
# number of processes
executor = concurrent.futures.ProcessPoolExecutor(20)
futures = [executor.submit(check_uno_group, group)
           for group in lst]
concurrent.futures.wait(futures)

# concatenate results into one dataframe
result_concat = pd.concat([res.result() for res in futures if res.result() is not None])
print("--- %s seconds ---" % (time.time() - start_time))
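The chunking step can be sanity-checked on a toy frame without spinning up any processes; note that slicing with `range(0, len(df), group_size)` keeps the final partial chunk, which a stop of `len(df) - group_size + 1` would drop:

```python
import pandas as pd

def split_df_into_groups_of_fix_size(df, group_size):
    # keep the final partial chunk too
    return [df.iloc[i:i + group_size] for i in range(0, len(df), group_size)]

# 7 rows split into chunks of 3 -> sizes 3, 3, 1
df = pd.DataFrame({'number': ['adm'] * 7, 'area': range(7)})
chunks = split_df_into_groups_of_fix_size(df, group_size=3)
print([len(c) for c in chunks])  # [3, 3, 1]
```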