Pandas - How can I improve the execution time of a function applied to a pandas dataframe?
I'm performing some tasks on a pandas dataframe (50k+ rows), but it's slow. It currently takes about 7 seconds...
def check_uno(number, area):
    if number == 'adm':
        if area == 1:
            return 'uno-' + str(area)
        else:
            return area
    else:
        return area
%%timeit
df['area_uno'] = df.apply(lambda row: check_uno(row['number'], row['area']), axis=1)
df
>>7.16 s ± 1.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there any way to improve this time?
Any help would be appreciated!
Thanks in advance!
Use np.where. Try this:
df['area'] = df['area'].astype(str)
df['area_uno'] = np.where(df['number'].eq('adm') & df['area'].eq("1"), 'uno-' + df['area'], df['area'])
np.where is much faster than df.apply because NumPy is implemented in C, and the speed difference between C and Python is night and day.
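As a rough check, here is a minimal benchmark sketch; the 50k-row frame and its value distribution are hypothetical, chosen only to mirror the question's columns:

import numpy as np
import pandas as pd

# hypothetical 50k-row frame with the same columns as the question
df = pd.DataFrame({'number': np.random.choice(['adm', 'usr'], 50_000),
                   'area': np.random.choice([1, 2, 3], 50_000)})

def check_uno(number, area):
    # same logic as the question's function
    return 'uno-' + str(area) if number == 'adm' and area == 1 else area

# row-wise apply (slow path)
%timeit df.apply(lambda r: check_uno(r['number'], r['area']), axis=1)
# vectorized np.where (fast path)
%timeit np.where(df['number'].eq('adm') & df['area'].eq(1), 'uno-' + df['area'].astype(str), df['area'])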
You can use pandas' Series.mask, which lets you do the comparison as a vectorized operation:
df['area_uno'] = df['area'].mask(df['number'].eq('adm') & df['area'].eq(1),  # if both conditions hold
                                 'uno-' + df['area'].astype(str))            # replace with "uno-" + area
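For concreteness, a minimal sketch on a hypothetical 3-row frame, showing what Series.mask returns:

import pandas as pd

toy = pd.DataFrame({'number': ['adm', 'adm', 'usr'], 'area': [1, 2, 1]})
# replace area with "uno-<area>" only where both conditions hold
out = toy['area'].mask(toy['number'].eq('adm') & toy['area'].eq(1),
                       'uno-' + toy['area'].astype(str))
print(out.tolist())  # ['uno-1', 2, 1]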
You can use multiprocessing to get a faster runtime:
import pandas as pd
import concurrent.futures
import time

def split_df_into_groups_of_fix_size(df, group_size):
    # step by group_size over the whole frame so the trailing partial group is kept
    return [df.iloc[i:i + group_size] for i in range(0, len(df), group_size)]

def process_group(group):
    # apply check_uno row-wise within this chunk and return the chunk
    group = group.copy()
    group['area_uno'] = group.apply(lambda row: check_uno(row['number'], row['area']), axis=1)
    return group

if __name__ == '__main__':
    start_time = time.time()
    # number of rows to pass to each process (a "batch size" of rows);
    # batches of a few thousand rows keep the per-task pickling overhead low
    group_size = 5000
    # split df into chunks with "group_size" rows each
    lst = split_df_into_groups_of_fix_size(df=df, group_size=group_size)
    # pool of 20 worker processes
    with concurrent.futures.ProcessPoolExecutor(20) as executor:
        futures = [executor.submit(process_group, group) for group in lst]
        concurrent.futures.wait(futures)
    # concatenate the processed chunks back into one dataframe
    result_concat = pd.concat([f.result() for f in futures])
    print("--- %s seconds ---" % (time.time() - start_time))