Finding the previous trading day in pandas is very slow

I have this DataFrame:

data1_txt = """
date
7/8/2021
7/6/2021
6/29/2021
"""

I need to get the previous trading day for each date. I managed to do this with the pandas_market_calendars Python package. Here is my full code:

import io
import pandas_market_calendars as mcal
from pandas.tseries.offsets import CustomBusinessDay
import datetime
import pandas as pd

data1_txt = """
date
7/8/2021
7/6/2021
6/29/2021
"""
df = pd.read_fwf(io.StringIO(data1_txt))
df['date'] = pd.to_datetime(df['date'])

nyse = mcal.get_calendar('NYSE')
holidays = nyse.holidays()
holidays = list(holidays.holidays)
US_BUSINESS_DAY = CustomBusinessDay(holidays=holidays)
df['date_prev'] = df['date'] - 1 * US_BUSINESS_DAY

The code does the job, but on large datasets it is very slow. Is there any way to speed it up?

P.S. When I run the code, Python gives me this warning:

PerformanceWarning: Non-vectorized DateOffset being applied to Series or DatetimeIndex
  warnings.warn(

I actually found an approach that uses np.busday_offset:

import numpy as np

def other(df):
    nyse = mcal.get_calendar('NYSE')
    holidays = nyse.holidays().holidays

    # roll="forward" first moves a non-trading date to the next trading day,
    # then the -1 offset steps back one trading day.
    # check out the docs for how to adjust roll to your preference
    result = np.busday_offset(df["date"].values.astype('datetime64[D]'),
                              [-1], roll="forward", weekmask="1111100",
                              holidays=holidays)
    return result
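
If the function is called many times, one option is to build the calendar once with np.busdaycalendar and pass it through the busdaycal argument. The following is only a sketch: the one-entry HOLIDAYS array, the NYSE_LIKE_CAL name and the previous_business_day helper are made up for illustration and stand in for nyse.holidays().holidays so that the snippet runs without pandas_market_calendars.

```python
import numpy as np

# Assumption: a hard-coded holiday list for illustration only. In practice it
# would come from nyse.holidays().holidays as in the function above.
HOLIDAYS = np.array(["2021-07-05"], dtype="datetime64[D]")

# Build the calendar once; Mon-Fri are business days, Sat/Sun are not.
NYSE_LIKE_CAL = np.busdaycalendar(weekmask="1111100", holidays=HOLIDAYS)

def previous_business_day(dates):
    """Return the last business day strictly before each date."""
    days = np.asarray(dates, dtype="datetime64[D]")
    # roll="forward" moves weekends/holidays to the next business day,
    # then the -1 offset steps back one business day.
    return np.busday_offset(days, -1, roll="forward", busdaycal=NYSE_LIKE_CAL)

# The last date is a Saturday, to show how roll="forward" behaves.
example = previous_business_day(
    ["2021-07-08", "2021-07-06", "2021-06-29", "2021-07-10"]
)
print(example)
```

Note that weekmask and holidays cannot be passed together with busdaycal; the calendar object already carries both.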

After creating a df with 10,000 rows of random dates and passing it to both your function and other, the speed went from

Stat(s) for 10 execution(s) of yours:
mean: 893.01282 ms
median: 883.1989 ms
stdv: 83.3837 ms
max: 1134.7835 ms
min: 760.0869 ms

to:

Stat(s) for 10 execution(s) of other:
mean: 278.60783 ms
median: 274.44165 ms
stdv: 27.9785 ms
max: 330.3079 ms
min: 235.4329 ms
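
The benchmark setup is not shown above, so here is a rough sketch of how such a comparison could be reproduced. Everything in it is an assumption: the short HOLIDAYS list replaces the real NYSE calendar, and make_df, slow, fast and mean_ms are hypothetical helper names. Unlike yours and other, these helpers do not load the calendar inside the timed call, so the absolute numbers will differ from the ones quoted here.

```python
import time
import warnings

import numpy as np
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Assumption: a short hard-coded holiday list instead of the NYSE calendar.
HOLIDAYS = ["2021-01-01", "2021-01-18", "2021-07-05", "2021-12-24"]

def make_df(n_rows, seed=0):
    """Build a frame of random 2021 dates (hypothetical test data)."""
    rng = np.random.default_rng(seed)
    offsets = rng.integers(0, 365, size=n_rows).astype("timedelta64[D]")
    dates = (np.datetime64("2021-01-01") + offsets).astype("datetime64[ns]")
    return pd.DataFrame({"date": dates})

def slow(df):
    """Row-by-row offset, as in the question."""
    bday = CustomBusinessDay(holidays=HOLIDAYS)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the PerformanceWarning
        return df["date"] - bday

def fast(df):
    """Vectorized offset with np.busday_offset."""
    return np.busday_offset(
        df["date"].values.astype("datetime64[D]"),
        -1,
        roll="forward",
        weekmask="1111100",
        holidays=np.array(HOLIDAYS, dtype="datetime64[D]"),
    )

def mean_ms(fn, df, repeats=3):
    """Mean wall-clock time of fn(df) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(df)
    return (time.perf_counter() - start) / repeats * 1000

df = make_df(10_000)
print(f"slow: {mean_ms(slow, df):.1f} ms, fast: {mean_ms(fast, df):.1f} ms")
```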

And using your code and sample data, both functions return the same dates:

data1_txt = """
date
7/8/2021
7/6/2021
6/29/2021
"""
df = pd.read_fwf(io.StringIO(data1_txt))
df['date'] = pd.to_datetime(df['date'])

def yours(df):
    nyse = mcal.get_calendar('NYSE')
    holidays = nyse.holidays()
    holidays = list(holidays.holidays)
    US_BUSINESS_DAY = CustomBusinessDay(holidays=holidays)
    result =  df['date'] - 1 * US_BUSINESS_DAY
    return result


display(yours(df), other(df))
>>> 
0   2021-07-07
1   2021-07-02
2   2021-06-28
Name: date, dtype: datetime64[ns]

array(['2021-07-07', '2021-07-02', '2021-06-28'], dtype='datetime64[D]')
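
As the output shows, other returns a plain NumPy datetime64[D] array rather than a Series. A minimal sketch of writing it back as a date_prev column, as in the question, could look like this. The single-entry holidays array is an assumption standing in for nyse.holidays().holidays so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["7/8/2021", "7/6/2021", "6/29/2021"], format="%m/%d/%Y")}
)

# Assumption: one hard-coded holiday in place of nyse.holidays().holidays.
holidays = np.array(["2021-07-05"], dtype="datetime64[D]")

prev = np.busday_offset(
    df["date"].values.astype("datetime64[D]"),
    -1,
    roll="forward",
    weekmask="1111100",
    holidays=holidays,
)

# Convert back to a pandas datetime column aligned on the original index.
df["date_prev"] = pd.Series(pd.to_datetime(prev), index=df.index)
print(df)
```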