Evaluate Time Dependent Rate of Return to create Pandas DataFrame

Suppose I have a Pandas DataFrame like this:

+------------+--------+
|    Date    | Price  |
+------------+--------+
| 2021-07-30 | 438.51 |
| 2021-08-02 | 437.59 |
| 2021-08-03 | 441.15 |
| 2021-08-04 | 438.98 |
+------------+--------+

The DataFrame above can be generated with the following code:

import numpy as np
import pandas as pd

data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]
        }

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25

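For a given time series, I want to compute the time-dependent rate_of_return. The problem is to determine the periods over which rate_of_return reaches its best or worst values.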

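One could simply compute rate_of_return for every possible combination, build a DataFrame containing period_start, period_end and rate_of_return, sort it in descending (best) or ascending (worst) order, and then exclude any overlapping periods.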

rate_of_return = ((period_end_price / period_start_price) ^ (compounding_days / days_in_between) - 1) * (normalisation_days / compounding_days)
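
As a quick sanity check of the formula, the sketch below evaluates it for a single period, 2021-08-02 to 2021-08-03, using the prices from the table above. The function name `annualised_return` and its keyword defaults are only illustrative; they are not part of the original code.

```python
from datetime import date

def annualised_return(start_price, end_price, start_date, end_date,
                      compounding_days=365.25, normalisation_days=365.25):
    """Rate of return for one period, following the formula above."""
    days_in_between = (end_date - start_date).days
    growth = (end_price / start_price) ** (compounding_days / days_in_between)
    return (growth - 1) * (normalisation_days / compounding_days)

# One-day period: the exponent is 365.25 / 1, so a 0.8% move is compounded heavily.
r = annualised_return(437.59, 441.15, date(2021, 8, 2), date(2021, 8, 3))
print(round(r, 4))  # approximately 18.2875
```

Short periods dominate the ranking because the exponent grows as days_in_between shrinks, which is why the one-day periods end up at both extremes of the results.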

For the DataFrame above, I computed rate_of_return with the code below:

df['rate_of_return_l1'] = ((((df.Price /
                                   df.Price[0]) **
                                  (compounding_days /
                                   (df.Date - df.Date[0]).dt.days) - 1) *
                                 (normalisation_days /
                                  compounding_days)))
# Use .loc to avoid chained assignment; row 0 is the start itself.
df.loc[0, 'rate_of_return_l1'] = np.nan

df['rate_of_return_l2'] = ((((df.Price /
                                   df.Price[1]) **
                                  (compounding_days /
                                   (df.Date - df.Date[1]).dt.days) - 1) *
                                 (normalisation_days /
                                  compounding_days)))
# .loc slicing is label-based and inclusive, so :1 blanks rows 0 and 1.
df.loc[:1, 'rate_of_return_l2'] = np.nan

df['rate_of_return_l3'] = ((((df.Price /
                                   df.Price[2]) **
                                  (compounding_days /
                                   (df.Date - df.Date[2]).dt.days) - 1) *
                                 (normalisation_days /
                                  compounding_days)))
# Inclusive label slice: :2 blanks rows 0, 1 and 2.
df.loc[:2, 'rate_of_return_l3'] = np.nan

Based on the results, the best/worst case periods are as follows:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
| 2021-08-02   | 2021-08-04 |    0.784586925 |
| 2021-07-30   | 2021-08-03 |    0.729942907 |
| 2021-07-30   | 2021-08-04 |    0.081397181 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
+--------------+------------+----------------+

Expected output

If I want to see the best rate_of_return, the resulting DataFrame would be:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
+--------------+------------+----------------+

If I want to see the worst rate_of_return, the resulting DataFrame would be:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
+--------------+------------+----------------+

Define a function to which you can pass the DataFrame and the start and end dates directly:

import numpy as np
import pandas as pd

data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]
        }

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25

def rate_ret(df, start_date, end_date):

    start = df[df.Date==start_date].iloc[0]
    end = df[df.Date==end_date].iloc[0]
    period_start_price = start.Price
    period_end_price = end.Price
    days_in_between = (end.Date - start.Date).days
    return ((period_end_price/period_start_price)**(compounding_days/days_in_between)-1) * (normalisation_days/compounding_days)

# Iterate over all possible date intervals, collecting the results in a list.
# The inner `for` loop only visits dates later than the start date:

array = []
for start_date in df.Date:
    for end_date in df.Date[df.Date>start_date]:
        array.append([rate_ret(df, start_date, end_date), start_date, end_date])
print(array)

# To extract the best and the worst periods without overlaps, walk the sorted rows
# and keep a row only if it collides with none of the intervals stored so far.
# The comparisons are strict, so periods that share an endpoint count as overlapping.

def extract_non_overlaping(df):
    saved_rows = []
    for _, row in df.iterrows():
        # The row must be clear of every saved interval, not just one of them.
        if all((row['Period End'] < saved['Period Start']) or
               (row['Period Start'] > saved['Period End'])
               for saved in saved_rows):
            saved_rows.append(row)
    return pd.DataFrame(saved_rows, columns=['Rate of Return','Period Start','Period End'])

df_higher  = pd.DataFrame(array, columns=['Rate of Return','Period Start','Period End']).reset_index(drop=True).sort_values(['Rate of Return'],ascending=False)
df_lower  = pd.DataFrame(array, columns=['Rate of Return','Period Start','Period End']).reset_index(drop=True).sort_values(['Rate of Return'])

print(extract_non_overlaping(df_higher))
print(extract_non_overlaping(df_lower))

The result for df_higher:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02   | 2021-08-03 |    18.28751739 |
+--------------+------------+----------------+

And for df_lower:

+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03   | 2021-08-04 |   -0.834880227 |
| 2021-07-30   | 2021-08-02 |   -0.225626914 |
+--------------+------------+----------------+

If the formula should not depend on time, just change the formula in the rate_ret definition.

P.S.: some optimizations are possible, but overall the code works.
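
One possible optimization, sketched here under the assumption that the dates are sorted in ascending order, is to replace the nested loops and the per-pair DataFrame lookups with NumPy index pairs. The variable names (`all_periods`, `i`, `j`) are illustrative and not from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']),
    'Price': [438.51, 437.59, 441.15, 438.98],
})
normalisation_days = 365.25
compounding_days = 365.25

# Index pairs (i, j) with i < j: every start row paired with each later end row.
i, j = np.triu_indices(len(df), k=1)

prices = df['Price'].to_numpy()
dates = df['Date'].to_numpy()

# Length of each period in days, as floats.
days = (dates[j] - dates[i]) / np.timedelta64(1, 'D')

# Same formula as rate_ret, evaluated for all pairs at once.
rates = ((prices[j] / prices[i]) ** (compounding_days / days) - 1) \
        * (normalisation_days / compounding_days)

all_periods = pd.DataFrame({
    'Rate of Return': rates,
    'Period Start': dates[i],
    'Period End': dates[j],
}).sort_values('Rate of Return', ascending=False, ignore_index=True)
print(all_periods)
```

The resulting frame has the same columns as df_higher, so it can be passed to extract_non_overlaping unchanged. The number of pairs still grows quadratically with the number of rows, so this only removes Python-level overhead.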

If I understand correctly, your question has two parts:

Part 1: Generating the combinations

To generate the combinations, you can use itertools, compute the return for each combination, and sort the results.

from itertools import combinations
rors = []
for combination in combinations(zip(df['Date'], df['Price']), 2):
    (start_date, start_price), (end_date, end_price) = combination
    ror = (end_price / start_price) ** (compounding_days / (end_date - start_date).days) - 1
    rors.append((start_date, end_date, ror))

sorted_rors = sorted(rors, key=lambda x: x[2], reverse=True)
print(sorted_rors[0])
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)

print(sorted_rors[-1])
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)

Part 2: Non-overlapping time periods

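I'm not entirely clear on this part, but I assume you want the top n returns whose time periods do not overlap. If you are looking at a large number of periods, consider using a generator function: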

def next_non_overlapping(iterable):
    # Yield periods in the given order, skipping any period that overlaps
    # one already yielded. Periods that only share an endpoint are allowed.
    accepted = []
    for start, end, ror in iterable:
        if all(start >= a_end or end <= a_start for a_start, a_end in accepted):
            accepted.append((start, end))
            yield (start, end, ror)
    print("No more items")

nno = next_non_overlapping(sorted_rors)
print(next(nno))
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)
print(next(nno))
#(Timestamp('2021-07-30 00:00:00'),
# Timestamp('2021-08-02 00:00:00'),
# -0.22562691374181088)
print(next(nno))
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)
# Pass a default so the exhausted generator does not raise StopIteration.
print(next(nno, None))
# No more items
# None

To get the lowest n returns, simply pass the reversed list to the function, i.e.:

nnor = next_non_overlapping(reversed(sorted_rors))
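
To take only the first n results without calling next() by hand, `itertools.islice` can be wrapped around the generator. The sketch below is self-contained: it uses plain `datetime.date` values instead of Timestamps, and `non_overlapping` is a hypothetical stand-in for the generator above that checks each candidate against every period already kept.

```python
from datetime import date
from itertools import combinations, islice

compounding_days = 365.25
dates = [date(2021, 7, 30), date(2021, 8, 2), date(2021, 8, 3), date(2021, 8, 4)]
prices = [438.51, 437.59, 441.15, 438.98]

# All (start, end, rate) triples, best first.
rors = []
for (start_date, start_price), (end_date, end_price) in combinations(zip(dates, prices), 2):
    ror = (end_price / start_price) ** (compounding_days / (end_date - start_date).days) - 1
    rors.append((start_date, end_date, ror))
sorted_rors = sorted(rors, key=lambda x: x[2], reverse=True)

def non_overlapping(periods):
    # Hypothetical helper: keep a period only if it overlaps none kept so far.
    accepted = []
    for start, end, ror in periods:
        if all(start >= a_end or end <= a_start for a_start, a_end in accepted):
            accepted.append((start, end))
            yield start, end, ror

n = 2  # how many periods to keep
worst_n = list(islice(non_overlapping(reversed(sorted_rors)), n))
for start, end, ror in worst_n:
    print(start, end, round(ror, 6))
```

Because the generator is lazy, islice stops as soon as n periods have been found, so the remaining candidates are never examined.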

First of all, the problem is easier if the time series is daily. So I would do this:

df = df.set_index('Date').resample('D').mean().reset_index()

This gives us:

   Date                 Price
0  2021-07-30 00:00:00  438.51
1  2021-07-31 00:00:00  nan
2  2021-08-01 00:00:00  nan
3  2021-08-02 00:00:00  437.59
4  2021-08-03 00:00:00  441.15
5  2021-08-04 00:00:00  438.98

From here you can compute the rate of return for each holding period of x days:

for holding_duration in range(1, 5):
    df[holding_duration] = df['Price'].pct_change(holding_duration).add(1).pow(365.25/holding_duration)

This gives:

   Date                 Price   1         2        3        4
0  2021-07-30 00:00:00  438.51  nan       nan      nan      nan
1  2021-07-31 00:00:00  nan     nan       nan      nan      nan
2  2021-08-01 00:00:00  nan     nan       nan      nan      nan
3  2021-08-02 00:00:00  437.59  0.464356  nan      nan      nan
4  2021-08-03 00:00:00  441.15  19.2875   2.9927   nan      nan
5  2021-08-04 00:00:00  438.98  0.16512   1.78459  1.13931  nan

This could get quite large...

From there you can take a row-wise argmax and derive the holding period from it.
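
A minimal sketch of that row-wise step, under a few assumptions that differ from the loop above: it subtracts 1 so the values match the question's rate_of_return, and it uses `shift` instead of `pct_change` so that days without a price stay NaN regardless of how the installed pandas version fills gaps. The names `daily`, `returns` and `best` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']),
    'Price': [438.51, 437.59, 441.15, 438.98],
})

# Daily grid: missing calendar days become NaN prices.
daily = df.set_index('Date')['Price'].resample('D').mean()

# One column per holding period k (calendar days); shift(k) looks back k rows,
# which is exactly k days on the daily grid.
returns = pd.DataFrame({
    k: (daily / daily.shift(k)) ** (365.25 / k) - 1
    for k in range(1, len(daily))
})

# Drop end dates with no valid return, then pick the best holding period per row.
valid = returns.dropna(how='all')
best = pd.DataFrame({
    'Period End': valid.index,
    'Holding Days': valid.idxmax(axis=1).to_numpy(),
    'Rate of Return': valid.max(axis=1).to_numpy(),
})
best['Period Start'] = best['Period End'] - pd.to_timedelta(best['Holding Days'], unit='D')
print(best)
```

This yields the best start date for each end date; it does not yet remove overlapping periods, so it would still need a filtering pass like the ones in the other answers.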

Not a complete solution, but maybe it helps.