Evaluate Time Dependent Rate of Return to create Pandas DataFrame
Suppose I have a Pandas DataFrame as follows:
+------------+--------+
| Date | Price |
+------------+--------+
| 2021-07-30 | 438.51 |
| 2021-08-02 | 437.59 |
| 2021-08-03 | 441.15 |
| 2021-08-04 | 438.98 |
+------------+--------+
The DataFrame above can be generated with the following code:
import numpy as np
import pandas as pd
data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]
        }
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25
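For the given time series, I want to compute a time-dependent rate_of_return. The problem is to identify the time periods over which rate_of_return reaches its best or worst values.
One could simply compute rate_of_return for every possible combination of dates, build a DataFrame containing period_start, period_end and rate_of_return, sort it in descending (best) or ascending (worst) order, and then exclude any overlapping periods.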
rate_of_return = ((period_end_price/period_start_price)^(compounding_days/days_in_between) - 1) * (normalisation_days/compounding_days)
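As a quick sanity check of the formula, the sketch below plugs in the 2021-08-02 to 2021-08-03 prices from the sample data. It is only a worked example under the stated values (both day counts set to 365.25), not a general implementation.

```python
from datetime import date

# Values taken from the sample data: period 2021-08-02 -> 2021-08-03
period_start_price = 437.59
period_end_price = 441.15
days_in_between = (date(2021, 8, 3) - date(2021, 8, 2)).days  # 1 day

normalisation_days = 365.25
compounding_days = 365.25

# Annualise the price ratio, subtract 1, then apply the normalisation factor
rate_of_return = (
    (period_end_price / period_start_price) ** (compounding_days / days_in_between) - 1
) * (normalisation_days / compounding_days)

print(round(rate_of_return, 4))  # 18.2875
```

A one-day gain of roughly 0.8% compounds to an annualised rate above 18, which is why very short periods dominate the best and worst rankings.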
For the DataFrame above, I calculated rate_of_return using the code below:
# Returns measured from row 0 (2021-07-30)
df['rate_of_return_l1'] = (((df.Price / df.Price[0]) **
                            (compounding_days / (df.Date - df.Date[0]).dt.days) - 1) *
                           (normalisation_days / compounding_days))
# Mask the start row itself; .loc avoids chained assignment, which may not write back to df
df.loc[df.index[:1], 'rate_of_return_l1'] = np.nan
# Returns measured from row 1 (2021-08-02)
df['rate_of_return_l2'] = (((df.Price / df.Price[1]) **
                            (compounding_days / (df.Date - df.Date[1]).dt.days) - 1) *
                           (normalisation_days / compounding_days))
df.loc[df.index[:2], 'rate_of_return_l2'] = np.nan
# Returns measured from row 2 (2021-08-03)
df['rate_of_return_l3'] = (((df.Price / df.Price[2]) **
                            (compounding_days / (df.Date - df.Date[2]).dt.days) - 1) *
                           (normalisation_days / compounding_days))
df.loc[df.index[:3], 'rate_of_return_l3'] = np.nan
Based on the results, the best/worst case periods are as follows:
+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02 | 2021-08-03 | 18.28751739 |
| 2021-08-02 | 2021-08-04 | 0.784586925 |
| 2021-07-30 | 2021-08-03 | 0.729942907 |
| 2021-07-30 | 2021-08-04 | 0.081397181 |
| 2021-07-30 | 2021-08-02 | -0.225626914 |
| 2021-08-03 | 2021-08-04 | -0.834880227 |
+--------------+------------+----------------+
Expected output
If I want to see the best rate_of_return, the resulting DataFrame would be:
+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02 | 2021-08-03 | 18.28751739 |
+--------------+------------+----------------+
If I want to see the worst rate_of_return, the resulting DataFrame would be:
+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03 | 2021-08-04 | -0.834880227 |
| 2021-07-30 | 2021-08-02 | -0.225626914 |
+--------------+------------+----------------+
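- What is the best way to test all scenarios when calculating rate_of_return?
- How can I achieve the expected output so that the periods do not overlap? (as seen in the expected output)
- The best/worst DataFrames do not depend on sign: the best DataFrame can contain negative rate_of_returns, provided no time periods overlap.
- If the formula changes to (period_end_price/period_start_price) - 1 (no time dependence), what would the approach be?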
Define a function to which you can pass the DataFrame and the start and end dates directly:
import numpy as np
import pandas as pd
data = {'Date': ['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04'],
        'Price': [438.51, 437.59, 441.15, 438.98]
        }
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
normalisation_days = 365.25
compounding_days = 365.25
def rate_ret(df, start_date, end_date):
    # Look up the rows for the two dates and apply the annualised-return formula
    start = df[df.Date==start_date].iloc[0]
    end = df[df.Date==end_date].iloc[0]
    period_start_price = start.Price
    period_end_price = end.Price
    days_in_between = (end.Date - start.Date).days
    return ((period_end_price/period_start_price)**(compounding_days/days_in_between)-1) * (normalisation_days/compounding_days)
# Iterate over all possible date intervals and collect them in a list of rows.
# In the second `for` loop, only dates later than the start date are included:
array = []
for start_date in df.Date:
    for end_date in df.Date[df.Date>start_date]:
        array.append([rate_ret(df, start_date, end_date), start_date, end_date])
print(array)
# To extract the best and the worst periods without overlap, walk the sorted rows
# and keep a row only if it does not collide with any interval already saved.
# Intervals that share an endpoint count as overlapping here.
def extract_non_overlaping(df):
    saved_rows = []
    for i,row in df.iterrows():
        # The row must be clear of every saved interval, not just one of them
        if all((row['Period End'] < saved['Period Start']) or (row['Period Start'] > saved['Period End'])
               for saved in saved_rows):
            saved_rows.append(row)
    return pd.DataFrame(saved_rows, columns=['Rate of Return','Period Start','Period End'])
df_higher = pd.DataFrame(array, columns=['Rate of Return','Period Start','Period End']).reset_index(drop=True).sort_values(['Rate of Return'],ascending=False)
df_lower = pd.DataFrame(array, columns=['Rate of Return','Period Start','Period End']).reset_index(drop=True).sort_values(['Rate of Return'])
print(extract_non_overlaping(df_higher))
print(extract_non_overlaping(df_lower))
The result for df_higher (best):
+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-02 | 2021-08-03 | 18.28751739 |
+--------------+------------+----------------+
And for df_lower (worst):
+--------------+------------+----------------+
| Period Start | Period End | Rate of Return |
+--------------+------------+----------------+
| 2021-08-03 | 2021-08-04 | -0.834880227 |
| 2021-07-30 | 2021-08-02 | -0.225626914 |
+--------------+------------+----------------+
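If the formula does not depend on time, just change the formula in the rate_ret definition.
PS: there is room for some optimisation, but overall the code works.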
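One possible optimisation is to compute every start/end pair at once with NumPy index arrays instead of nested loops and repeated row lookups. The sketch below is an assumption about what such an optimisation could look like, not part of the answer's code: the helper name all_period_returns and its time_dependent switch are hypothetical, and it assumes the dates are unique. Setting time_dependent=False gives the simple (period_end_price/period_start_price) - 1 variant.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']),
    'Price': [438.51, 437.59, 441.15, 438.98],
})

def all_period_returns(df, time_dependent=True,
                       compounding_days=365.25, normalisation_days=365.25):
    """Hypothetical helper: rate of return for every (start, end) pair with start before end."""
    df = df.sort_values('Date').reset_index(drop=True)
    prices = df['Price'].to_numpy()
    dates = df['Date'].to_numpy()
    # Index pairs (i, j) with i < j: every start row paired with every later row
    i, j = np.triu_indices(len(df), k=1)
    ratio = prices[j] / prices[i]
    if time_dependent:
        # Calendar days between the two dates, as floats (assumes unique dates)
        days = (dates[j] - dates[i]) / np.timedelta64(1, 'D')
        ror = (ratio ** (compounding_days / days) - 1) * (normalisation_days / compounding_days)
    else:
        ror = ratio - 1  # simple return, no annualisation
    return pd.DataFrame({
        'Period Start': dates[i],
        'Period End': dates[j],
        'Rate of Return': ror,
    })

annualised = all_period_returns(df).sort_values('Rate of Return', ascending=False)
simple = all_period_returns(df, time_dependent=False).sort_values('Rate of Return', ascending=False)
print(annualised)
print(simple)
```

Either result can then be passed to a non-overlap filter unchanged, because the column names match the ones used above.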
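If I understand correctly, your question has two parts:
Part 1: Generating combinations
To generate the combinations, you can use itertools, compute the return for each combination, and sort the results.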
from itertools import combinations
rors = []
for combination in combinations(zip(df['Date'], df['Price']), 2):
    (start_date, start_price), (end_date, end_price) = combination
    ror = (end_price / start_price) ** (compounding_days / (end_date - start_date).days) - 1
    rors.append((start_date, end_date, ror))
sorted_rors = sorted(rors, key=lambda x: x[2], reverse=True)
print(sorted_rors[0])
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)
print(sorted_rors[-1])
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)
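Since the question asks for a DataFrame rather than a list of tuples, the sorted combinations can be wrapped in one. This is a minimal sketch that repeats the sample data so it runs on its own; the variable name ror_df is an assumption, and the column names are taken from the tables in the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']),
    'Price': [438.51, 437.59, 441.15, 438.98],
})
compounding_days = 365.25

# One (start, end, rate) tuple per pair of rows, earlier row first
rors = [
    (start_date, end_date,
     (end_price / start_price) ** (compounding_days / (end_date - start_date).days) - 1)
    for (start_date, start_price), (end_date, end_price)
    in combinations(zip(df['Date'], df['Price']), 2)
]

# Best periods first; ignore_index renumbers the rows after sorting
ror_df = (
    pd.DataFrame(rors, columns=['Period Start', 'Period End', 'Rate of Return'])
    .sort_values('Rate of Return', ascending=False, ignore_index=True)
)
print(ror_df)
```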
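Part 2: Non-overlapping time periods
I'm not entirely clear on this part, but I'm guessing you want to find the top n returns with non-overlapping time periods. If you are looking at a large number of periods, you could consider using a generator function: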
def next_non_overlapping(iterable):
    # Yield periods in the order given, skipping any period that overlaps one
    # already yielded (checked against all of them, not only the latest).
    # Periods that merely share an endpoint are not treated as overlapping.
    kept = []
    for start, end, ror in iterable:
        if all(start >= kept_end or end <= kept_start for kept_start, kept_end in kept):
            kept.append((start, end))
            yield (start, end, ror)
nno = next_non_overlapping(sorted_rors)
print(next(nno))
#(Timestamp('2021-08-02 00:00:00'),
# Timestamp('2021-08-03 00:00:00'),
# 18.28751738702541)
print(next(nno))
#(Timestamp('2021-07-30 00:00:00'),
# Timestamp('2021-08-02 00:00:00'),
# -0.22562691374181088)
print(next(nno))
#(Timestamp('2021-08-03 00:00:00'),
# Timestamp('2021-08-04 00:00:00'),
# -0.8348802270491325)
# Pass a default so an exhausted generator does not raise StopIteration
print(next(nno, "No more items"))
# No more items
To get the lowest n returns, you can simply pass the reversed list to the function, i.e.
nnor = next_non_overlapping(reversed(sorted_rors))
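To take only the top or bottom n periods, the generator can be combined with itertools.islice. The sketch below is a self-contained variation, not the code above: the function name non_overlapping and the allow_touching flag are hypothetical, the rates are copied from the table in the question, and dates are kept as ISO strings on the assumption that they compare chronologically. The flag matters because the generator above accepts periods that share an endpoint, while the expected output in the question appears to treat them as overlapping.

```python
from itertools import islice

# (period_start, period_end, rate_of_return), best first; values from the question's table
sorted_rors = [
    ('2021-08-02', '2021-08-03', 18.28751739),
    ('2021-08-02', '2021-08-04', 0.784586925),
    ('2021-07-30', '2021-08-03', 0.729942907),
    ('2021-07-30', '2021-08-04', 0.081397181),
    ('2021-07-30', '2021-08-02', -0.225626914),
    ('2021-08-03', '2021-08-04', -0.834880227),
]

def non_overlapping(periods, allow_touching=False):
    """Yield periods in order, skipping any that overlap a period already yielded."""
    kept = []
    for start, end, ror in periods:
        if allow_touching:
            free = all(start >= k_end or end <= k_start for k_start, k_end in kept)
        else:
            # Sharing an endpoint counts as an overlap
            free = all(start > k_end or end < k_start for k_start, k_end in kept)
        if free:
            kept.append((start, end))
            yield start, end, ror

n = 2
best = list(islice(non_overlapping(sorted_rors), n))
worst = list(islice(non_overlapping(reversed(sorted_rors)), n))
print(best)   # one period only: every other period touches or overlaps it
print(worst)  # 2021-08-03 -> 2021-08-04, then 2021-07-30 -> 2021-08-02
```

With allow_touching=True the same input yields three periods, matching the three items printed by the generator above.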
First, the problem is easier if the time series is daily. So I would do this:
df = df.set_index('Date').resample('D').mean().reset_index()
This gives us:
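+---+---------------------+--------+
|   | Date                | Price  |
+---+---------------------+--------+
| 0 | 2021-07-30 00:00:00 | 438.51 |
| 1 | 2021-07-31 00:00:00 | nan    |
| 2 | 2021-08-01 00:00:00 | nan    |
| 3 | 2021-08-02 00:00:00 | 437.59 |
| 4 | 2021-08-03 00:00:00 | 441.15 |
| 5 | 2021-08-04 00:00:00 | 438.98 |
+---+---------------------+--------+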
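From here you can calculate the rate of return over x days:
# Each column holds the annualised growth factor (1 + rate of return) over that many days
for holding_duration in range(1, 5):
    df[holding_duration] = df['Price'].pct_change(holding_duration).add(1).pow(365.25/holding_duration)
This gives: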
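+---+---------------------+--------+----------+---------+---------+-----+
|   | Date                | Price  | 1        | 2       | 3       | 4   |
+---+---------------------+--------+----------+---------+---------+-----+
| 0 | 2021-07-30 00:00:00 | 438.51 | nan      | nan     | nan     | nan |
| 1 | 2021-07-31 00:00:00 | nan    | nan      | nan     | nan     | nan |
| 2 | 2021-08-01 00:00:00 | nan    | nan      | nan     | nan     | nan |
| 3 | 2021-08-02 00:00:00 | 437.59 | 0.464356 | nan     | nan     | nan |
| 4 | 2021-08-03 00:00:00 | 441.15 | 19.2875  | 2.9927  | nan     | nan |
| 5 | 2021-08-04 00:00:00 | 438.98 | 0.16512  | 1.78459 | 1.13931 | nan |
+---+---------------------+--------+----------+---------+---------+-----+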
This can get quite large...
From there you can do a row-wise argmax and derive the holding period from it.
Not a complete solution, but maybe it helps.
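The row-wise argmax step could be sketched as follows. This is an assumption about how to finish the idea, not the answerer's code: it divides by a shifted price series instead of using pct_change, so a holding period is only valid when both endpoints have a real quote, and it subtracts 1 to get a rate rather than a growth factor. The names rates, best_holding and summary are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-07-30', '2021-08-02', '2021-08-03', '2021-08-04']),
    'Price': [438.51, 437.59, 441.15, 438.98],
})

# Put the prices on a daily grid; days without a quote become NaN
price = df.set_index('Date')['Price'].asfreq('D')

# Column h: annualised rate of return over the h days ending at each row's date
rates = pd.DataFrame({
    h: (price / price.shift(h)) ** (365.25 / h) - 1
    for h in range(1, len(price))
})

# Drop dates with no valid holding period before taking the row-wise argmax
valid = rates.dropna(how='all')
best_holding = valid.idxmax(axis=1).astype(int)  # holding period in days
best_rate = valid.max(axis=1)

summary = pd.DataFrame({
    'Period Start': valid.index - pd.to_timedelta(best_holding.to_numpy(), unit='D'),
    'Period End': valid.index,
    'Rate of Return': best_rate.to_numpy(),
})
print(summary)
```

Replacing idxmax and max with idxmin and min gives the worst holding period per end date; overlapping periods would still have to be filtered out afterwards.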