Calculate decile ranking in pandas for a large dataset

Consider the following dataset:

Date ticker overnight_return
2017-07-20 CLXT 0.019556
2017-07-21 CLXT 0.039778
2022-02-14 ETNB -0.006186
2022-02-15 ETNB 0.024590

I am currently testing a hypothesis based on the overnight return factor. I want to first apply a ranking to all unique values within each Date, then z-score the ranks, and finally bucket them into deciles.

I obtained the final z-scores for a single date with the following code:

import scipy.stats as stats
stats.zscore(equity_daily[equity_daily.Date == "2017-07-20"].overnight_return.rank().dropna().values)

Now I would like to get the z-scores for every day, based on the ranks of all tickers on that day.

My approach was to pivot the table and then build a new table containing the z-scores.

equity_daily.pivot(columns = "ticker", values = "overnight_return", index = "Date")

But I get the following error:

ValueError: Index contains duplicate entries, cannot reshape

Desired result:

Date ticker overnight_return Decile_rank
2017-07-20 CLXT 0.019556 0
2017-07-21 CLXT 0.039778 2
2022-02-14 ETNB -0.006186 9
2022-02-15 ETNB 0.024590 8

Without a larger data sample it is hard to test this myself, but...

Try pivot_table() instead of pivot(); pivot() does not aggregate duplicate entries.
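A minimal sketch of the difference, using a made-up frame with a duplicated Date/ticker pair (the values here are illustrative, not from your data):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-20", "2017-07-21"],
    "ticker": ["CLXT", "CLXT", "CLXT"],  # duplicate (Date, ticker) pair
    "overnight_return": [0.01, 0.03, 0.02],
})

# pivot() requires unique (index, columns) pairs and raises ValueError here
try:
    df.pivot(columns="ticker", values="overnight_return", index="Date")
except ValueError as e:
    print(e)  # Index contains duplicate entries, cannot reshape

# pivot_table() aggregates the duplicates instead of failing (mean by default)
wide = df.pivot_table(columns="ticker", values="overnight_return",
                      index="Date", aggfunc="mean")
print(wide)
```

Note that the aggregation silently averages the duplicated rows; if duplicates indicate a data problem, it may be safer to deduplicate first with drop_duplicates().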

from alphalens.tears import (create_returns_tear_sheet,
                             create_information_tear_sheet,
                             create_turnover_tear_sheet,
                             create_summary_tear_sheet,
                             create_full_tear_sheet,
                             create_event_returns_tear_sheet,
                             create_event_study_tear_sheet)

from alphalens.utils import get_clean_factor_and_forward_returns

import pandas as pd
import scipy.stats as stats

def z_score(x):
    """Helper function for normalization."""
    return stats.zscore(x)

equity_daily["overnight_rank"] = equity_daily.groupby("Date")["overnight_return"].rank(method = "first")
# transform (rather than apply) keeps the result aligned with the original rows
equity_daily["overnight_normalized"] = equity_daily.groupby("Date")["overnight_rank"].transform(z_score)
equity_daily["overnight_normalized"] = equity_daily.overnight_normalized.shift(-1)
equity_daily = equity_daily.dropna()

factor = equity_daily[["Date", "ticker", "overnight_normalized"]].\
                groupby([pd.Grouper(key = "Date"), "ticker"]).sum()

prices = equity_daily.pivot(columns = "ticker", values = "Close", index = "Date")

factor_data = get_clean_factor_and_forward_returns(
    factor = factor,
    prices = prices,
    groupby = None,
    binning_by_group = False,
    quantiles = 10,
    bins = None,
    periods = (1, 5, 10),
    filter_zscore = 20,
    groupby_labels = None,
    max_loss = 0.35
)
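The alphalens call above handles the decile binning internally via quantiles = 10. If you only need the Decile_rank column from your desired output, a pandas-only sketch is possible with pd.qcut; the equity_daily frame below is a synthetic stand-in (10 hypothetical tickers over 2 dates) since the posted sample is too small to cut into ten bins:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for equity_daily: 10 tickers over 2 dates
rng = np.random.default_rng(0)
tickers = [f"T{i}" for i in range(10)]
equity_daily = pd.DataFrame({
    "Date": np.repeat(["2017-07-20", "2017-07-21"], 10),
    "ticker": tickers * 2,
    "overnight_return": rng.normal(0, 0.02, 20),
})

# 1. Rank within each Date (method="first" breaks ties deterministically)
equity_daily["overnight_rank"] = (
    equity_daily.groupby("Date")["overnight_return"].rank(method = "first")
)

# 2. Z-score the ranks within each Date (transform keeps row alignment)
equity_daily["overnight_normalized"] = (
    equity_daily.groupby("Date")["overnight_rank"].transform(stats.zscore)
)

# 3. Cut each day's ranks into deciles labeled 0..9
equity_daily["Decile_rank"] = (
    equity_daily.groupby("Date")["overnight_rank"]
                .transform(lambda r: pd.qcut(r, 10, labels = False))
)
```

Since the input to qcut is the per-date rank (all distinct values thanks to method="first"), the ten bin edges are always unique, which avoids the "Bin edges must be unique" error qcut can raise on heavily tied raw returns.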