Replacing two convoluted for loops with vectorization

Question

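This is my first question, so please let me know if I'm doing anything wrong.

I have two convoluted for loops that work but are far too slow. I know I should use vectorization to speed them up, but I don't understand how to do that in my case. Any help would be appreciated. The background: I need to calculate the sum of squared differences (SSD) of the daily price data of different stocks over one year. I need to calculate this SSD for every possible combination of two stocks from a list of 1046 stocks, which are stored (together with their price data) in a pandas DataFrame.

So far I have two for loops that only calculate the SSD for every combination of the first stock in the list with each of the other stocks. For now I would be happy to vectorize these two loops to make them faster. I have already tried using while loops or wrapping them in a function, but that did not improve the speed as much as I need. If there is a better approach than vectorization, please tell me that I'm on the wrong track.

My DataFrame "formation_period_1_1991", from which I extract the price data, basically looks like this (where "PERMNO" is the identifier of an individual stock):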

import numpy as np
import pandas as pd

data = [['99000', 10], ['99000', 11], ['99000', 12],
        ['98000', 3], ['98000', 2], ['98000', 5],
        ['97000', 9], ['97000', 11], ['97000', 10]]
formation_period_1_1991 = pd.DataFrame(data, columns=['PERMNO', 'Price'])

Then I define a matrix to hold the calculated SSD values:

Axis_for_SSD_Matrix = formation_period_1_1991["PERMNO"].unique().tolist()
SSD_Matrix = pd.DataFrame(index=np.arange(formation_period_1_1991["PERMNO"].nunique()), columns=np.arange(formation_period_1_1991["PERMNO"].nunique()))
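# set_axis returns a new object; recent pandas versions no longer accept inplace=True here
SSD_Matrix = SSD_Matrix.set_axis(Axis_for_SSD_Matrix, axis="index")
SSD_Matrix = SSD_Matrix.set_axis(Axis_for_SSD_Matrix, axis="columns")

Finally, I calculate the SSDs for the first row of SSD_Matrix with two for loops: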

x = 3  # number of trading days per stock
no_of_considered_shares = formation_period_1_1991["PERMNO"].nunique()

for j in range(1, no_of_considered_shares):
    SSD_calc = 0
    for i in range(x):  # loop over the trading days
        # squared price difference between the first stock and stock j on day i
        SSD_calc += (formation_period_1_1991.iloc[i]["Price"] - formation_period_1_1991.iloc[i + x * j]["Price"]) ** 2
    SSD_Matrix.loc[formation_period_1_1991.iloc[0]["PERMNO"], formation_period_1_1991.iloc[x * j]["PERMNO"]] = SSD_calc

After I run the code, SSD_Matrix looks like this:

   index  99000  98000  97000
0  99000    nan    179      5
1  98000    nan    nan    nan
2  97000    nan    nan    nan

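So far it works the way I want, but my real DataFrame "formation_period_1_1991" has 1046 stocks with 253 trading days each, so I would be very glad of any help in speeding up these two for loops (by vectorization, I guess). Thank you very much!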

Answer 1

Here you go:

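# Re-index each stock's rows by trading day (0 .. n_days-1) so that pivot aligns them.
# The modulus is the number of trading days per stock (3 here, 253 in the real data).
# Assumes the default RangeIndex and the same number of consecutive rows per stock.
n_days = len(formation_period_1_1991) // formation_period_1_1991['PERMNO'].nunique()
formation_period_1_1991.index = formation_period_1_1991.index % n_days
df = formation_period_1_1991.pivot(columns='PERMNO', values='Price')
arr = df.to_numpy()  # shape (n_days, n_stocks)

def combinations(arr):
    # All index pairs (i, j) with i < j, i.e. the strict upper triangle
    n = arr.shape[0]
    upper = np.tri(n, n, -1, dtype='bool').T
    a, b = np.meshgrid(arr, arr)
    return b[upper].reshape(-1), a[upper].reshape(-1)

n = arr.shape[1]
a, b = combinations(np.arange(n))

out = np.zeros((n, n))
out[a, b] = ((arr[:, a] - arr[:, b]) ** 2).sum(axis=0)  # SSD of every pair at once
out[b, a] = out[a, b]  # mirror into the lower triangle
out_df = pd.DataFrame(out)
out_df.columns = df.columns
out_df.index = df.columns.values
out_df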

This gives me:

PERMNO  97000  98000  99000
97000     0.0  142.0    5.0
98000   142.0    0.0  179.0
99000     5.0  179.0    0.0

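Note that I actually only compute the upper triangle of the matrix. I simply assume that the lower triangle mirrors the upper one and that the diagonal is always zero, which holds here because the SSD is symmetric in the two stocks and a stock's SSD with itself is zero.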
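
As a possible simplification of the upper-triangle step, NumPy's built-in np.triu_indices should yield the same index pairs as the hand-written combinations helper. The sketch below is only an illustration: it hard-codes the toy price matrix (rows are trading days, columns are the stocks 97000, 98000, 99000) rather than building it from the DataFrame.

```python
import numpy as np

# Toy price matrix from the example: rows = trading days,
# columns = stocks 97000, 98000, 99000 (hard-coded for illustration).
arr = np.array([[9, 3, 10],
                [11, 2, 11],
                [10, 5, 12]], dtype=float)
n = arr.shape[1]

# Row and column indices of the strict upper triangle (k=1 skips the diagonal).
a, b = np.triu_indices(n, k=1)

out = np.zeros((n, n))
out[a, b] = ((arr[:, a] - arr[:, b]) ** 2).sum(axis=0)  # SSD per pair
out = out + out.T  # mirror the upper triangle; the diagonal stays zero
print(out)
```

Adding the transpose is one way to fill the lower triangle; it relies on the diagonal and the lower triangle being zero beforehand.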

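With 1046 stocks and 253 days, indexing all pairs at once creates intermediate arrays with roughly 253 x 546,535 entries each, which may be heavy on memory. A different technique (not the one used in the answer above) expands the square: SSD(i, j) = sum(p_i^2) + sum(p_j^2) - 2 * sum(p_i * p_j), so the whole matrix follows from a single matrix product. The function name pairwise_ssd and the hard-coded toy matrix are assumptions made for this sketch.

```python
import numpy as np

# Toy price matrix from the example: rows = trading days,
# columns = stocks 97000, 98000, 99000 (hard-coded for illustration).
arr = np.array([[9, 3, 10],
                [11, 2, 11],
                [10, 5, 12]], dtype=float)

def pairwise_ssd(prices):
    """Hypothetical helper: (n_stocks, n_stocks) matrix of sums of squared differences."""
    sq_norms = (prices ** 2).sum(axis=0)   # sum over days of p_i^2, one value per stock
    gram = prices.T @ prices               # sum over days of p_i * p_j
    ssd = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.fill_diagonal(ssd, 0.0)             # remove round-off on the diagonal
    return np.maximum(ssd, 0.0)            # clip tiny negative round-off values

ssd = pairwise_ssd(arr)
print(ssd)
```

One caveat: when prices are large and the differences between two stocks are small, this expansion can lose precision through cancellation, so it is worth comparing against the direct difference on a few pairs before relying on it.
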
Tags: python, numpy, vectorization, pandas