Interpolation and appending outputs to a list is taking forever in a for loop

Question

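I have attached a test Excel file named 'df2.xlsx' here: https://docs.google.com/spreadsheets/d/1U55lXyZSYguiQUH0AOB_v8yhKcbMGNQs/edit?usp=sharing&ouid=102781316443126205856&rtpof=true&sd=true It has 58673 rows. I imported it as a DataFrame and use the code below to compute 'D50' by linear interpolation with interp1d. D50 is the 50th-percentile value, which is why I need to interpolate. The columns I interpolate over are con13c, con12c, con2c, con3c, ......, con11c, con14c. The column indices of con13c and con14c are 17 and 29. I use append() to store each output in an initially empty list. However, the code is slow.

The main Excel/text file will have 4928526 rows instead of the 58673 rows in the attached file, and the D50 calculation for the main file takes more than 20 minutes. Let me know if there is a way to speed this up by reading the df chunk by chunk and running on multiple processors. The main Excel file will contain hundreds of different TS values, and each TS value will have 58673 rows. So in the test Excel file 'df2.xlsx', all the data is for one particular TS only. Thanks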

import pandas as pd
import numpy as np
from scipy.interpolate import interp1d


dt=pd.read_excel('df2.xlsx', index_col=0) 

# check column index
dt.columns.get_loc("con14c")
x=[0.00001, 0.00004675, 0.000088,   0.000177,   0.000354,   0.000707,   0.001414,   0.002828,   
                   0.005657,    0.011314,   0.022627,   0.045254,   0.6096]
x=np.array(x)
xx=np.log(x)
dfs =[]
for i in range(0, len(dt)): # loop through the rows of dt
    y1=dt.iloc[i,17:30]

    y1=np.array(y1,dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    f = interp1d( y1,xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0])) #fill_value='extrapolate'
    x_new=np.exp(f(.5))
    print(x_new)  # x_new is already back-transformed with np.exp above
    dfs.append(x_new)
dt['D50']=dfs

Answer 1

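I ran a test on my PC, and a few simple changes cut the runtime to roughly 30% of the original. It took 9 seconds before and now takes only about 2 seconds.

Remove the print.
Don't index the pd.DataFrame row by row, because that is very slow. Convert it to a NumPy array first and index that instead: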

# outside the for loop
dt_arr = dt.values

# ... other codes

y1 = dt_arr[i,17:30]
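
To see the difference on your own machine, here is a minimal timing sketch. It uses a synthetic DataFrame as a stand-in for the real file (the 2000 x 30 shape and the names demo, rows_via_iloc and rows_via_numpy are made up for illustration), so the absolute numbers will not match the real data:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumed shape: 2000 rows, 30 columns).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((2000, 30)))

def rows_via_iloc(df):
    # Slices every row through the DataFrame indexer, as in the question.
    return [np.array(df.iloc[i, 17:30], dtype=float) for i in range(len(df))]

def rows_via_numpy(df):
    # Converts to a NumPy array once, then slices the plain array.
    arr = df.to_numpy()
    return [arr[i, 17:30] for i in range(len(arr))]

start = time.perf_counter()
slow = rows_via_iloc(demo)
t_iloc = time.perf_counter() - start

start = time.perf_counter()
fast = rows_via_numpy(demo)
t_numpy = time.perf_counter() - start

print(f"iloc: {t_iloc:.4f} s, numpy: {t_numpy:.4f} s")
```

Both functions return the same row slices; only the access path differs.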

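Since you are only interested in the value at 0.5, you can interpolate directly between the two neighbouring points instead of building an interp1d object for every row (dfs1 below is the original approach, dfs2 is the direct one):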

dfs1 = []
dfs2 = []
for i in range(0, len(dt)): # loop through the rows of dt
    y1=dt_arr[i, 17:30]

    y1=np.array(y1,dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    f = interp1d( y1,xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0])) #fill_value='extrapolate'
    x_new=np.exp(f(.5))
    dfs1.append(x_new)
    
    # I don't know if your data is sorted, if so you can ignore this part
    sort_idx = np.argsort(y1)
    xx_sorted = xx[sort_idx]
    y1_sorted = y1[sort_idx]
    # I think your fill value is a bit weird as you are using same values for both ends. You might want to check that
    if y1_sorted[-1] < 0.5 or y1_sorted[0] > 0.5:
        dfs2.append(y1[0])
    else:
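        # first point at or above 0.5, but never index 0, so idx-1 stays valid
        # even when the smallest or largest value is exactly 0.5
        idx = max(int(np.argmax(y1_sorted >= 0.5)), 1)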
        x0 = xx_sorted[idx-1]
        x1 = xx_sorted[idx]
        z0 = y1_sorted[idx-1]
        z1 = y1_sorted[idx]
        dfs2.append(np.exp(x0 + (0.5-z0)*(x1-x0)/(z1-z0)))
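
If the loop is still too slow for the full file, the same bracketing idea can probably be applied to all rows at once with NumPy, with no Python-level loop. The following is only a sketch under assumptions: d50_vectorized is a hypothetical helper name, the data contains no NaN values, and the out-of-range fallback copies the first column as in the loop above. The demo rows are invented to exercise each branch:

```python
import numpy as np

def d50_vectorized(y, log_x, target=0.5):
    """Interpolate log_x at y == target for every row of y at once."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(y, axis=1)                    # sort each row
    y_sorted = np.take_along_axis(y, order, axis=1)
    x_sorted = log_x[order]                          # shape (n_rows, n_cols)

    # Position of the first value >= target per row, kept inside [1, n_cols - 1].
    idx = np.clip((y_sorted < target).sum(axis=1), 1, y.shape[1] - 1)[:, None]
    z0 = np.take_along_axis(y_sorted, idx - 1, axis=1)[:, 0]
    z1 = np.take_along_axis(y_sorted, idx, axis=1)[:, 0]
    x0 = np.take_along_axis(x_sorted, idx - 1, axis=1)[:, 0]
    x1 = np.take_along_axis(x_sorted, idx, axis=1)[:, 0]

    gap = z1 - z0
    gap = np.where(gap == 0, 1.0, gap)               # avoid 0/0 on tied values
    result = np.exp(x0 + (target - z0) * (x1 - x0) / gap)

    # Same fallback as the loop: first column when target is out of range.
    outside = (y_sorted[:, -1] < target) | (y_sorted[:, 0] > target)
    return np.where(outside, y[:, 0], result)

# The x grid from the question, on a log scale.
x = np.array([0.00001, 0.00004675, 0.000088, 0.000177, 0.000354, 0.000707,
              0.001414, 0.002828, 0.005657, 0.011314, 0.022627, 0.045254,
              0.6096])
log_x = np.log(x)

# Invented demo rows: 0.5 on a grid point, 0.5 out of range, 0.5 between points.
y = np.vstack([np.linspace(0.0, 1.0, 13),
               np.linspace(0.0, 0.4, 13),
               np.linspace(0.2, 1.0, 13)])
d50 = d50_vectorized(y, log_x)
print(d50)
```

On the real data this would be called as dt['D50'] = d50_vectorized(dt_arr[:, 17:30], xx). It is worth checking the result against dfs2 on the test file before relying on it.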


Tags: python, parallel-processing, performance, append, linear-interpolation