How to incorporate pd.concat() into my for-loops for much faster computing?

Question

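[Begin background information]

I am working on optimizing fixed solar panels. I have real irradiance data collected every 15 minutes over several years.

I am running a Python program that tests different combinations of solar panel tilt and azimuth in order to maximize the irradiance received from the sun. If you are curious, or not familiar with PV terminology:

You can think of tilt as the angle at which the panel is "pitched" (a panel facing straight up is 0 degrees; a panel facing the horizon is 90 degrees).
You can think of azimuth as the angle to which the panel is "turned", ranging from 0 to 360 degrees (north = 0/360, east = 90, south = 180, west = 270).

Below is a simple case in which I tested the percentage energy gain at different azimuth angles for just one day of readings. In this simple example I test azimuth at 45-degree intervals: 90 (east), 135 (southeast), 180 (south), 225 (southwest), and 270 (west) degrees. This code isn't perfect, but it works.

Code: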

# For the AFTERNOON THUNDERSTORM Dataset from above:
# Now, I will model the gain in energy due to transposition from GHI --> POA
    #NOTE: Changing only AZIMUTH ANGLE at Fixed Tilt

df_hyp_gain_az = pd.DataFrame()

def calculate_poa_hyp(rawdata,solar_position,surface_tilt,surface_azimuth):
    poa = pvlib.irradiance.get_total_irradiance(
        surface_tilt=surface_tilt,
        surface_azimuth=surface_azimuth,
        dni=dirint_dni_hyp, # calculated from before
        ghi=df_hyp['Solar Radiation(W/m^2)'], # this is the raw data
        dhi=calculated_dhi_hyp, # calculated from before
        dni_extra=dni_et_hyp, # calculated from before
        solar_zenith=solpos_hyp['apparent_zenith'], # calculated from before
        solar_azimuth=solpos_hyp['azimuth'], # calculated from before
        surface_type='grass',
        model='haydavies')
    return poa['poa_global'] # returns the total in-plane irradiance

for azimuth in range(90,271,45): # scans from east(90) to west(270)
    # NOTE: Hardcoding Tilt=FLAT for all cases
    poa_irradiance_hyp_az = calculate_poa_hyp(
        rawdata=df_hyp,
        solar_position=solpos_hyp,
        surface_tilt=alma.latitude,
        surface_azimuth=azimuth)
    column_name_hyp_az = f"AZ-{azimuth}"
    df_hyp_gain_az[column_name_hyp_az] = poa_irradiance_hyp_az

# calculate the % difference from GHI
ghi_hyp = df_hyp['Solar Radiation(W/m^2)']
df_hyp_gain_az = 100 * (df_hyp_gain_az.divide(ghi_hyp, axis=0)-1)

plt.figure()
df_hyp_gain_az.plot().get_figure().set_facecolor('white')
plt.xlabel('Hour of Day')
plt.ylabel('Hourly Transposition Gain [%]')
plt.title('Aftn. Thunderstorm - Energy Gain From Changing Surface Azimuth',size='x-large',weight='demibold');
plt.xlim('1997-07-07 06:00:00-04:00','1997-07-07 21:00:00-04:00');

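Output: [plot: Energy Gain From Changing Surface Azimuth]

I also tested tilt and azimuth together, with tilt at 10-degree intervals and azimuth at 45-degree intervals. To find which combination of tilt and azimuth returns the highest irradiance, I successfully used from scipy import integrate. The two nested for loops and the integration for each case work well; there are no problems here:

Code: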

# Integrating over both change in AZIMUTH and TILT

df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain

for tilt in range(0,91,10): 
    for azimuth in range(0,316,45):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)

df_hyp_gain_both_sec = df_hyp_gain_both
df_hyp_gain_both_sec.index = df_hyp_gain_both.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,10):
    for azimuth in range(0,316,45):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp

df_hyp_poa_sec = df_hyp_poa
df_hyp_poa_sec.index = df_hyp_poa.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))
print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')

Output:

Integrating the flat POA Irradiance [W/m^2]:
AZ=90|FT=40     2.075620e+07
AZ=90|FT=50     2.055457e+07
AZ=90|FT=30     2.048603e+07
AZ=90|FT=60     1.988685e+07
AZ=90|FT=20     1.975235e+07
                    ...     
AZ=270|FT=80    5.838635e+06
AZ=315|FT=80    5.648596e+06
AZ=225|FT=90    5.409111e+06
AZ=270|FT=90    5.291405e+06
AZ=315|FT=90    5.225395e+06
Length: 79, dtype: float64
--------------------------------------------
Integrating the Energy Gain [%]:
AZ=90|FT=50     1.218214e+06
AZ=90|FT=40     1.206950e+06
AZ=90|FT=60     1.107202e+06
AZ=90|FT=30     1.074007e+06
AZ=45|FT=40     9.597149e+05
                    ...     
AZ=315|FT=80   -2.598201e+06
AZ=180|FT=90   -2.720371e+06
AZ=225|FT=90   -2.777403e+06
AZ=270|FT=90   -2.790663e+06
AZ=315|FT=90   -2.802186e+06
Length: 79, dtype: float64
--------------------------------------------
Azimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.

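The code above executes correctly in under 10 seconds, because it only covers a single day of data.

[End background information]

Now, on to my question: because I only analyzed tilt angles at 10-degree intervals and azimuth angles at 45-degree intervals, I want more accurate results. So I reduced the tilt and azimuth intervals to analyze every 1 degree.

Code (the only difference from the previous code is the changed range() arguments):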

df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain

for tilt in range(0,91,1): 
    for azimuth in range(0,360,1):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)

df_hyp_gain_both_sec = df_hyp_gain_both
df_hyp_gain_both_sec.index = df_hyp_gain_both.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,1):
    for azimuth in range(0,360,1):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp

df_hyp_poa_sec = df_hyp_poa
df_hyp_poa_sec.index = df_hyp_poa.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))
print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')

Output:

C:\Users\jmand\AppData\Local\Temp\ipykernel_1057687560089.py:13: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

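This output repeats over and over for a long time with no result. I tried searching this site for a solution (I found "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance" and many similar questions). I understand that I need to use pd.concat() in some shape or form for my df_hyp_gain_both DataFrame. However, I can't even set it up. I need to somehow use both column_name_hyp_both and poa_irradiance_hyp_both.

With the correct syntax, how can I incorporate pd.concat() into my for-loop to avoid this PerformanceWarning: DataFrame is highly fragmented warning?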

Answer 1

With all the solar-specific parts stripped away, the problem basically boils down to:

import pandas as pd

df = pd.DataFrame()
dummy_data = pd.Series(0, index=pd.date_range('2019-01-01', freq='h', periods=8760))

for i in range(200):  # 200 just as an example
    df[f'col_{i}'] = dummy_data.copy()

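On my computer, an alternative that is one to two orders of magnitude faster is to accumulate the columns in a dictionary and convert to a DataFrame only once, after the loop: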

results = {}
for i in range(200):
    results[f'col_{i}'] = dummy_data.copy()

df = pd.DataFrame(results)
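
Applied to the nested tilt/azimuth loop from the question, the same idea might look like the sketch below. It is only an illustration: fake_poa is a hypothetical stand-in for calculate_poa_hyp (so the example runs without pvlib or the original data), and the index is an assumed single day of 15-minute timestamps. Passing a dictionary of Series to pd.concat with axis=1 joins all columns in one step and uses the dictionary keys as column names.

```python
import numpy as np
import pandas as pd

# Assumed index: one day of 15-minute readings (96 timestamps).
index = pd.date_range("1997-07-07", periods=96, freq="15min")


def fake_poa(tilt, azimuth):
    """Hypothetical stand-in for calculate_poa_hyp: one Series per orientation."""
    value = np.cos(np.radians(tilt)) + np.sin(np.radians(azimuth))
    return pd.Series(value, index=index)


# Collect every column in a plain dict instead of inserting into a DataFrame.
columns = {}
for tilt in range(0, 91, 10):
    for azimuth in range(0, 316, 45):
        columns[f"AZ={azimuth}|FT={tilt}"] = fake_poa(tilt, azimuth)

# Join all columns at once; the dict keys become the column names.
df_both = pd.concat(columns, axis=1)
print(df_both.shape)
```

In the real code, the call to fake_poa would be replaced by the existing calculate_poa_hyp call; the rest of the loop stays the same, and pd.DataFrame(columns) should work equally well in place of pd.concat here.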

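A similar approach is useful when building a DataFrame row-wise (rather than column-wise as above): instead of appending rows to a DataFrame, which forces unnecessary reallocation and memory copying, accumulate the row data in a list and convert it to a DataFrame once. For example, consider these three ways of building a DataFrame from blocks of rows: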

empty_df = pd.DataFrame({i: [0]*10 for i in range(20)})

def pandas_concat(N):
    df = empty_df
    for i in range(1, N):
        df = pd.concat([df, empty_df])
    return df

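def pandas_append(N):
    # NOTE: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
    # so this variant only runs on older pandas versions.
    df = empty_df
    for i in range(1, N):
        df = df.append(empty_df)
    return df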

def list_append(N):
    lis = []
    for i in range(N):
        lis.append(empty_df)
    df = pd.concat(lis)
    return df

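Here is how the timings compare as a function of N (points are measured times, lines are the estimated asymptotic behavior). [plot: build time versus N for the three methods] So accumulating things in plain Python data structures and building the final DataFrame only once at the end gives a clear speedup.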
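
To reproduce a comparison like this on your own machine, a rough timing sketch along the following lines could be used. The block contents and the count of 200 are arbitrary choices for illustration, not the values behind the original measurements, and absolute timings will vary by machine and pandas version.

```python
import time

import pandas as pd

# Arbitrary example data: 200 one-row blocks.
blocks = [pd.DataFrame({"a": [i], "b": [i * 2]}) for i in range(200)]


def grow_in_loop(blocks):
    """Concatenate inside the loop: every step copies all rows gathered so far."""
    df = blocks[0]
    for block in blocks[1:]:
        df = pd.concat([df, block])
    return df


def concat_once(blocks):
    """Hand the whole list to pd.concat a single time."""
    return pd.concat(blocks)


for build in (grow_in_loop, concat_once):
    start = time.perf_counter()
    result = build(blocks)
    elapsed = time.perf_counter() - start
    print(f"{build.__name__}: {elapsed:.4f} s, {len(result)} rows")
```

Both functions produce the same DataFrame; only the amount of intermediate copying differs.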
