How to incorporate pd.concat() into my for-loops for much faster computing?
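[BEGIN BACKGROUND INFO]
I am studying the optimization of fixed solar panels. I have real irradiance data collected every 15 minutes over several years.
I am running a program in Python that is supposed to test different cases of solar panel tilt and azimuth in order to maximize the irradiance received from the sun. In case you are curious or do not know PV terminology:
- You can think of tilt as the angle at which the panel is "pitched" (a panel facing straight up is 0 degrees; a panel facing the horizon is 90 degrees).
- You can think of azimuth as the angle at which the panel is "turned", ranging from 0-360 degrees (North=0/360, East=90, South=180, West=270).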
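As a rough illustration of these two angles (not part of the original program), the sketch below converts a tilt/azimuth pair into the unit vector normal to the panel. The function name `panel_normal` and the (east, north, up) component order are assumptions made only for this example.

```python
import math

def panel_normal(tilt_deg, azimuth_deg):
    """Return the panel's unit normal as (east, north, up) components.

    Assumed convention, matching the description above:
    tilt 0 = facing straight up, tilt 90 = facing the horizon;
    azimuth 0 = North, 90 = East, 180 = South, 270 = West.
    """
    tilt = math.radians(tilt_deg)
    azimuth = math.radians(azimuth_deg)
    east = math.sin(tilt) * math.sin(azimuth)
    north = math.sin(tilt) * math.cos(azimuth)
    up = math.cos(tilt)
    return (east, north, up)

# A flat panel points straight up regardless of azimuth.
print(panel_normal(0, 180))
# A vertical panel with azimuth 90 points due east.
print(panel_normal(90, 90))
```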
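Below is a simple case in which I tested the percent energy gain for only one day of readings at different azimuths. In this simple example I test the azimuth at 45-degree intervals: 90 (East), 135 (Southeast), 180 (South), 225 (Southwest), and 270 (West) degrees. This code is not perfect, but it does work.
Code: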
# For the AFTERNOON THUNDERSTORM Dataset from above:
# Now, I will model the gain in energy due to transposition from GHI --> POA
#NOTE: Changing only AZIMUTH ANGLE at Fixed Tilt
df_hyp_gain_az = pd.DataFrame()

def calculate_poa_hyp(rawdata,solar_position,surface_tilt,surface_azimuth):
    poa = pvlib.irradiance.get_total_irradiance(
        surface_tilt=surface_tilt,
        surface_azimuth=surface_azimuth,
        dni=dirint_dni_hyp, # calculated from before
        ghi=df_hyp['Solar Radiation(W/m^2)'], # this is the raw data
        dhi=calculated_dhi_hyp, # calculated from before
        dni_extra=dni_et_hyp, # calculated from before
        solar_zenith=solpos_hyp['apparent_zenith'], # calculated from before
        solar_azimuth=solpos_hyp['azimuth'], # calculated from before
        surface_type='grass',
        model='haydavies')
    return poa['poa_global'] # returns the total in-plane irradiance

for azimuth in range(90,271,45): # scans from east(90) to west(270)
    # NOTE: Hardcoding Tilt=FLAT for all cases
    poa_irradiance_hyp_az = calculate_poa_hyp(
        rawdata=df_hyp,
        solar_position=solpos_hyp,
        surface_tilt=alma.latitude,
        surface_azimuth=azimuth)
    column_name_hyp_az = f"AZ-{azimuth}"
    df_hyp_gain_az[column_name_hyp_az] = poa_irradiance_hyp_az

# calculate the % difference from GHI
ghi_hyp = df_hyp['Solar Radiation(W/m^2)']
df_hyp_gain_az = 100 * (df_hyp_gain_az.divide(ghi_hyp, axis=0)-1)

plt.figure()
df_hyp_gain_az.plot().get_figure().set_facecolor('white')
plt.xlabel('Hour of Day')
plt.ylabel('Hourly Transposition Gain [%]')
plt.title('Aftn. Thunderstorm - Energy Gain From Changing Surface Azimuth',size='x-large',weight='demibold');
plt.xlim('1997-07-07 06:00:00-04:00','1997-07-07 21:00:00-04:00');
Output:
[Figure: Energy Gain From Changing Surface Azimuth]
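I also tested tilt and azimuth simultaneously, with tilt at 10-degree intervals and azimuth at 45-degree intervals. To find which combination of tilt and azimuth returns the highest irradiance, I successfully used from scipy import integrate. The two nested for-loops and the integration for each case work well, so there are no problems here:
Code: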
# Integrating over both change in AZIMUTH and TILT
df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain
for tilt in range(0,91,10):
    for azimuth in range(0,316,45):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both
df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)
df_hyp_gain_both_sec = df_hyp_gain_both
df_hyp_gain_both_sec.index = df_hyp_gain_both.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,10):
    for azimuth in range(0,316,45):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp
df_hyp_poa_sec = df_hyp_poa
df_hyp_poa_sec.index = df_hyp_poa.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))

print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')
Output:
Integrating the flat POA Irradiance [W/m^2]:
AZ=90|FT=40 2.075620e+07
AZ=90|FT=50 2.055457e+07
AZ=90|FT=30 2.048603e+07
AZ=90|FT=60 1.988685e+07
AZ=90|FT=20 1.975235e+07
...
AZ=270|FT=80 5.838635e+06
AZ=315|FT=80 5.648596e+06
AZ=225|FT=90 5.409111e+06
AZ=270|FT=90 5.291405e+06
AZ=315|FT=90 5.225395e+06
Length: 79, dtype: float64
--------------------------------------------
Integrating the Energy Gain [%]:
AZ=90|FT=50 1.218214e+06
AZ=90|FT=40 1.206950e+06
AZ=90|FT=60 1.107202e+06
AZ=90|FT=30 1.074007e+06
AZ=45|FT=40 9.597149e+05
...
AZ=315|FT=80 -2.598201e+06
AZ=180|FT=90 -2.720371e+06
AZ=225|FT=90 -2.777403e+06
AZ=270|FT=90 -2.790663e+06
AZ=315|FT=90 -2.802186e+06
Length: 79, dtype: float64
--------------------------------------------
Azimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.
The code above executes correctly in less than 10 seconds because it only covers one day of data.
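For readers who want to try the integration step without the irradiance data, here is a minimal, self-contained sketch. It assumes made-up half-sine "irradiance" columns as a stand-in for the real POA series, and it applies the trapezoidal rule directly with NumPy arithmetic rather than through `scipy.integrate`; the helper name `integrate_columns` is hypothetical. Unlike the code above, which slices with `.iloc[:, 1:]` and so appears to leave out the first column, this sketch integrates every column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one day of 15-minute POA irradiance [W/m^2].
index = pd.date_range("1997-07-07", periods=96, freq="15min")
hours = np.arange(96) / 4.0
daylight = np.clip(np.sin(np.pi * (hours - 6.0) / 12.0), 0.0, None)
poa = pd.DataFrame(
    {"AZ=90|FT=40": 800.0 * daylight, "AZ=270|FT=90": 300.0 * daylight},
    index=index,
)

def integrate_columns(frame, dx=900.0):
    """Trapezoidal rule for each column; dx is the sample spacing in seconds."""
    y = frame.fillna(0).to_numpy(dtype=float)
    totals = dx * (y[:-1] + y[1:]).sum(axis=0) / 2.0
    return pd.Series(totals, index=frame.columns)

energy = integrate_columns(poa)  # W/m^2 integrated over seconds gives J/m^2
best = energy.idxmax()           # label of the column with the largest integral

print(energy.sort_values(ascending=False))
print("best orientation:", best)
```

With 15-minute samples the spacing is 900 seconds, which is where `dx=900` comes from.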
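[END BACKGROUND INFO]
Now, on to my question: because I only analyzed tilt angles at 10-degree intervals and azimuth angles at 45-degree intervals, I wanted more accurate results. So I reduced the tilt and azimuth intervals to analyze every 1 degree.
Code (the only difference from the previous code is the changed range() arguments):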
df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain
for tilt in range(0,91,1):
    for azimuth in range(0,360,1):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both
df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)
df_hyp_gain_both_sec = df_hyp_gain_both
df_hyp_gain_both_sec.index = df_hyp_gain_both.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,1):
    for azimuth in range(0,360,1):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp
df_hyp_poa_sec = df_hyp_poa
df_hyp_poa_sec.index = df_hyp_poa.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))

print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')
Output:
C:\Users\jmand\AppData\Local\Temp\ipykernel_1057687560089.py:13: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both
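This output repeats over and over for a long time without producing any results. I tried searching this site for solutions (I found "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance" and many similar questions). I understand that I need to use pd.concat() in some shape or form for my df_hyp_gain_both DataFrame. However, I am unable to even set it up. I need to somehow use column_name_hyp_both AND poa_irradiance_hyp_both.
With proper syntax, how do I incorporate pd.concat() into my for-loop to avoid this "PerformanceWarning: DataFrame is highly fragmented" warning?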
Stripping away all the solar-specific parts, the problem basically boils down to:
import pandas as pd
df = pd.DataFrame()
dummy_data = pd.Series(0, index=pd.date_range('2019-01-01', freq='h', periods=8760))
for i in range(200): # 200 just as an example
    df[f'col_{i}'] = dummy_data.copy()
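For what it's worth, on my computer an alternative that is one or two orders of magnitude faster is to accumulate the columns in a dictionary and convert to a DataFrame only after the loop: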
results = {}
for i in range(200):
    results[f'col_{i}'] = dummy_data.copy()
df = pd.DataFrame(results)
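As a hedged sketch of how this might map onto the nested tilt/azimuth loops from the question, the example below collects each Series in a dictionary keyed by the column name and builds the frame once. The function `fake_poa` is a hypothetical placeholder for `calculate_poa_hyp()` and returns made-up values. The `pd.concat(..., axis=1, keys=...)` form that the warning suggests is shown next to the dictionary constructor; both produce the same frame here.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1997-07-07", periods=96, freq="15min")

def fake_poa(tilt, azimuth):
    """Hypothetical placeholder for calculate_poa_hyp(); returns one Series."""
    values = np.full(len(index), float(tilt + azimuth))
    return pd.Series(values, index=index)

# Accumulate plain Python objects inside the loops; no DataFrame writes here.
columns = {}
for tilt in range(0, 91, 10):
    for azimuth in range(0, 316, 45):
        columns[f"AZ={azimuth}|FT={tilt}"] = fake_poa(tilt, azimuth)

# Option A: build the frame once from the dictionary.
df_from_dict = pd.DataFrame(columns)

# Option B: join all columns at once with pd.concat(axis=1),
# passing the dictionary keys as the column names.
df_from_concat = pd.concat(list(columns.values()), axis=1, keys=list(columns.keys()))

print(df_from_dict.shape)
print(df_from_concat.shape)
```

The later steps from the question (dividing by GHI, integrating each column) can then run on the finished frame. At 1-degree resolution the two ranges give 91 × 360 = 32,760 columns, so it may also be worth integrating each Series inside the loop and storing only the resulting number, though that goes beyond what the warning itself asks for.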
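A similar approach is useful when building a DataFrame by rows (rather than by columns as above): instead of appending rows to a DataFrame, which forces unnecessary reallocation and memory copying, accumulate the row information in a list and convert it to a DataFrame once. For example, consider these three ways of building a DataFrame from chunks of rows: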
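empty_df = pd.DataFrame({i: [0]*10 for i in range(20)})

def pandas_concat(N):
    # concatenate onto the growing frame on every iteration
    df = empty_df
    for i in range(1, N):
        df = pd.concat([df, empty_df])
    return df

def pandas_append(N):
    # NOTE: DataFrame.append was removed in pandas 2.0,
    # so this variant only runs on older pandas versions
    df = empty_df
    for i in range(1, N):
        df = df.append(empty_df)
    return df

def list_append(N):
    # collect the chunks in a list and concatenate once at the end
    lis = []
    for i in range(N):
        lis.append(empty_df)
    df = pd.concat(lis)
    return df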
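Here is how the timings compare as a function of N (dots are measured timings, lines are the estimated asymptotic behavior). So there is a clear speedup from accumulating things in normal Python data structures and building the final DataFrame only once at the end.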
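To reproduce this kind of comparison for the column-wise case, a small timing harness along the following lines could be used. This is only a sketch: the function names `insert_columns` and `build_once`, the column counts, and the repeat count are arbitrary choices for illustration, and the measured times will depend on the machine and the pandas version.

```python
import timeit
import warnings

import pandas as pd

dummy = pd.Series(0.0, index=pd.date_range("2019-01-01", periods=1000, freq="h"))

def insert_columns(n):
    """Grow the frame one column at a time (the slow pattern)."""
    df = pd.DataFrame()
    for i in range(n):
        df[f"col_{i}"] = dummy.copy()
    return df

def build_once(n):
    """Collect the columns in a dict and build the frame a single time."""
    return pd.DataFrame({f"col_{i}": dummy.copy() for i in range(n)})

with warnings.catch_warnings():
    # Silence the fragmentation warning so the timings stay readable.
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    for n in (50, 100, 200):
        t_insert = timeit.timeit(lambda: insert_columns(n), number=3) / 3
        t_once = timeit.timeit(lambda: build_once(n), number=3) / 3
        print(f"n={n:4d}  insert: {t_insert:.4f} s  build once: {t_once:.4f} s")
```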