Rolling correlation across multiple columns in pandas
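I am trying to compute and visualize the rolling correlation between multiple columns over a 180-day window (3 in this example).
My data looks like this (the original file has 12 columns plus the timestamp, and thousands of rows):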
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
Timestamp Austria Belgium France
1 1993-11-01 6.18 7.05 7.69
2 1993-11-02 6.18 7.05 7.61
3 1993-11-03 6.17 7.20 7.67
4 1993-11-04 6.17 7.50 7.91
5 1993-11-15 6.40 7.60 8.61
I can't simply use the following call, because the Timestamp column causes a conversion error:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
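When I drop the Timestamp column, the result is 1.0 in nearly every cell, which is not right either, and I lose the timestamps I need for the final plot.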
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
Austria Belgium France
1 NaN NaN NaN
2 NaN NaN 1.0
3 1.0 1.0 1.0
4 -inf 1.0 1.0
5 1.0 1.0 1.0
Does anyone have experience computing rolling correlations across multiple columns while keeping a date index?
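A likely explanation for the 1.0 values, as far as I understand the pandas behavior: when another DataFrame is passed to `corr`, columns are matched by name, so every column ends up correlated with itself. Calling `corr()` with no argument computes all pairs instead. The sketch below only compares the shapes of the two results; `df_num` is a stand-in name for the frame without the Timestamp column.

```python
import pandas as pd

# Same numbers as in the question, without the Timestamp column.
df_num = pd.DataFrame(
    {
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)

# Passing another DataFrame: columns are matched by name, so each column
# is correlated with itself and the result has the same shape as the input.
matched = df_num.rolling(2).corr(df_num)

# Passing nothing: every column is paired with every other column, and the
# result gets a second row-index level holding the column names.
pairwise = df_num.rolling(2).corr()

print(matched.shape)   # (5, 3)
print(pairwise.shape)  # (15, 3)
```

The pairwise result has three rows per original row (one per column), which is why it needs to be sliced before it can be used as a single column.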
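You could compute the correlations one pair at a time like this, instead of all three at once.
Once you have the correlations, you can add them directly as columns, which keeps the timestamps.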
# .iloc[0::2, -1] takes the first row of each 2x2 correlation block and its last column,
# i.e. the correlation between the two selected columns.
# .droplevel(1) keeps the original row labels (1..5) so the result aligns with df.
df['Aus_Bel_corr'] = df[['Austria','Belgium']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].droplevel(1)
df['Bel_Fra_corr'] = df[['Belgium','France']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].droplevel(1)
df['Aus_Fra_corr'] = df[['Austria','France']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].droplevel(1)
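To make the slicing step less opaque, here is a small sketch that selects the same values by label rather than by position. It assumes the sample frame from the question; `pairwise` and `aus_bel` are names I made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03", "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)

# Pairwise rolling correlation: the result has a two-level row index
# (original row label, column name) and one column per input column.
pairwise = df[["Austria", "Belgium"]].rolling(window=3, min_periods=1).corr()

# Pick the "Austria" rows by label, then the "Belgium" column.
aus_bel = pairwise.xs("Austria", level=1)["Belgium"]

# aus_bel carries the same row labels as df, so the assignment aligns row by row
# and the Timestamp column stays next to the correlation.
df["Aus_Bel_corr"] = aus_bel
print(df[["Timestamp", "Aus_Bel_corr"]])
```

Selecting by label avoids depending on the order of the rows inside each correlation block.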
Building on the previous answer, I suggest the following. It should work for any number of columns:
import itertools as it
# omit timestamp-col
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
res = pd.DataFrame()
for pair in col_pairs:
    # select the first three letters of each name of the pair
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    # rolling pairwise correlation; keep only the value for this pair
    res[corr_name] = df[list(pair)].\
        rolling(min_periods=1, window=3).\
        corr().iloc[0::2, -1].reset_index(drop=True)
print(res)
Aus_Bel_corr Aus_Fra_corr Bel_Fra_corr
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.000000 -0.277350 0.277350
3 -0.755929 -0.654654 0.989743
4 0.693375 0.969346 0.849167
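The NaN values at the start are a result of the windowing: a single observation has no defined correlation, and in the second row Austria and Belgium are each constant over the two observations in the window.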
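Since the question asks for a window measured in days, a time-based window may be closer to the goal than a fixed number of rows. This is only a sketch: it assumes the Timestamp column is parsed to datetimes and used as the index, and that a reasonably recent pandas version is installed. The names `ts` and `aus_bel_3d` are mine, and "3D" stands in for the real "180D".

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03", "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)

# Parse the strings to datetimes and use them as the (sorted) index.
ts = df.assign(Timestamp=pd.to_datetime(df["Timestamp"])).set_index("Timestamp")

# "3D" means: all rows whose timestamp lies within the last 3 calendar days,
# however many rows that happens to be.
aus_bel_3d = ts["Austria"].rolling("3D").corr(ts["Belgium"])
print(aus_bel_3d)
```

With this sample, the row for 1993-11-15 has no other observation within three days, so its correlation is NaN, whereas a row-count window of 3 would reach back to 1993-11-03.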
I think there is another way.
df['Aus_Bel_corr'] = df['Austria']\
.rolling(min_periods = 1, window = 3)\
.corr(df['Belgium'])
To me this feels a little simpler than the previous answers.
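For more than one pair, the same Series-based call can be put in a loop. A minimal sketch, assuming the sample frame from the question and the three-letter naming used in the earlier answer; `value_cols` is a name I introduced.

```python
import itertools

import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03", "1993-11-04", "1993-11-15"],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)

# Fix the list of value columns before adding any result columns.
value_cols = [c for c in df.columns if c != "Timestamp"]

for left, right in itertools.combinations(value_cols, 2):
    name = f"{left[:3]}_{right[:3]}_corr"
    # Series.rolling(...).corr(other) returns a Series with df's own index,
    # so it can be assigned directly without any slicing or re-indexing.
    df[name] = df[left].rolling(window=3, min_periods=1).corr(df[right])

print(df)
```

Because each result keeps the original index, the Timestamp column stays in place and can be used directly as the x-axis when plotting.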