How to get the correlation matrix of a pyspark data frame? NEW 2020
I have the same question as the one asked in this topic:
"I have a big pyspark data frame. I want to get its correlation matrix. I know how to get it with a pandas data frame.But my data is too big to convert to pandas. So I need to get the result with pyspark data frame.I searched other similar questions, the answers don't work for me. Can any body help me? Thanks!"
df4 is my dataset. It has 9 columns, all of them integers:
reference__YM_unix:integer
tenure_band:integer
cei_global_band:integer
x_band:integer
y_band:integer
limit_band:integer
spend_band:integer
transactions_band:integer
spend_total:integer
I started with this step:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df4.columns, outputCol=vector_col)
df_vector = assembler.transform(df4).select(vector_col)
# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)
and got the following output:
(matrix.collect()[0]["pearson({})".format(vector_col)].values)
Out[33]: array([ 1. , 0.0760092 , 0.09051543, 0.07550633, -0.08058203,
-0.24106848, 0.08229602, -0.02975856, -0.03108094, 0.0760092 ,
1. , 0.14792512, -0.10744735, 0.29481762, -0.04490072,
-0.27454922, 0.23242408, 0.32051685, 0.09051543, 0.14792512,
1. , -0.03708623, 0.13719527, -0.01135489, 0.08706559,
0.24713638, 0.37453265, 0.07550633, -0.10744735, -0.03708623,
1. , -0.49640664, 0.01885793, 0.25877516, -0.05019079,
-0.13878844, -0.08058203, 0.29481762, 0.13719527, -0.49640664,
1. , 0.01080777, -0.42319841, 0.01229877, 0.16440178,
-0.24106848, -0.04490072, -0.01135489, 0.01885793, 0.01080777,
1. , 0.00523737, 0.01244241, 0.01811365, 0.08229602,
-0.27454922, 0.08706559, 0.25877516, -0.42319841, 0.00523737,
1. , 0.32888075, 0.21416322, -0.02975856, 0.23242408,
0.24713638, -0.05019079, 0.01229877, 0.01244241, 0.32888075,
1. , 0.53310864, -0.03108094, 0.32051685, 0.37453265,
-0.13878844, 0.16440178, 0.01811365, 0.21416322, 0.53310864,
1. ])
I tried to put this result into an array or an Excel file, but without success.
I did this:
matrix2 = (matrix.collect()[0]["pearson({})".format(vector_col)])
Then, when I tried to display it, I got the following error:
display(matrix2)
Exception: ML model display does not yet support model type <class 'pyspark.ml.linalg.DenseMatrix'>.
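I expected to be able to attach the column names from df4, but that didn't work. I read that I need to use df4.columns, but I don't know how it works.
Finally, I was hoping to plot a heatmap like the one I saw in a Medium article, but that didn't work either: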
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df4.iloc[:,range(0,7)].values)
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 cmap='coolwarm',
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients', size = 18)
plt.tight_layout()
plt.show()
AttributeError: 'DataFrame' object has no attribute 'iloc'
I tried replacing df4 with matrix2, but that didn't work.
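You can get the correlation matrix in a form you can work with as follows. Note that toArray() is a method of the DenseMatrix (matrix2 above), not of the DataFrame returned by Correlation.corr:
matrix = matrix2.toArray().tolist()
From there you can convert it to a pandas data frame with pd.DataFrame(matrix), which lets you plot the heatmap, save it to Excel, and so on.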
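To attach the column names as well, something along these lines might work. It is only a sketch: label_corr_matrix is a hypothetical helper, not part of PySpark or pandas, and the small array stands in for matrix2.toArray() so the snippet runs without a Spark session. Its values are the top-left 3x3 corner of the output shown in the question.

```python
import numpy as np
import pandas as pd


def label_corr_matrix(corr_array, columns):
    """Wrap a square correlation array in a pandas DataFrame labeled by column name.

    Hypothetical helper for illustration; not part of PySpark or pandas.
    """
    arr = np.asarray(corr_array, dtype=float)
    n = len(columns)
    if arr.shape != (n, n):
        raise ValueError(f"expected a {n}x{n} matrix, got shape {arr.shape}")
    return pd.DataFrame(arr, index=list(columns), columns=list(columns))


# With Spark, the inputs would be matrix2.toArray() and df4.columns.
# This stand-in uses the first three columns and the matching values
# from the output in the question.
example_cols = ["reference__YM_unix", "tenure_band", "cei_global_band"]
example_arr = np.array([
    [1.0, 0.0760092, 0.09051543],
    [0.0760092, 1.0, 0.14792512],
    [0.09051543, 0.14792512, 1.0],
])

corr_df = label_corr_matrix(example_arr, example_cols)
print(corr_df.round(2))
```

Because the result is an ordinary pandas data frame, corr_df.to_csv("corr.csv") or corr_df.to_excel("corr.xlsx") should write it out (to_excel needs an Excel engine such as openpyxl installed). sns.heatmap(corr_df, annot=True, fmt=".2f", cmap="coolwarm") would pick up the row and column labels as tick labels. The matrix is only 9x9, so pulling it to the driver is cheap even when df4 itself is too large for pandas.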