How to determine the optimum number of features in PCA with PySpark
With scikit-learn, we can decide how many features we want to keep based on the cumulative variance plot, as shown below.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()  # init PCA, keeping all components
pca.fit(dataset)  # fit the dataset to the PCA model
pca.explained_variance_ratio_  # this attribute shows how much variance is explained by each of the seven individual components
We can plot the cumulative value as below:
plt.figure(figsize=(10, 8))  # size of the figure
cumulativeValue = pca.explained_variance_ratio_.cumsum()  # cumulative sum of the explained variance ratios
plt.plot(range(1, 8), cumulativeValue, marker='o', linestyle='--')
Then the number of components at which the cumulative explained variance reaches roughly 80% is the optimum number of features to choose for PCA; a programmatic way to pick that point is sketched below.
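A minimal sketch of that selection step, assuming the cumulativeValue array computed above; the 0.80 threshold is just the rule of thumb used in this question, and n_components here is a hypothetical name:

import numpy as np

# smallest number of components whose cumulative explained variance reaches ~80%
# (assumes the threshold is actually reached somewhere in cumulativeValue)
n_components = int(np.argmax(cumulativeValue >= 0.80) + 1)
print(n_components)

# refit PCA keeping only that many components
pca = PCA(n_components=n_components)
reduced = pca.fit_transform(dataset)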
My question is: how do I determine the optimum number of features with PySpark?
We can determine this with the help of explainedVariance. Here is how I did it.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
import matplotlib.pyplot as plt

# use VectorAssembler to combine the input columns into a single vector column
vectorAssembler = VectorAssembler(inputCols=['inputCol1', 'inputCol2', 'inputCol3', 'inputCol4'], outputCol='pcaInput')
df = vectorAssembler.transform(dataset)  # add the assembled vector column to the dataset

pca = PCA(k=8, inputCol="pcaInput", outputCol="features")  # k is set to the total number of input features I have
pcaModel = pca.fit(df)  # fit PCA on the assembled vectors
print(pcaModel.explainedVariance)  # variance explained by each principal component
cumValues = pcaModel.explainedVariance.cumsum() # get the cumulative values
# plot the graph
plt.figure(figsize=(10,8))
plt.plot(range(1,9), cumValues, marker = 'o', linestyle='--')
plt.title('variance by components')
plt.xlabel('num of components')
plt.ylabel('cumulative explained variance')
Choose the number of components at which the cumulative explained variance reaches close to 80%.
So in this case, the optimum number of components is 2; a programmatic version of that choice is sketched below.
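A minimal sketch of the same selection step on the PySpark side, assuming the cumValues array and the assembled DataFrame df from above; best_k and reducedDf are hypothetical names:

import numpy as np

# smallest k whose cumulative explained variance reaches ~80%
# (assumes the threshold is actually reached somewhere in cumValues)
best_k = int(np.argmax(np.asarray(cumValues) >= 0.80) + 1)
print(best_k)

# refit PCA with that k and project the data onto the chosen components
pca = PCA(k=best_k, inputCol="pcaInput", outputCol="features")
reducedDf = pca.fit(df).transform(df)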