PySpark computing correlation
I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an rdd of Vectors objects. How do I convert a column such as df['some_name'] into an rdd of Vectors.dense objects?
There should be no need for that. For numerical values you can compute the correlation directly with DataFrameStatFunctions.corr:
df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
df1.stat.corr("x", "y")
# -1.0
Otherwise you can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
# In Spark 2.x+ a DataFrame no longer has flatMap, so go through .rdd first
assembler.transform(df).select("features").rdd.flatMap(lambda x: x)
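If you still want the pyspark.mllib.stat.Statistics.corr route the question asks about, the assembled column can be fed to it after converting the ml vectors back to mllib vectors. A minimal sketch, assuming Spark 2.x+ (where VectorAssembler produces pyspark.ml.linalg vectors) and a df with only numeric columns:
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.stat import Statistics

# Convert each ml DenseVector produced by VectorAssembler into an mllib vector
vectors = (assembler.transform(df).select("features")
           .rdd.map(lambda row: MLlibVectors.fromML(row.features)))
# Full Pearson correlation matrix as a numpy array (one row/column per input column)
Statistics.corr(vectors, method="pearson")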
Ok, I got it:
from pyspark.mllib.linalg import Vectors

v1 = df.rdd.flatMap(lambda x: Vectors.dense(x[col_idx_1]))
v2 = df.rdd.flatMap(lambda x: Vectors.dense(x[col_idx_2]))
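For completeness, those two RDDs can then go straight into Statistics.corr; a sketch, assuming v1 and v2 come from the snippet above:
from pyspark.mllib.stat import Statistics

# Pearson correlation between the two columns, returned as a plain float
Statistics.corr(v1, v2, method="pearson")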
df.stat.corr("column1","column2")
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# Load data with more than 50 features
newdata = spark.read.csv("sample*.csv", inferSchema=True, header=True)

assembler = VectorAssembler(inputCols=newdata.columns,
                            outputCol="features", handleInvalid="keep")
df = assembler.transform(newdata).select("features")

# The correlation comes back as a DenseMatrix
correlation = Correlation.corr(df, "features", "pearson").collect()[0][0]

# Convert the DenseMatrix into a DataFrame
rows = correlation.toArray().tolist()
df = spark.createDataFrame(rows, newdata.columns)
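The same pipeline on a small in-memory DataFrame, for anyone who wants to try it without the CSV files; a sketch, with made-up columns a, b and c:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

toy = spark.createDataFrame([(1.0, 2.0, 3.0), (2.0, 4.0, 5.0), (3.0, 6.0, 8.0)],
                            ["a", "b", "c"])
vec = VectorAssembler(inputCols=toy.columns, outputCol="features").transform(toy)
matrix = Correlation.corr(vec, "features", "pearson").collect()[0][0]
print(matrix.toArray())  # 3x3 Pearson correlation matrix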