PySpark p-values and ChiSquareTest correlations
+----------+---------------+--------------------+--------------+-------+-----------+-----------+---
|      date|  serial_number|               model|capacity_bytes|failure|smart_1_raw|smart_3_raw|...
+----------+---------------+--------------------+--------------+-------+-----------+-----------+---
|2018-09-23|       ZJV01VV0|       ST12000NM0007|12000138625024|      0|   32985096|          0|...
|2018-09-23|       ZJV01VV5|       ST12000NM0007|12000138625024|      0|   77197496|          0|...
|2018-09-23| PL2331LAH3XLZJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0|...
|2018-09-23|       ZCH0ATJY|       ST12000NM0007|12000138625024|      0|   51954552|          0|...
|2018-09-23|       ZA1816EB|        ST8000NM0055| 8001563222016|      0|  129696704|          0|...
|2018-09-23|       ZA13ZKX8|         ST8000DM002| 8001563222016|      0|   89446512|          0|...
|2018-09-23| PL2331LAHDB5PJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        442|...
|2018-09-23|       ZA1816E1|        ST8000NM0055| 8001563222016|      0|    8437320|          0|...
|2018-09-23| PL2331LAH3WM1J|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0|...
|2018-09-23|       S30108NT|         ST4000DM000| 4000787030016|      0|   11197576|          0|...
|2018-09-23|       ZJV01VVG|       ST12000NM0007|12000138625024|      0|  172268856|          0|...
|2018-09-23|       ZJV01VVM|       ST12000NM0007|12000138625024|      0|  101040904|          0|...
|2018-09-23|       ZA174KPY|        ST8000NM0055| 8001563222016|      0|   50287344|          0|...
|2018-09-23| PL2331LAH3W4XJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        530|...
|2018-09-23|       Z4D068HF|         ST6000DX000| 6001175126016|      0|   23293443|...
(output truncated: the remaining smart_*_raw columns continue to the right)
I need to compute the p-value for the correlation between the smart_194_raw column and the "failure" column. I'm not sure how to go about creating the labeled points, vectors, and so on.
Next time, please include the minimal code you have tried. That said, here is a step-by-step guide on how to get the chi-squared test and the basic statistics for your problem.
>>> from pyspark.sql import SparkSession
>>> from pyspark.ml.feature import VectorAssembler
>>> from pyspark.ml.stat import ChiSquareTest
>>> spark = SparkSession.builder.getOrCreate()
>>> # toy DataFrame: a binary label plus three numeric feature columns
>>> df = spark.sparkContext.parallelize([
[0, 1.0, 0.71, 0.143],
[1, 0.0, 0.97, 0.943],
[0, 0.123, 0.27, 0.443],
[1, 0.67, 0.3457, 0.243],
[1, 0.39, 0.7777, 0.143]
]).toDF(['label', 'col2', 'col3', 'col4'])
>>> df.show()
+-----+-----+------+-----+
|label| col2| col3| col4|
+-----+-----+------+-----+
| 0| 1.0| 0.71|0.143|
| 1| 0.0| 0.97|0.943|
| 0|0.123| 0.27|0.443|
| 1| 0.67|0.3457|0.243|
| 1| 0.39|0.7777|0.143|
+-----+-----+------+-----+
>>> assembler = VectorAssembler(
inputCols=['col2', 'col3', 'col4'],
outputCol="vector_features")
>>> vectorized_df = assembler.transform(df).select('label', 'vector_features')
>>> vectorized_df.show()
+-----+-------------------+
|label| vector_features|
+-----+-------------------+
| 0| [1.0,0.71,0.143]|
| 1| [0.0,0.97,0.943]|
| 0| [0.123,0.27,0.443]|
| 1|[0.67,0.3457,0.243]|
| 1|[0.39,0.7777,0.143]|
+-----+-------------------+
>>> r = ChiSquareTest.test(vectorized_df, "vector_features", "label").head()
>>> print("pValues: " + str(r.pValues))
>>> print("degreesOfFreedom: " + str(r.degreesOfFreedom))
>>> print("statistics: " + str(r.statistics))
pValues: [0.2872974951836462,0.2872974951836462,0.40465279495160544]
degreesOfFreedom: [4, 4, 3]
statistics: [5.0,5.0,2.916666666666667]
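To apply the same steps to your data, a minimal sketch (not part of the original answer) could assemble only smart_194_raw into the feature vector and test it against failure. It assumes the Backblaze DataFrame shown above is loaded under the hypothetical name drive_df, that smart_194_raw can be cast to double, and that rows with nulls are dropped first. Keep in mind that ChiSquareTest treats every distinct feature value as a category, which is workable for a temperature column like smart_194_raw but would not suit a high-cardinality raw counter.
>>> from pyspark.sql.functions import col
>>> # keep only the two columns of interest, cast to double, and drop nulls
>>> pair_df = drive_df.select(
        col('failure').cast('double').alias('failure'),
        col('smart_194_raw').cast('double').alias('smart_194_raw')
    ).dropna()
>>> assembler = VectorAssembler(
        inputCols=['smart_194_raw'],
        outputCol='features')
>>> test_df = assembler.transform(pair_df).select('failure', 'features')
>>> r = ChiSquareTest.test(test_df, 'features', 'failure').head()
>>> print("p-value for smart_194_raw vs failure: " + str(r.pValues[0]))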