PySpark p-values and ChiSquareTest correlations
+----------+---------------+--------------------+--------------+-------+-----------+-----------+---
|      date|  serial_number|               model|capacity_bytes|failure|smart_1_raw|smart_3_raw|...
+----------+---------------+--------------------+--------------+-------+-----------+-----------+---
|2018-09-23|       ZJV01VV0|       ST12000NM0007|12000138625024|      0|   32985096|          0|...
|2018-09-23|       ZJV01VV5|       ST12000NM0007|12000138625024|      0|   77197496|          0|...
|2018-09-23| PL2331LAH3XLZJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0|...
|2018-09-23|       ZCH0ATJY|       ST12000NM0007|12000138625024|      0|   51954552|          0|...
|2018-09-23|       ZA1816EB|        ST8000NM0055| 8001563222016|      0|  129696704|          0|...
|2018-09-23|       ZA13ZKX8|         ST8000DM002| 8001563222016|      0|   89446512|          0|...
|2018-09-23| PL2331LAHDB5PJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        442|...
|2018-09-23|       ZA1816E1|        ST8000NM0055| 8001563222016|      0|    8437320|          0|...
|2018-09-23| PL2331LAH3WM1J|HGST HMS5C4040BLE640| 4000787030016|      0|          0|          0|...
|2018-09-23|       S30108NT|         ST4000DM000| 4000787030016|      0|   11197576|          0|...
|2018-09-23|       ZJV01VVG|       ST12000NM0007|12000138625024|      0|  172268856|          0|...
|2018-09-23|       ZJV01VVM|       ST12000NM0007|12000138625024|      0|  101040904|          0|...
|2018-09-23|       ZA174KPY|        ST8000NM0055| 8001563222016|      0|   50287344|          0|...
|2018-09-23| PL2331LAH3W4XJ|HGST HMS5C4040BLE640| 4000787030016|      0|          0|        530|...
|2018-09-23|       Z4D068HF|         ST6000DX000| 6001175126016|      0|   23293443|...
(output truncated: the remaining smart_*_raw columns continue to the right)
I need to compute the p-value for the correlation between the smart_194_raw column and the "failure" column. I'm not sure how to go about creating the labeled points, vectors, and so on.
Next time, please include the minimal code you have tried. That said, here is a step-by-step guide on how to get the chi-squared test and the basic statistics for your problem.
>>> from pyspark.sql import SparkSession
>>> from pyspark.ml.feature import VectorAssembler
>>> from pyspark.ml.stat import ChiSquareTest
>>> spark = SparkSession.builder.getOrCreate()
>>> # toy DataFrame: a binary label plus three numeric feature columns
>>> df = spark.sparkContext.parallelize([
[0, 1.0, 0.71, 0.143],
[1, 0.0, 0.97, 0.943],
[0, 0.123, 0.27, 0.443],
[1, 0.67, 0.3457, 0.243],
[1, 0.39, 0.7777, 0.143]
]).toDF(['label', 'col2', 'col3', 'col4'])
>>> df.show()
+-----+-----+------+-----+
|label| col2| col3| col4|
+-----+-----+------+-----+
| 0| 1.0| 0.71|0.143|
| 1| 0.0| 0.97|0.943|
| 0|0.123| 0.27|0.443|
| 1| 0.67|0.3457|0.243|
| 1| 0.39|0.7777|0.143|
+-----+-----+------+-----+
>>> assembler = VectorAssembler(
inputCols=['col2', 'col3', 'col4'],
outputCol="vector_features")
>>> vectorized_df = assembler.transform(df).select('label', 'vector_features')
>>> vectorized_df.show()
+-----+-------------------+
|label| vector_features|
+-----+-------------------+
| 0| [1.0,0.71,0.143]|
| 1| [0.0,0.97,0.943]|
| 0| [0.123,0.27,0.443]|
| 1|[0.67,0.3457,0.243]|
| 1|[0.39,0.7777,0.143]|
+-----+-------------------+
>>> r = ChiSquareTest.test(vectorized_df, "vector_features", "label").head()
>>> print("pValues: " + str(r.pValues))
>>> print("degreesOfFreedom: " + str(r.degreesOfFreedom))
>>> print("statistics: " + str(r.statistics))
pValues: [0.2872974951836462,0.2872974951836462,0.40465279495160544]
degreesOfFreedom: [4, 4, 3]
statistics: [5.0,5.0,2.916666666666667]
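To apply the same steps to your data, a minimal sketch (not part of the original answer) could assemble only smart_194_raw into the feature vector and test it against failure. It assumes the Backblaze DataFrame shown above is loaded under the hypothetical name drive_df, that smart_194_raw can be cast to double, and that rows with nulls are dropped first. Keep in mind that ChiSquareTest treats every distinct feature value as a category, which is workable for a temperature column like smart_194_raw but would not suit a high-cardinality raw counter.
>>> from pyspark.sql.functions import col
>>> # keep only the two columns of interest, cast to double, and drop nulls
>>> pair_df = drive_df.select(
        col('failure').cast('double').alias('failure'),
        col('smart_194_raw').cast('double').alias('smart_194_raw')
    ).dropna()
>>> assembler = VectorAssembler(
        inputCols=['smart_194_raw'],
        outputCol='features')
>>> test_df = assembler.transform(pair_df).select('failure', 'features')
>>> r = ChiSquareTest.test(test_df, 'features', 'failure').head()
>>> print("p-value for smart_194_raw vs failure: " + str(r.pValues[0]))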