PySpark ChiSqSelector p-values and test statistics
I am using PySpark's pyspark.ml.feature.ChiSqSelector to perform feature selection. apps is a column of sparse vectors encoding whether a given name (a machine) has a particular application installed. In total, there are 21,615 possible applications someone might have installed.
After fitting a ChiSqSelector and transforming new data, I am confused about what selected_apps now represents. The documentation is not much help here. I have a few questions:
1) How do I get the chi-squared test statistics and p-values associated with the 21,615 input apps? They do not appear to be readily available by inspecting dir(selector).
2) Why do the apps shown in selected_apps differ from row to row? My intuition is that the machine in the second row below does not have apps 0, 1, 2, etc., so selected_apps shows that row's top 50 apps by p-value. This API seems quite different from scikit-learn's SelectKBest(chi2), which simply returns the top k most relevant features regardless of whether a particular machine has that feature set to 1.
3) How do I override the default numTopFeatures=50 setting? This mostly ties back to question 1): I would like to "forget" numTopFeatures entirely and select features using only p-values, but there does not seem to be a numTopFeatures=-1 type of option for this parameter.
>>> selector = ChiSqSelector(
... featuresCol='apps',
... outputCol='selected_apps',
... labelCol='multiple_event',
... fpr=0.05
... )
>>> result = selector.fit(df).transform(df)
>>> result.show()
+---------------+-----------+--------------+--------------------+--------------------+
| name|total_event|multiple_event| apps| selected_apps|
+---------------+-----------+--------------+--------------------+--------------------+
|000000000000021| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000022| 0| 0|(21615,[3,6,7,8,9...|(50,[3,6,7,8,9,11...|
|000000000000023| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000024| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000025| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000026| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000027| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000028| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000029| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000030| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000031| 0| 0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000032| 0| 0|(21615,[6,7,8,9,1...|(50,[6,7,8,9,13,1...|
|000000000000033| 0| 0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000034| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000035| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000036| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000037| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000038| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000039| 0| 0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000040| 0| 0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
+---------------+-----------+--------------+--------------------+--------------------+
I figured it out. Here is the solution:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics

# Convert each row to a LabeledPoint, the main input
# data structure for most of mllib
to_labeled_point = lambda x: LabeledPoint(x[0], Vectors.dense(x[1].toArray()))

obs = (
    df
    .select('multiple_event', 'apps')
    .rdd
    .map(to_labeled_point)
)

# A contingency table is constructed from the RDD of LabeledPoint and used
# to conduct the independence test. Returns an array containing the
# ChiSquaredTestResult for every feature against the label.
feature_test_results = Statistics.chiSqTest(obs)

data = []
for idx, result in enumerate(feature_test_results):
    row = {
        'feature_index': idx,
        'p_value': result.pValue,
        'statistic': result.statistic,
        'degrees_of_freedom': result.degreesOfFreedom,
    }
    data.append(row)
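From there, getting a SelectKBest-style top-k (question 2) is just a sort on p-value. A toy sketch with hypothetical per-feature results shaped like the data rows above:

```python
# Hypothetical per-feature results, same shape as the rows collected above.
data = [
    {'feature_index': 0, 'p_value': 0.90},
    {'feature_index': 1, 'p_value': 0.01},
    {'feature_index': 2, 'p_value': 0.20},
]

# Rank by p-value (smallest = strongest association with the label)
# and keep the top-k feature indices, SelectKBest-style.
k = 2
top_k = sorted(data, key=lambda row: row['p_value'])[:k]
top_k_indices = [row['feature_index'] for row in top_k]
print(top_k_indices)  # -> [1, 2]
```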