How to extract the feature importances in Sparklyr?
Consider this simple example
dtrain <- tibble(text = c("Chinese Beijing Chinese",
"Chinese Chinese Shanghai",
"Chinese Macao",
"Tokyo Japan Chinese"),
doc_id = 1:4,
class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
> dtrain_spark
# Source: table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
I can easily train a decision_tree_classifier using the following pipeline
pipeline <- ml_pipeline(
ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
ml_decision_tree_classifier(sc, label_col = "class",
features_col = "myvocab",
prediction_col = "pcol",
probability_col = "prcol",
raw_prediction_col = "rpcol")
)
model <- ml_fit(pipeline, dtrain_spark)
The problem is that I cannot extract the feature_importances in a meaningful way.
Running
> ml_stage(model, 'decision_tree_classifier')$feature_importances
[1] 0 0 1 0 0 0
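But I want the tokens! In my real-life example I have thousands of them, and a bare vector of numbers makes it hard to understand anything.
Is there any way to get the tokens back from the matrix representation above?
Thanks!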
You can easily combine the CountVectorizerModel vocabulary with feature_importances:
tibble(
token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)
# A tibble: 6 x 2
  token    importance
  <chr>         <dbl>
1 chinese           0
2 japan             1
3 shanghai          0
4 beijing           0
5 tokyo             0
6 macao             0
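With thousands of tokens, most rows in that table will likely have an importance of zero, so it may help to filter and sort the result. The following is a minimal sketch in base R, not part of the original answer. It assumes the vocabulary and the importances are aligned by position, and it uses the six values printed above as stand-in vectors so that it runs without a Spark connection; in a live session they would come from the two ml_stage() calls.

```r
# Stand-in values copied from the output above (assumption: in a real session
# these come from ml_stage(model, 'count_vectorizer')$vocabulary and
# ml_stage(model, 'decision_tree_classifier')$feature_importances).
vocab <- c("chinese", "japan", "shanghai", "beijing", "tokyo", "macao")
importance <- c(0, 1, 0, 0, 0, 0)

# The two vectors are matched by position, so their lengths must agree.
stopifnot(length(vocab) == length(importance))

ranked <- data.frame(token = vocab, importance = importance,
                     stringsAsFactors = FALSE)

# Keep only the tokens the tree actually split on, most important first.
ranked <- ranked[ranked$importance > 0, , drop = FALSE]
ranked <- ranked[order(-ranked$importance), , drop = FALSE]

print(ranked)
```

The length check is cheap insurance: if the vocabulary and the importance vector ever differ in size, pairing them by position would silently mislabel tokens.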