PySpark: Is there a way to do a .fit() and .transform() in one operation?
I'm trying to figure out how to optimize my .fit() and .transform() in PySpark.
I have:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[topic_vectorizer_A, cat_vectorizer_A,
topic_vectorizer_B, cat_vectorizer_B,
fil_top_a_vect, fil_top_b_vect,
fil_cat_a_vect, fil_cat_b_vect,
fil_ent_a_vect, fil_ent_b_vect,
assembler])
# Note that all the operations in the pipeline are transforms only.
model = pipeline.fit(cleaned)
# wait 12 hours
vectorized_df = model.transform(cleaned)
# wait another XX hours
# save to parquet.
I've seen things like this:
vectorized_df = model.fit(cleaned).transform(cleaned)
but I'm not sure whether this is the same thing, or whether it optimizes the operation in some way.
There is nothing to be done here. If
- a stage is an Estimator (like CountVectorizer), it is trained inside Pipeline.fit;
- a stage is a Transformer (like HashingTF), it is returned as-is.
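A minimal sketch of that behavior, assuming a toy DataFrame and illustrative stages (Tokenizer, CountVectorizer, HashingTF here are examples, not the stages from the question): Pipeline.fit() only trains the Estimator stages, and the chained form pipeline.fit(df).transform(df) does exactly the same work as calling fit() and then transform() separately.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, HashingTF

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark ml pipeline",), ("fit then transform",)], ["text"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")    # Transformer: nothing to fit
cv = CountVectorizer(inputCol="words", outputCol="cv_vec")   # Estimator: learns a vocabulary in fit()
htf = HashingTF(inputCol="words", outputCol="htf_vec")       # Transformer: returned unchanged by fit()

pipeline = Pipeline(stages=[tokenizer, cv, htf])

# Two-step form: fit() trains only the Estimator stages (here, CountVectorizer),
# then transform() runs every stage over the data.
model = pipeline.fit(df)
out_a = model.transform(df)

# Chained form: identical work, just without keeping the PipelineModel in a variable.
out_b = pipeline.fit(df).transform(df)

# The fitted PipelineModel holds a CountVectorizerModel for the Estimator stage and
# the original Transformer instances for the rest.
print([type(s).__name__ for s in model.stages])
# ['Tokenizer', 'CountVectorizerModel', 'HashingTF']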