MLlib: How does RFormula.fit() work?

Question

One way to create a model with Spark MLlib is the RFormula module from pyspark.ml.feature, as described in the docs. However, I can't find any explanation of how fit is actually implemented in this case. Does it use a squared loss function or something else? Where can I find this kind of information? The source is really hard to understand...

Answer 1

As Anoop Toffy mentioned in the comments, you can find a nice little tutorial here. Quoting the tutorial:

The fit() step determines the mapping of categorical feature values to vector indices in the output, so that the fitted RFormula can be used across different datasets.

>>> formula = RFormula(formula="ArrDelay ~ DepDelay + Distance + aircraft_type")
>>> formula.fit(training).transform(training).show()
+--------------+---------+---------+---------+--------------------+------+
| aircraft_type| Distance| DepDelay| ArrDelay|            features| label|
+--------------+---------+---------+---------+--------------------+------+
|       Balloon|       23|       18|       20| [0.0,0.0,23.0,18.0]|  20.0|
|  Multi-Engine|      815|        2|       -2| [0.0,1.0,815.0,2.0]|  -2.0|
| Single-Engine|      174|        0|        1| [1.0,0.0,174.0,0.0]|   1.0|
+--------------+---------+---------+---------+--------------------+------+
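
To make the quoted description concrete, here is a minimal pure-Python sketch of the fit/transform split. It is not Spark's implementation: the function names fit_category_index and encode are hypothetical, and the ordering rule (most frequent category first, last category encoded as all zeros) is an assumption chosen to reproduce the encoding in the table above. Note that nothing in it minimizes a loss; the fitting step only learns an encoding. A loss such as squared error only comes into play when the resulting features and label columns are passed to a separate estimator, for example LinearRegression.

```python
from collections import Counter


def fit_category_index(values):
    """Learn a category -> position mapping from the training values.

    Assumption: categories are ordered by descending frequency
    (ties broken alphabetically so the result is deterministic).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def encode(value, index):
    """One-hot encode a value with a previously fitted mapping.

    The last category in the ordering gets no slot of its own and is
    represented by the all-zeros vector. Categories that were not seen
    during fitting raise KeyError in this sketch.
    """
    position = index[value]
    vec = [0.0] * (len(index) - 1)
    if position < len(vec):
        vec[position] = 1.0
    return vec


# Hypothetical training column: the mapping is learned here, once.
training = [
    "Single-Engine", "Single-Engine", "Single-Engine",
    "Multi-Engine", "Multi-Engine",
    "Balloon",
]
index = fit_category_index(training)

# The same fitted mapping can then be applied to any other dataset.
new_data = ["Balloon", "Multi-Engine", "Single-Engine"]
encoded = [encode(v, index) for v in new_data]
print(encoded)
```

Because the mapping is fixed at fit time, a category always lands in the same vector position no matter which dataset is transformed later, which is the property the tutorial quote describes.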
