MLlib: How does RFormula.fit() work?

One way to create a model with Spark MLlib is to use RFormula from the pyspark.ml.feature module, as described in the docs. However, I can't find any explanation of how fit is actually implemented in this case. Does it use a squared loss function or something else? Where can I find this kind of information? The source is really hard to understand...

As Anoop Toffy mentioned in the comments, you can find a nice little tutorial here. Quoting the tutorial:

The fit() step determines the mapping of categorical feature values to vector indices in the output, so that the fitted RFormula can be used across different datasets.

>>> formula = RFormula(formula="ArrDelay ~ DepDelay + Distance + aircraft_type")
>>> formula.fit(training).transform(training).show()
+--------------+---------+---------+---------+--------------------+------+
| aircraft_type| Distance| DepDelay| ArrDelay|            features| label|
+--------------+---------+---------+---------+--------------------+------+
|       Balloon|       23|       18|       20| [0.0,0.0,23.0,18.0]|  20.0|
|  Multi-Engine|      815|        2|       -2| [0.0,1.0,815.0,2.0]|  -2.0|
| Single-Engine|      174|        0|        1| [1.0,0.0,174.0,0.0]|   1.0|
+--------------+---------+---------+---------+--------------------+------+
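
To make the quoted point concrete, here is a minimal plain-Python sketch of what "fitting" means for this kind of transformer: it only learns a mapping from category values to vector positions, and no loss function is minimized. This is not Spark's implementation. The helper names fit_category_index and transform, the tie-breaking rule, the handling of unseen categories, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def fit_category_index(rows, column):
    """Learn a category -> vector index mapping. No loss function is involved."""
    counts = Counter(row[column] for row in rows)
    # Most frequent category first; ties broken alphabetically
    # (a choice made for this sketch).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    # The last category is dropped and acts as the all-zeros reference level.
    return {category: i for i, category in enumerate(ordered[:-1])}


def transform(rows, index, numeric_columns, categorical_column, label_column):
    """Apply a previously fitted mapping to build (features, label) pairs."""
    out = []
    for row in rows:
        one_hot = [0.0] * len(index)
        position = index.get(row[categorical_column])
        # Categories missing from the mapping stay all zeros
        # (a simplification in this sketch).
        if position is not None:
            one_hot[position] = 1.0
        numeric = [float(row[c]) for c in numeric_columns]
        out.append((one_hot + numeric, float(row[label_column])))
    return out


# Made-up training rows, used only to illustrate the idea.
training = [
    {"aircraft_type": "Single-Engine", "Distance": 174, "DepDelay": 0, "ArrDelay": 1},
    {"aircraft_type": "Single-Engine", "Distance": 300, "DepDelay": 5, "ArrDelay": 4},
    {"aircraft_type": "Single-Engine", "Distance": 90, "DepDelay": 1, "ArrDelay": 0},
    {"aircraft_type": "Multi-Engine", "Distance": 815, "DepDelay": 2, "ArrDelay": -2},
    {"aircraft_type": "Multi-Engine", "Distance": 640, "DepDelay": 7, "ArrDelay": 9},
    {"aircraft_type": "Balloon", "Distance": 23, "DepDelay": 18, "ArrDelay": 20},
]

# "fit": learn the mapping once from the training data.
index = fit_category_index(training, "aircraft_type")

# "transform": reuse the same mapping on a different dataset.
other = [
    {"aircraft_type": "Multi-Engine", "Distance": 500, "DepDelay": 3, "ArrDelay": 2},
    {"aircraft_type": "Balloon", "Distance": 10, "DepDelay": 0, "ArrDelay": 0},
]
encoded = transform(other, index, ["Distance", "DepDelay"], "aircraft_type", "ArrDelay")
for features, label in encoded:
    print(features, label)
```

Because the mapping is learned once and then reused, the same category always lands in the same vector position, whichever dataset is transformed. Any actual model fitting (and therefore any loss function) would belong to the estimator that is later trained on the features and label columns, not to this encoding step.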