How do ml.feature.StringIndexer and IndexToString work?

Judging from the official documentation, the two don't talk to each other, yet they can work together.

from pyspark.ml.feature import StringIndexer, IndexToString

# sqlCtx is assumed to be an existing SQLContext
df = sqlCtx.createDataFrame(
    [(0, "a", 's', ['dance']), (1, "b", 'b', ['sing']), (2, "c", 'a', ['dance', 'sing']),
     (3, "a", 'x', []), (4, "a", 'xx', ['football']), (5, "c", 'w', ['dance'])],
    ["id", "rand_one", 'rand_two', 'hobbies'])
indexer_one = StringIndexer(inputCol='rand_one', outputCol='one')
indexer_two = StringIndexer(inputCol='rand_two', outputCol='two')
transformed_one = indexer_one.fit(df).transform(df)
transformed_two = indexer_two.fit(transformed_one).transform(transformed_one)
# No labels are passed to either IndexToString instance
get_back_one = IndexToString(inputCol='one', outputCol='origin_one')
get_back_two = IndexToString(inputCol='two', outputCol='origin_two')
magic_back = get_back_two.transform(transformed_two)

How does this happen? Only indexer_one holds the mapping information, whereas get_back_one is never assigned any mapping values.
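The mapping travels with the data rather than with either object: StringIndexer writes its labels into the metadata of the output column, and IndexToString reads them from there when it is given no labels of its own. The toy model below sketches that idea in plain Python. It is not Spark's implementation: the Column class, the "labels" metadata key, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


class Column:
    """Toy column: values plus a metadata dict (hypothetical, not a Spark class)."""

    def __init__(self, values, metadata=None):
        self.values = list(values)
        self.metadata = metadata or {}


def string_index(col):
    """Mimic StringIndexer: the most frequent label gets index 0.0."""
    counts = Counter(col.values)
    # Descending frequency; the alphabetical tie-break is an assumption.
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    lookup = {label: float(i) for i, label in enumerate(labels)}
    # The label list is attached to the output column, not kept in the caller.
    return Column([lookup[v] for v in col.values], metadata={"labels": labels})


def index_to_string(col):
    """Mimic IndexToString without explicit labels: read them from metadata."""
    if "labels" not in col.metadata:
        raise TypeError("column carries no label metadata")
    labels = col.metadata["labels"]
    return Column([labels[int(i)] for i in col.values])


rand_one = Column(["a", "b", "c", "a", "a", "c"])
one = string_index(rand_one)
print(one.values)                   # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
print(index_to_string(one).values)  # ['a', 'b', 'c', 'a', 'a', 'c']

# A copy that keeps the numbers but drops the metadata cannot be decoded.
stripped = Column(one.values)
try:
    index_to_string(stripped)
except TypeError as err:
    print("failed:", err)
```

In this model the two functions never reference each other; the column is the only thing they share, which is why decoding breaks as soon as the metadata is dropped.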

According to the documentation:

"...we are able to retrieve our original labels (they will be inferred from the columns’ metadata)."

So the mapping does exist: it lives in the column's metadata. That is why, if we create a copy of the column transformed_two['two']:

transformed_two = transformed_two.withColumn('two_test', transformed_two['two'].cast('double'))

and then try to run IndexToString on it:

get_back_two = IndexToString(inputCol='two_test', outputCol='origin_two')
magic_back = get_back_two.transform(transformed_two)

we get the following error:

java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute