How do ml.feature.StringIndexer and IndexToString work?
Going by the official documentation, the two do not talk to each other, yet they work together.
from pyspark.ml.feature import StringIndexer, IndexToString
df = sqlCtx.createDataFrame(
    [(0, "a", 's', ['dance']), (1, "b", 'b', ['sing']), (2, "c", 'a', ['dance', 'sing']),
     (3, "a", 'x', []), (4, "a", 'xx', ['football']), (5, "c", 'w', ['dance'])],
    ["id", "rand_one", 'rand_two', 'hobbies'])
# Index each string column; every fit() learns its own label-to-index mapping.
indexer_one = StringIndexer(inputCol='rand_one', outputCol='one')
indexer_two = StringIndexer(inputCol='rand_two', outputCol='two')
transformed_one = indexer_one.fit(df).transform(df)
transformed_two = indexer_two.fit(transformed_one).transform(transformed_one)
# Neither IndexToString instance is handed any labels.
get_back_one = IndexToString(inputCol='one', outputCol='origin_one')
get_back_two = IndexToString(inputCol='two', outputCol='origin_two')
magic_back = get_back_two.transform(transformed_two)
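How does this happen? Only the fitted indexers hold the mapping information, and nothing was ever assigned to get_back_one or get_back_two.
According to the documentation:
"...we are able to retrieve our original labels (they will be inferred from the columns’ metadata)."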
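To make the quoted sentence concrete, here is a toy model in plain Python, not Spark's implementation. The `Column` class and both functions are made up for illustration. The metadata keys imitate what Spark appears to store, and the alphabetical tie-break is an assumption of this sketch. The point is that the indexer writes its labels into the output column's metadata, and the reverse step needs nothing except that column.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Column:
    """Toy stand-in for a DataFrame column: values plus a metadata dict."""
    values: list
    metadata: dict = field(default_factory=dict)


def string_indexer(col):
    """Fit and transform in one step: the most frequent label gets index 0.0."""
    counts = Counter(col.values)
    # Descending frequency; ties broken alphabetically (assumption of this sketch).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    index = {label: float(i) for i, label in enumerate(labels)}
    # The learned labels travel with the output column as metadata.
    meta = {"ml_attr": {"type": "nominal", "vals": labels}}
    return Column([index[v] for v in col.values], meta)


def index_to_string(col):
    """Holds no labels of its own: reads them from the input column's metadata."""
    labels = col.metadata["ml_attr"]["vals"]
    return Column([labels[int(i)] for i in col.values])


rand_one = Column(["a", "b", "c", "a", "a", "c"])
one = string_indexer(rand_one)
origin_one = index_to_string(one)
print(one.values, one.metadata)
print(origin_one.values)
```

In PySpark itself, the metadata attached to a column can be inspected with `transformed_two.schema['two'].metadata`.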
So the mapping does exist, which is why, if we create a copy of the column transformed_two['two']:
transformed_two = transformed_two.withColumn('two_test', transformed_two['two'].cast('double'))
and then try to run IndexToString on it:
get_back_two = IndexToString(inputCol='two_test', outputCol='origin_two')
magic_back = get_back_two.transform(transformed_two)
we get the following error:
java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
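The failure can be sketched the same way in plain Python. This is a simplified model under stated assumptions, not Spark's code. The function, the `TypeError`, and the label order are all hypothetical. The copy keeps the index values but drops the metadata, so the lookup has nothing to read unless the labels are passed in explicitly.

```python
def index_to_string(values, metadata, labels=None):
    """Explicit labels win; otherwise fall back to the column's metadata."""
    if labels is None:
        attr = metadata.get("ml_attr")
        if attr is None or attr.get("type") != "nominal":
            # Stands in for the ClassCastException seen in the JVM.
            raise TypeError("input column carries no nominal attribute")
        labels = attr["vals"]
    return [labels[int(i)] for i in values]


# Hypothetical label order for rand_two (every value occurs exactly once).
labels = ["a", "b", "s", "w", "x", "xx"]
two = [2.0, 1.0, 0.0, 4.0, 5.0, 3.0]
two_meta = {"ml_attr": {"type": "nominal", "vals": labels}}

# The copy keeps the numbers but not the metadata.
two_test, two_test_meta = list(two), {}

try:
    index_to_string(two_test, two_test_meta)
    failed = False
except TypeError:
    failed = True

# Supplying the labels explicitly makes the metadata unnecessary.
recovered = index_to_string(two_test, two_test_meta, labels=labels)
print(failed, recovered)
```

In PySpark, IndexToString accepts a `labels` parameter and a fitted StringIndexerModel exposes `labels`. If the fitted model is kept in a variable (say `model_two`, a name not in the original code), then `IndexToString(inputCol='two_test', outputCol='origin_two', labels=model_two.labels)` should work on the copied column.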