Spark: StringIndexer on sentences
I am trying to run a StringIndexer over a column of sentences, i.e. transform a list of words into a list of integers.
For example:
Input dataset:
(1, ["I", "like", "Spark"])
(2, ["I", "hate", "Spark"])
My expected output after the StringIndexer would be something like:
(1, [0, 2, 1])
(2, [0, 3, 1])
Ideally, I would like to do this transformation as part of a Pipeline, so that I can chain transformers together and serialize it for online serving.
Is this something that Spark supports natively?
Thanks!
The standard Transformers used for converting text to features are CountVectorizer:

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts.

or HashingTF:

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Both have a binary option which can be used to switch from counts to a binary vector.
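For reference, a minimal sketch of HashingTF with the binary option (the column names and the numFeatures value here are my assumptions, not from the original):

import org.apache.spark.ml.feature.HashingTF

// Hash each word to a column of a fixed-size vector. A power of two
// for numFeatures keeps the modulo mapping even across columns.
val hashingTF = new HashingTF()
  .setInputCol("words")   // assumed input column of Array[String]
  .setOutputCol("vector")
  .setNumFeatures(1024)   // power of two, as the docs advise
  .setBinary(true)        // 1.0 for presence instead of term counts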
There is no built-in Transformer that can give exactly the result you want (and it wouldn't be useful for ML algorithms), but you can explode the words, apply StringIndexer, and collect_list / collect_set. Note that collect_set deduplicates and does not guarantee the original word order, so the output below won't match your example exactly:
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

val df = Seq(
  (1, Array("I", "like", "Spark")), (2, Array("I", "hate", "Spark"))
).toDF("id", "words")

val pipeline = new Pipeline().setStages(Array(
  // One row per (id, word) pair
  new SQLTransformer()
    .setStatement("SELECT id, explode(words) AS word FROM __THIS__"),
  // Map each word to a numeric index
  new StringIndexer().setInputCol("word").setOutputCol("index"),
  // Gather the indices back into one array per id
  new SQLTransformer()
    .setStatement("""SELECT id, COLLECT_SET(index) AS values
                     FROM __THIS__ GROUP BY id""")
))
pipeline.fit(df).transform(df).show
// +---+---------------+
// | id| values|
// +---+---------------+
// | 1|[0.0, 1.0, 3.0]|
// | 2|[2.0, 0.0, 1.0]|
// +---+---------------+
The same with CountVectorizer and a udf:
import org.apache.spark.ml.linalg._

// Indices of the non-zero entries, i.e. the vocabulary positions of
// the words present in each document
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)

val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("words").setOutputCol("vector"),
  new SQLTransformer()
    .setStatement("SELECT *, indices(vector) FROM __THIS__")
))
pipeline.fit(df).transform(df).show
// +---+----------------+--------------------+-------------------+
// | id| words| vector|UDF:indices(vector)|
// +---+----------------+--------------------+-------------------+
// | 1|[I, like, Spark]|(4,[0,1,3],[1.0,1...| [0, 1, 3]|
// | 2|[I, hate, Spark]|(4,[0,1,2],[1.0,1...| [0, 1, 2]|
// +---+----------------+--------------------+-------------------+
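Since you asked about serialization for online serving: a fitted PipelineModel can be saved and reloaded with the standard ML persistence API. A minimal sketch (the path is just a placeholder); note that a UDF registered via spark.udf.register is not stored with the pipeline, so the serving application has to register it again before calling transform:

import org.apache.spark.ml.PipelineModel

val model = pipeline.fit(df)
model.write.overwrite().save("/tmp/indexer-pipeline")  // placeholder path

// In the serving application: re-register any UDFs used by
// SQLTransformer stages, then load and apply the model.
spark.udf.register("indices", (v: Vector) => v.toSparse.indices)
val loaded = PipelineModel.load("/tmp/indexer-pipeline")
loaded.transform(df).show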