Perform NGram on Spark DataFrame
I am using Spark 2.3.1 and I have a Spark DataFrame like this:
+----------+
|    values|
+----------+
|embodiment|
|   present|
| invention|
|   include|
|   pairing|
|       two|
|  wireless|
|    device|
|   placing|
|     least|
|       one|
|       two|
+----------+
I want to apply the Spark ML NGram feature like this:
bigram = NGram(n=2, inputCol="values", outputCol="bigrams")
bigramDataFrame = bigram.transform(tokenized_df)
The line bigramDataFrame = bigram.transform(tokenized_df) fails with the following error:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
So I changed the code to:
from pyspark.ml.feature import NGram
from pyspark.sql.functions import array
df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))
bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show()
This gives me the following final DataFrame:
+----------+------------+-------+
|    values|     testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]|     []|
|   present|   [present]|     []|
| invention| [invention]|     []|
|   include|   [include]|     []|
|   pairing|   [pairing]|     []|
|       two|       [two]|     []|
|  wireless|  [wireless]|     []|
|    device|    [device]|     []|
|   placing|   [placing]|     []|
|     least|     [least]|     []|
|       one|       [one]|     []|
|       two|       [two]|     []|
+----------+------------+-------+
Why are the values in my bigrams column empty?
I expect the bigrams column to look like this:
+------------------+
|bigrams           |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |
+------------------+
Your bigrams column is empty because no row of your 'values' column contains a bi-gram.
If the values in your input DataFrame looked like this:
+--------------------------------------------+
|values                                      |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing                 |
|least one two                               |
+--------------------------------------------+
then you would get the bi-grams as output, like this:
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values                                      |testing                                           |ngrams                                                                     |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing                 |[two, wireless, device, placing]                  |[two wireless, wireless device, device placing]                            |
|least one two                               |[least, one, two]                                 |[least one, one two]                                                       |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
The Scala Spark code to do this is:
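import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.split
// Split each space-separated string into an array of tokens, then build bigrams
val df_new = df.withColumn("testing", split(df("values"), " "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)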
A bi-gram is a sequence of two adjacent elements from a string of
tokens, which are typically letters, syllables, or words.
But in your input DataFrame each row contains only one token, so you cannot get any bi-grams out of it.
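To make this concrete, here is a minimal pure-Python sketch of the sliding window behind a bi-gram. It is not Spark's implementation, and the helper name `ngrams` is made up for illustration; it only mirrors the behaviour described here, where a row with fewer than n tokens yields an empty result.

```python
def ngrams(tokens, n=2):
    """Return the space-joined n-grams of a token list (hypothetical helper)."""
    # A window of size n fits len(tokens) - n + 1 times; clamp at zero.
    count = max(len(tokens) - n + 1, 0)
    return [" ".join(tokens[i:i + n]) for i in range(count)]


# One token per row, as in the question: no window of size 2 fits.
single = ngrams(["embodiment"])
# Several tokens in one row: adjacent pairs are produced.
several = ngrams(["least", "one", "two"])

print(single)   # []
print(several)  # ['least one', 'one two']
```

Wrapping each word in its own one-element array therefore cannot work; the tokens have to sit in the same array before the transformer sees them.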
So, for your question, you can do this:
Input: df1
+----------+
|values    |
+----------+
|embodiment|
|present   |
|invention |
|include   |
|pairing   |
|two       |
|wireless  |
|device    |
|placing   |
|least     |
|one       |
|two       |
+----------+
Output: ngramDataFrameInRows
+------------------+
|ngrams            |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |
+------------------+
Spark Scala code:
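import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, collect_list, explode}
// Collect all single-word rows into one array so that adjacent words can be paired
val df_new = df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
// Explode the array of bigrams back into one bigram per row
val ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))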