Perform NGram on Spark DataFrame

I am using Spark 2.3.1 and I have a Spark DataFrame like this:

|    values|
|embodiment|
|   present|
| invention|
|   include|
|   pairing|
|       two|
|  wireless|
|    device|
|   placing|
|     least|
|       one|
|       two|

I want to apply the Spark ML NGram feature like this:

bigram = NGram(n=2, inputCol="values", outputCol="bigrams")

bigramDataFrame = bigram.transform(tokenized_df)

The following error occurs on the line bigramDataFrame = bigram.transform(tokenized_df):

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'


So I changed my code to wrap the column in an array:

df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))

bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")

bigramDataFrame = bigram.transform(df_new)


This gives me the following final DataFrame:

|    values|     testing|bigrams|
|embodiment|[embodiment]|     []|
|   present|   [present]|     []|
| invention| [invention]|     []|
|   include|   [include]|     []|
|   pairing|   [pairing]|     []|
|       two|       [two]|     []|
|  wireless|  [wireless]|     []|
|    device|    [device]|     []|
|   placing|   [placing]|     []|
|     least|     [least]|     []|
|       one|       [one]|     []|
|       two|       [two]|     []|

Why is my bigrams column empty?

I want the bigrams column to look like this:

|   bigrams|
|embodiment present  |
|present invention   |
|invention include   |
|include pairing     |
|pairing two         |
|two wireless        |
|wireless device     |
|device placing      |
|placing least       |
|least one           |
|one two             |

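Your bigrams column is empty because no row of the 'values' column contains a bi-gram.

If the values in your input DataFrame looked like this, each row would hold several tokens: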


|values                                      |
|embodiment present invention include pairing|
|two wireless device placing                 |
|least one two                               |


|values                                      |testing                                           |ngrams                                                                     |
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing                 |[two, wireless, device, placing]                  |[two wireless, wireless device, device placing]                            |
|least one two                               |[least, one, two]                                 |[least one, one two]                                                       |

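The Scala Spark code to do this is:

import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.split
// Split each row into an array of tokens, then build bi-grams within each row
val df_new = df.withColumn("testing", split(df("values")," "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)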

A bi-gram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.

In your input DataFrame, however, each row holds only one token, so you will not get any bi-grams from it.
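To make the point concrete, here is a minimal pure-Python sketch of the sliding window that an n-gram transform applies to one row's token list. The helper name `ngrams` is hypothetical and is not part of Spark. It only mimics the behaviour described above: tokens are joined with a space, and a list shorter than n yields nothing.

```python
def ngrams(tokens, n=2):
    """Return the space-joined n-grams of a token list, in order.

    Hypothetical helper for illustration only; it is not Spark's implementation.
    """
    # A window of size n fits len(tokens) - n + 1 times; zero times if too short.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


single = ngrams(["present"])                # one token per row, as in the question
several = ngrams(["least", "one", "two"])   # several tokens in one row

print(single)   # []
print(several)  # ['least one', 'one two']
```

A one-element array has no pair of adjacent tokens, which is why every row in the question ends up with an empty bigrams list.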


Input: df1
|values    |
|embodiment|
|present   |
|invention |
|include   |
|pairing   |
|two       |
|wireless  |
|device    |
|placing   |
|least     |
|one       |
|two       |

Output: ngramDataFrameInRows
|ngrams            |
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |

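Spark Scala code:

import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, collect_list, explode}
// Gather all rows into a single array (collect_list does not guarantee row order after a shuffle)
val df_new=df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
// Explode the array of bi-grams back into one row per bi-gram
val ngramDataFrameInRows=ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))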