TypeError while tokenizing a column in Spark dataframe

Question

I am trying to tokenize a 'string' column in a Spark dataset.

The Spark dataframe looks like this:

df: 
index ---> Integer 
question ---> String

This is how I am using the Spark tokenizer:

Quest = df.withColumn("question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol=Quest, outputCol="question_parts")

But I get the following error:

Invalid param value given for param "inputCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type

I also tried replacing the first line of my code with each of the following, but neither fixed the error:

Quest = df.select(concat_ws(" ",col("question")))

and

Quest= df.withColumn("question", concat_ws(" ",col("question")))

What am I doing wrong here?

Answer 1

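The error is in the second line. df.withColumn() returns a dataframe with the column you just created appended to it, but inputCol expects the name of a column as a string, not a dataframe. In the second line, inputCol="question" should give you what you need. You then need to transform your dataframe using the tokenizer.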

Try:

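from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
df = df.withColumn("Question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol="Question", outputCol="question_parts")
tokenizer.transform(df)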

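Edit:
I wasn't sure whether you intended to create a new column in the first line, so I changed the column name in the withColumn method from "question" to "Question" so that it replaces the existing column. It also looks from your data as though the column is already a string; if so, this step is not needed.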

python

dataframe

apache-spark