Pickling Error while using Stopwords from NLTK in pyspark (databricks)

I found the following function online:

from nltk.corpus import stopwords

def RemoveStops(data_str):
    #nltk.download('stopwords')
    english_stopwords = stopwords.words("english")
    broadcast(english_stopwords)
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str

Then I do the following:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

ColumntoClean = udf(lambda x: RemoveStops(x), StringType())
data = data.withColumn("CleanedText", ColumntoClean(data[TextColumn]))

The error I get is:

PicklingError: args[0] from newobj args has the wrong class

Interestingly, if I re-run the same block of code, it runs fine and doesn't throw the pickling error. Can someone help me fix this? Thanks!

Just change your function like this and it should run:

import nltk
from nltk.corpus import stopwords

# Load the stopword list once, at module scope on the driver, so only the
# plain Python list is captured in the UDF's closure
nltk.download('stopwords')
english_stopwords = stopwords.words("english")

def RemoveStops(data_str):
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
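
If you want what the original broadcast(...) call was presumably reaching for, you can also ship the set to the executors as an explicit Spark broadcast variable. A minimal sketch, assuming a SparkContext is available as sc (as it is in Databricks notebooks); stops_bc is a name I've made up:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Broadcast the set once so every executor reuses a single copy
stops_bc = sc.broadcast(set(stopwords.words("english")))

def RemoveStops(data_str):
    # Look words up in the broadcast value instead of rebuilding the set per call
    return ' '.join(w for w in data_str.split() if w not in stops_bc.value)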

Databricks is a pain when it comes to nltk. It won't let stopwords.words("english") run inside the function while the udf is being applied.
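
Alternatively, you can sidestep Python UDFs and NLTK pickling altogether with Spark's built-in StopWordsRemover, which ships with its own English stopword list. A sketch, under the assumption that the column is literally named "TextColumn" (in the question it is a variable):

from pyspark.sql.functions import split, concat_ws
from pyspark.ml.feature import StopWordsRemover

# Tokenize the text column into an array of words
data = data.withColumn("words", split(data["TextColumn"], r"\s+"))

# StopWordsRemover filters the array column natively on the JVM side
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
data = remover.transform(data)

# Reassemble the surviving words into a cleaned string
data = data.withColumn("CleanedText", concat_ws(" ", "filtered"))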