Pickling Error while using Stopwords from NLTK in pyspark (databricks)
I found the following function online:
def RemoveStops(data_str):
    # nltk.download('stopwords')
    english_stopwords = stopwords.words("english")
    broadcast(english_stopwords)
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
Then I do the following:
ColumntoClean = udf(lambda x: RemoveStops(x), StringType())
data = data.withColumn("CleanedText", ColumntoClean(data[TextColumn]))
The error I get is:
PicklingError: args[0] from newobj args has the wrong class
Interestingly, if I re-run the same block of code, it runs without throwing the pickling error. Can someone help me figure this out? Thanks!
Just change your function as follows and it should run.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
english_stopwords = stopwords.words("english")

def RemoveStops(data_str):
    # expects a string
    stops = set(english_stopwords)
    list_pos = 0
    cleaned_str = ''
    text = data_str.split()
    for word in text:
        if word not in stops:
            # rebuild cleaned_str
            if list_pos == 0:
                cleaned_str = word
            else:
                cleaned_str = cleaned_str + ' ' + word
            list_pos += 1
    return cleaned_str
Databricks is a pain when it comes to nltk. When you apply the udf, it won't let stopwords.words("english") run inside the function, so load the stopword list once at module level and only reference it from the udf.
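As an aside, the word-by-word rebuild in the function above can be written more compactly with a set and str.join. Here is a minimal, Spark-free sketch of that same logic; the stopword set is a small hardcoded sample (an assumption for illustration), whereas the answer's code takes it from nltk.corpus.stopwords:

```python
# Sample stopword set for illustration only; the real code uses
# nltk.corpus.stopwords.words("english").
SAMPLE_STOPWORDS = {"the", "a", "is", "in", "of"}

def remove_stops(data_str, stops=SAMPLE_STOPWORDS):
    """Return data_str with stopwords dropped, remaining words joined by spaces."""
    return " ".join(word for word in data_str.split() if word not in stops)

print(remove_stops("the cat is in the hat"))  # cat hat
```

Because the set lookup and join are O(1) per word, this behaves the same as the explicit loop while being easier to read.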