Pyspark in GCP: ModuleNotFoundError: No module named 'textblob'

I am using a udf function in Pyspark in a jupyter notebook on GCP. I want to use TextBlob to run sentiment analysis on text. I have already imported textblob in the notebook, and I ran the following command in my VM's terminal:

pip3 install -U textblob

When I try to run the code below,

from pyspark.sql.functions import udf
from textblob import TextBlob

# wrap TextBlob's polarity score in a udf and apply it to the 'text' column
sentiment = udf(lambda x: TextBlob(x).sentiment[0])
spark.udf.register("sentiment", sentiment)
df = df.withColumn('sentiment', sentiment('text').cast('double'))
df.show(1)

I still get the following error:

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'textblob'

I am new to GCP and cloud computing, and I don't know what is causing this problem. Could it be that I didn't install the package into the correct path?

I think this is more of a jupyter notebook issue than a GCP one. Jupyter has the %pip and %conda magics, which you can use to install python modules into the python instance that is running jupyter, rather than whatever python your terminal's pip3 points at.
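
For example, a minimal sketch of what that looks like in a notebook cell (this installs into the kernel's environment; whether the Spark executors see the package still depends on how your cluster is set up):

%pip install textblob

After the install finishes, you may need to restart the kernel so that the newly installed package can be imported.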