PySpark error at return _compile(pattern, flags).findall(string) - how to troubleshoot?

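I am trying to do sentiment analysis with word lists, counting the positive and negative words in a column of a PySpark DataFrame. The same approach successfully gives me the count of positive words, and that list has about 2k words. The negative list is roughly twice as long (about 4k words), and it fails with the error below. What is causing this, and how can I fix it?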

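I don't think the code itself is at fault, since it works for the positive words, but I can't tell whether the negative list is simply too long or whether I'm missing something else. Here is an example (not the exact lists):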

stories.show()

+--------------------+
|               words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+


neg = ['angry','sad','sorrowful','angry']


#doing some counting manipulation here
df3.show()

Error:

spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1308         answer = self.gateway_client.send_command(command)
   1309         return_value = get_return_value(
-> 1310             answer, self.gateway_client, self.target_id, self.name)
   1311 
   1312         for temp_arg in temp_args:

/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
  File "/usr/lib/python3.7/re.py", line 225, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib/python3.7/re.py", line 288, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
    p = _parse_sub(source, pattern, True, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
    source.tell() - here + len(this))
re.error: multiple repeat at position 5

Expected output:

+--------------------+--------+
|               words|Negative|
+--------------------+--------+
|tom and jerry went t|      45|
|she was angry when g|      12|
|arnold became sad at|      54|
+--------------------+--------+

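Your neg list contains characters that have a special meaning in regular expressions, so the pattern built from it is not a valid regular expression and cannot be parsed. The "multiple repeat" message means that two repetition operators appear back to back, such as "**". The length of the list is not the problem.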
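To see which entries are responsible, one option is to try compiling each word on its own and collect the failures. This is only a sketch: the helper name find_invalid_patterns and the sample words are hypothetical, since the real list is not shown in the question.

```python
import re

def find_invalid_patterns(words):
    """Return (word, error message) pairs for words that are not valid regexes."""
    bad = []
    for word in words:
        try:
            re.compile(word)
        except re.error as exc:
            # The word contains metacharacters that make the pattern unparsable.
            bad.append((word, str(exc)))
    return bad

# Hypothetical sample; the last two entries contain regex metacharacters.
sample = ["angry", "sad", "bull****", "*sigh*"]
for word, message in find_invalid_patterns(sample):
    print(word, "->", message)
```

Note that this only catches words that fail to compile. A word such as "a.b" compiles fine but no longer matches literally, so escaping every entry is still the safer fix.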

You can use re.escape() to escape the special characters in each word before joining the words into a pattern.
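A minimal sketch of that fix, in plain Python so it can be checked without a Spark session. The question does not show the body of countNegatives, so the function below, the build_pattern helper, the whole-word boundaries, and the case-insensitive matching are all assumptions rather than the original code.

```python
import re

# Hypothetical word list; the last entry contains regex metacharacters.
neg = ["angry", "sad", "sorrowful", "angry", "bull****"]

def build_pattern(words):
    # Deduplicate, then escape each word so that characters such as
    # * + ? ( ) are matched literally. Longer words go first so they
    # are tried before any shorter word that is a prefix of them.
    unique = sorted(set(words), key=len, reverse=True)
    escaped = [re.escape(w) for w in unique]
    # Assumption: only whole words should count, so each match must not
    # be adjacent to another word character.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)

NEG_PATTERN = build_pattern(neg)

def count_negatives(text):
    # Null column values arrive as None inside a Python UDF.
    if text is None:
        return 0
    return len(NEG_PATTERN.findall(text))
```

In PySpark this function could then be wrapped with udf(count_negatives, IntegerType()) from pyspark.sql.functions and pyspark.sql.types, and applied with stories.withColumn("Negative", ...) on the words column. Compiling the pattern once at module level also avoids rebuilding a 4k-word alternation for every row.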