PySpark error with return _compile(pattern, flags).findall(string) - how to troubleshoot?
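I am trying to do sentiment analysis with word lists, counting the positive and negative words in a PySpark DataFrame column. The same method successfully counts the positive words, and that list has roughly 2k entries. The negative list is about twice as long (~4k words), and counting with it fails. What could be causing this, and how can I fix it?
I don't think the code is at fault, since it works for the positive words, but I can't tell whether the negative list is simply too long to search for or whether I am missing something else. Here is an example (not the exact list):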
stories.show()
+--------------------+
|               words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+
neg = ['angry','sad','sorrowful','angry']
#doing some counting manipulation here
df3.show()
Error:
spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308 answer = self.gateway_client.send_command(command)
1309 return_value = get_return_value(
-> 1310 answer, self.gateway_client, self.target_id, self.name)
1311
1312 for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
File "/usr/lib/python3.7/re.py", line 288, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
p = _parse_sub(source, pattern, True, 0)
File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
source.tell() - here + len(this))
re.error: multiple repeat at position 5
Expected output:
+--------------------+--------+
|               words|Negative|
+--------------------+--------+
|tom and jerry went t|      45|
|she was angry when g|      12|
|arnold became sad at|      54|
+--------------------+--------+
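Your neg list contains characters that have special meaning in regular expressions, so the pattern built from it cannot be parsed. The "multiple repeat" message means a quantifier such as * or + directly follows another quantifier, which happens when a word in the list contains those characters literally. List length is not the cause.
You can use the re.escape() function to escape the special characters in each word before building the pattern.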
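As a minimal sketch of that fix: the question does not show the body of countNegatives, so the code below assumes it joins the words with "|" and calls findall on the result. The names build_pattern and count_negatives are hypothetical, and the "sad**" entry stands in for whichever real entry contains regex metacharacters.

```python
import re

# Hypothetical word list; "sad**" stands in for an entry in the real
# list that contains regex metacharacters.
neg = ["angry", "sad**", "sorrowful", "angry"]


def build_pattern(words):
    """Compile one alternation that matches every word literally."""
    # Drop duplicates and put longer words first so that a short word
    # cannot shadow a longer one that starts the same way.
    unique = sorted(set(words), key=len, reverse=True)
    # re.escape() turns characters such as * + ? ( ) into literals.
    return re.compile("|".join(re.escape(w) for w in unique))


def count_negatives(text, pattern):
    """Return how many times any listed word occurs in text."""
    return len(pattern.findall(text))


# Joining the raw words reproduces the failure: "**" is a repeat of a repeat.
try:
    re.compile("|".join(neg))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

pattern = build_pattern(neg)
total = count_negatives("she was angry and sad** then angry", pattern)
```

Inside Spark, a function like count_negatives could be wrapped with pyspark.sql.functions.udf and applied to the words column. Note that this sketch matches substrings, so "sad" would also count inside "sadly"; whether to add word boundaries depends on how the original counting is meant to work.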