Stanford Word Segmenter for Chinese in Python: how to return results without punctuation

Question

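I'm trying to segment Chinese sentences with the Stanford Word Segmenter in Python, but the results currently include punctuation. I want to return results without punctuation, only the words. What is the best way to do this? I tried googling for an answer but found nothing.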

Answer 1

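I think you're better off removing the punctuation after segmenting the text; I'm fairly sure the Stanford segmenter takes cues from punctuation while doing its work, so you don't want to strip it beforehand. The following works for UTF-8 text. For Chinese punctuation, use the Zhon library with a regular expression: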

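import re
import zhon.hanzi

# Character class matching any Chinese punctuation mark
h_regex = re.compile('[%s]' % zhon.hanzi.punctuation)
# Placeholder sample; substitute your own segmented text with punctuation
intxt = '我 喜欢 北京 ， 你 呢 ？'
outtxt = h_regex.sub('', intxt)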

Depending on the text you're working with, you may also need to remove non-Chinese punctuation:

import string

# Character class matching ASCII punctuation; re.escape guards characters such as ']' and '\'
p_regex = re.compile('[%s]' % re.escape(string.punctuation))
outtext2 = p_regex.sub('', outtxt)

Then you should be golden.
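Deleting punctuation characters from the segmented string can leave doubled spaces where the punctuation tokens used to be. If you want a clean list of words instead, one possible sketch is to split the segmenter output on whitespace and drop the tokens that consist only of punctuation. This is an alternative to the Zhon approach above, not part of it: it relies only on the standard library's `unicodedata` module, assumes the segmenter output is a space-separated string, and the helper names `is_punctuation` and `strip_punctuation_tokens` are made up for illustration. Note that it only catches characters in the Unicode punctuation categories (P*), so symbols such as `$` or `+` are kept.

```python
import unicodedata


def is_punctuation(token):
    """Return True if every character in the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def strip_punctuation_tokens(segmented):
    """Drop punctuation-only tokens from space-separated segmenter output."""
    return [tok for tok in segmented.split() if not is_punctuation(tok)]


# Hypothetical segmenter output, used only as sample input
sample = "我 喜欢 北京 ， 你 呢 ？"
words = strip_punctuation_tokens(sample)
print(" ".join(words))  # 我 喜欢 北京 你 呢
```

Because the check works on whole tokens, punctuation inside a word (for example a hyphenated English term) is left alone, and the same function handles both Chinese and ASCII punctuation marks.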


Tags: python, punctuation, stanford-nlp, chinese-locale