Python 从 Unicode 字符串中删除标点符号(撇号除外)
Python removing punctuation from unicode string except apostrophe
我找到了几个主题,我找到了这个解决方案:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
这应该删除除 ' 之外的所有标点符号,问题是它还删除了句子中的所有其他内容。
示例:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
当然我要的是不加标点符号,"warhol's"保持原样
期望的输出:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
编辑:
我也试过使用
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
但这会去掉每个标点符号
指定您不想删除的所有元素,即\w
、\d
、\s
等。这就是^
运算符表示在方括号中。 (匹配除了)
>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>
我找到了几个主题,我找到了这个解决方案:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
这应该删除除 ' 之外的所有标点符号,问题是它还删除了句子中的所有其他内容。
示例:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
当然我要的是不加标点符号,"warhol's"保持原样
期望的输出:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
编辑: 我也试过使用
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
但这会去掉每个标点符号
指定您不想删除的所有元素,即\w
、\d
、\s
等。这就是^
运算符表示在方括号中。 (匹配除了)
>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>