Why doesn't text.replace(punctuation, '') remove all the punctuation contained in list(punctuation)?

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p

# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'

# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()

# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)

# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()

# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)

# store individual words
words = obama_4427_str_processed.split(' ')
print(words)

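Long story short, I have a speech from President Obama and I want to remove all the punctuation so that I'm left with only the words. I've imported punctuation from the string module and run a for loop, but it doesn't remove all of my punctuation. What am I doing wrong here?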

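str.replace() searches for the entire value of its first argument. It is not a pattern, so the text is only replaced with an empty string where the whole string.punctuation value appears verbatim. Your loop passes p (the full punctuation string) rather than punct (the single character), and each iteration starts again from the original obama_4427_str, so nothing is ever removed.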
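A minimal sketch of that behaviour, written for Python 3 and using a made-up sample sentence in place of the speech text (the variable names here are illustrative, not from the question):

```python
from string import punctuation

sample = "hello, world! it's 2008."  # hypothetical stand-in for the speech text

# str.replace treats its first argument as one literal substring.
# The full punctuation string never occurs in the sample, so nothing changes.
unchanged = sample.replace(punctuation, "")

# A single punctuation character, by contrast, is found and removed.
no_commas = sample.replace(",", "")

print(unchanged == sample)
print(no_commas)
```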

Use a regular expression instead:

import re
from string import punctuation as p

punctuation = re.compile('[{}]+'.format(re.escape(p)))

obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()

Note that you can call str.split() without arguments to split on whitespace of arbitrary width, including newlines.
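To illustrate the difference between the two split calls, here is a small sketch on a made-up string (not the speech itself) that contains a double space, a newline, and a tab:

```python
text = "yes  we\ncan\t change"  # hypothetical sample with mixed whitespace

# Splitting on a single space leaves an empty string for the double space
# and keeps the newline and tab inside the pieces.
by_space = text.split(" ")

# Splitting with no argument collapses any run of whitespace.
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
```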

If you only need to remove the punctuation at the end of each word, you can rstrip it off:

obama_4427_str = obama_4427_div.text.lower()

# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])

Output:

['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great', 
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow', 
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound', 
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your', 
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
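One caveat worth checking on your own data: rstrip only trims the right-hand end of each word, so leading quotes or brackets survive. The sketch below, on a made-up phrase, compares it with str.strip, which trims both ends while still leaving internal apostrophes alone:

```python
from string import punctuation

# Hypothetical sample with leading, trailing, and internal punctuation.
words = '"yes, we can!" (don\'t stop)'.split()

right_only = [w.rstrip(punctuation) for w in words]
both_ends = [w.strip(punctuation) for w in words]

print(right_only)
print(both_ends)
```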

To remove punctuation from anywhere in the string in Python 3, use str.translate:

from string import punctuation
tbl = str.maketrans({ord(ch):"" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
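As an equivalent alternative in Python 3, str.maketrans also accepts three arguments, where every character in the third is mapped to None and therefore deleted. A short sketch on a made-up sentence rather than the speech:

```python
from string import punctuation

# Third argument: characters to delete during translation.
table = str.maketrans("", "", punctuation)

sample = "hello, world! it's 2008."  # hypothetical sample text
cleaned = sample.translate(table)

print(cleaned.split())
```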

For Python 2:

from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print( obama_4427_str.split())

Output:

['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great', 
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow', 
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound', 
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your', 
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................

On another note, you can iterate over a string directly, so list(p) in your own code is redundant.
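If you would rather keep the loop from the question, a corrected version might look like the sketch below. It iterates over the punctuation string directly, passes the single character to replace, and reassigns the same variable so each removal builds on the previous one. The helper name remove_punctuation and the sample sentence are made up for illustration:

```python
from string import punctuation

def remove_punctuation(text):  # hypothetical helper, not part of the question
    # Iterate over the string itself; no list() conversion is needed.
    for punct in punctuation:
        # Replace one character at a time and keep the accumulated result.
        text = text.replace(punct, "")
    return text

cleaned = remove_punctuation("hello, world! it's 2008.")
print(cleaned.split())
```

This makes one pass over the text per punctuation character, so for long texts the regular expression or str.translate approaches are likely to be faster.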