Removing quotes and double quotes from a list of words

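This is my first question on this site, so please forgive any formatting or language mistakes. The question is based on Allen Downey's book "Think Python". The exercise is to write a Python program that reads a book in text format and strips all whitespace (spaces, tabs, and so on) and punctuation. I have tried many different ways to remove the punctuation, but none of them removes the single and double quotes. They stubbornly stay. Below is the last version of the code I tried.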

import string

def del_punctuation(item):
    '''
        This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
        This function reads file, breaks it into 
        a list of used words in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item=item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))

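I have not included the code that removes whitespace because it works fine; I have only included the code that removes punctuation. All punctuation is removed except the single and double quotes. Please help me find the bug in the code. Or does it have something to do with my IDE or compiler? Thanks in advance.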

input.txt:

“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”

“What is his name?”

“Bingley.”

“Is he married or single?”

“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”

“How so? how can it affect them?”

“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”

“Is that his design in settling here?”

The output I get is copied below:

['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']

It has removed all punctuation except the double and single quotes (I guess there are single quotes in the input). Thanks.

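I think your text contains the character ” as the double quote instead of ". The ” character is not in string.punctuation, so you are not removing it. It is probably best to change your del_punctuation function a little: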

def del_punctuation(item):
    '''
        This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
        
    item = item.replace('”','')
    item = item.replace('“','')
    return item
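
A quick check supports this diagnosis: string.punctuation holds only ASCII characters, so the curly quotes are not in it. The sketch below also shows one possible alternative to the character-by-character loop, using str.translate. The names EXTRA_QUOTES and _TABLE, and the choice of which extra characters to strip, are assumptions made for illustration and are not part of the original code.

```python
import string

# string.punctuation is ASCII only, so curly quotes are absent from it.
print('“' in string.punctuation)  # False
print('"' in string.punctuation)  # True

# Hypothetical set of extra characters to strip; extend it for your text.
EXTRA_QUOTES = '“”‘’'

# Translation table that maps every unwanted character to None (deletion).
_TABLE = str.maketrans('', '', string.punctuation + EXTRA_QUOTES)

def del_punctuation(item):
    '''Delete ASCII punctuation and curly quotes from a word.'''
    return item.translate(_TABLE)

print(del_punctuation('“Bingley.”'))  # Bingley
```

Building the table once and calling translate avoids rescanning the word for every punctuation character found.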

Real text can contain a lot of tricky symbols: en dashes, em dashes, more than ten different quotation marks " ' ` ‘ ’ “ ” « » ‹ ›, and so on.

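It makes little sense to try to enumerate every possible punctuation mark. The common approach is to keep only the letters (and spaces). The simplest way to do that is with a regular expression: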

import re

text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”

“What is his name?”

“Bingley.”

“Is he married or single?”

“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”

“How so? how can it affect them?”

“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”

“Is that his design in settling here?”'''

# remove everything except letters, spaces, \n and, for example, dashes
# note: the range A-z would also match [ \ ] ^ _ and `, so spell out A-Za-z
text = re.sub(r"[^A-Za-z \n-]", "", text)

# split the text by spaces and \n
output = text.split()

print(output)
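
The regular expression above keeps only ASCII letters, so accented letters would be dropped as well. As a possible alternative, here is a minimal sketch that removes every character whose Unicode general category starts with "P" (punctuation), using the standard unicodedata module. The function name strip_punctuation and the sample sentence are assumptions for illustration. Note that symbols such as $, + or ` are in the "S" categories, so this sketch leaves them in place.

```python
import unicodedata

def strip_punctuation(text):
    '''Remove every character whose Unicode category starts with "P".'''
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

# Hypothetical sample mixing curly quotes and guillemets.
sample = '“My dear Mr. Bennet,” replied his wife, «how can you be so tiresome!»'
print(strip_punctuation(sample).lower().split())
```

This handles curly quotes, guillemets and dashes without listing them one by one, at the cost of also removing hyphens and apostrophes inside words.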

But in reality things are much more complicated than they look at first glance. Say, is I'm two words? Probably it is. What about someone's? Or rock'n'roll?
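
One way to sidestep the question is to extract words rather than delete punctuation, and to treat an apostrophe between letters as part of the word. The sketch below is only one possible convention, not a definitive tokenizer; WORD_RE and words are hypothetical names, and it assumes ASCII letters only.

```python
import re

# One or more letters, optionally followed by apostrophe-joined letter groups.
# Both the straight (') and the curly (’) apostrophe are accepted.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")

def words(text):
    '''Return the lower-cased words of text, keeping inner apostrophes.'''
    return WORD_RE.findall(text.lower())

print(words("“I'm sure someone's playing rock'n'roll,” he said."))
# ["i'm", 'sure', "someone's", 'playing', "rock'n'roll", 'he', 'said']
```

With this convention I'm, someone's and rock'n'roll each count as a single word, while quotes at the edges of a word are still dropped.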