Turning text file into a string (Python 3)

I want to turn a text file into a string, and I was given this function, which is written in Python 2:

def parseOutText(f):
    f.seek(0)  
    all_text = f.read()

    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        words = text_string

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)

    return words

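As you can see, I have to add some code to this function, but it doesn't run (the interpreter reports that 'string' has no function 'maketrans'). I'm sure this code could easily be translated to Python 3, but I don't really understand what it does up to the commented lines. Does it simply strip the punctuation and turn the text into a string?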

So I found this code, which works well:

import string
exclude = set(string.punctuation)
# text is the input str (naming it `string` would shadow the module)
text = ''.join(ch for ch in text if ch not in exclude)

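Python 3.x's maketrans and translate have all the basic functionality of their Python 2 predecessors, and more, but they have a different API. So you have to understand what they're doing in order to use them.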

translate in 2.x took a very simple table, made by string.maketrans, plus a separate list of deletechars.

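In 3.x, the table is more complex (largely because it now translates Unicode characters rather than bytes, but it also has other new features). The table is made by the static method str.maketrans rather than the function string.maketrans. And the table includes the deletion list, so you don't need a separate argument to translate.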
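For comparison, here is a small sketch of the two call shapes side by side. It relies on the fact that bytes.translate in Python 3 still accepts a table (or None) plus a separate delete argument, much like the 2.x str.translate did. The sample data is made up for illustration.

```python
import string

data = b"Hello, world! It's fine."

# 2.x-style shape, still available on bytes in Python 3:
# a table (None means "no remapping") plus a separate delete argument.
cleaned_bytes = data.translate(None, string.punctuation.encode("ascii"))

# 3.x str shape: the deletion list is folded into the table itself.
text = data.decode("ascii")
cleaned_text = text.translate(str.maketrans("", "", string.punctuation))

print(cleaned_bytes)  # b'Hello world Its fine'
print(cleaned_text)   # Hello world Its fine
```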

From the docs:

static str.maketrans(x[, y[, z]])

This static method returns a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters (strings of length 1) to Unicode ordinals, strings (of arbitrary lengths) or None. Character keys will then be converted to ordinals.

If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
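To make the three forms described in the docs concrete, here is a minimal sketch. The input strings and mappings are arbitrary examples, not anything from the question.

```python
# One argument: a dict mapping characters (or ordinals) to a replacement
# string, an ordinal, or None (None deletes the character).
table_dict = str.maketrans({"a": "A", "-": None, "&": " and "})
one_arg = "a-b&c".translate(table_dict)

# Two arguments: equal-length strings, mapped position by position.
table_pairs = str.maketrans("abc", "xyz")
two_args = "aabbcc".translate(table_pairs)

# Three arguments: the third string lists characters to delete.
table_delete = str.maketrans("", "", "!?")
three_args = "Hi! Really?".translate(table_delete)

print(one_arg)     # Ab and c
print(two_args)    # xxyyzz
print(three_args)  # Hi Really
```

In every case the result of str.maketrans is a plain dict keyed by Unicode ordinals, which is why a single table can carry both replacements and deletions.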


So, to make a table that deletes all punctuation and does nothing else in 3.x, you do this:

table = str.maketrans('', '', string.punctuation)

And to apply it:

translated = s.translate(table)
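Putting those two lines back into the function from the question, a Python 3 port might look roughly like the sketch below. The stemming step from the comments is left out, and the sample input (an in-memory file built with io.StringIO) is a made-up stand-in for a real file.

```python
import io
import string

def parseOutText(f):
    """Return the text after the "X-FileName:" header, ASCII punctuation removed."""
    f.seek(0)  # go back to the beginning of the file
    all_text = f.read()

    # Everything after the first "X-FileName:" marker is treated as the body.
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        # The deletion list is the third argument of str.maketrans in 3.x.
        table = str.maketrans("", "", string.punctuation)
        words = content[1].translate(table)
    return words

# Hypothetical sample input, for illustration only.
sample = io.StringIO("To: someone\nX-FileName: notes.txt\nHello, world! It's fine.")
result = parseOutText(sample)
print(repr(result))  # ' notestxt\nHello world Its fine'
```

So yes: up to the commented lines, the function only reads the file, keeps what follows the header, and strips punctuation from it.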

Meanwhile, since you're dealing with Unicode, are you sure string.punctuation is what you want? As the docs say, this is:

String of ASCII characters which are considered punctuation characters in the C locale.

So, for example, curly quotes, punctuation used in non-English languages, and so on won't be removed.

If that's an issue, you'd have to do something like this:

translated = ''.join(ch for ch in s if unicodedata.category(ch)[0] != 'P')
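A quick side-by-side sketch of the two approaches, using a made-up sample string that contains curly quotes and an inverted question mark:

```python
import string
import unicodedata

s = "“Hola”, ¿qué tal? It's fine."

# ASCII-only: removes just the characters listed in string.punctuation.
ascii_only = s.translate(str.maketrans("", "", string.punctuation))

# Unicode-aware: removes every character whose general category starts with "P".
unicode_aware = "".join(ch for ch in s if unicodedata.category(ch)[0] != "P")

print(ascii_only)     # “Hola” ¿qué tal Its fine
print(unicode_aware)  # Hola qué tal Its fine
```

One thing to keep in mind: the two sets are not nested. Some characters in string.punctuation, such as `$` and `+`, are classified by Unicode as symbols (category "S") rather than punctuation, so the category-based filter leaves them in place.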

Change this line:

text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

to this:

text_string = content[1].translate(str.maketrans("", "", string.punctuation))