python split a text file function

Question

I wrote a tokenize function that takes the string representation of a document and splits it into a list of words.

My code:

import re

def tokenize(document):
    x = document.lower()
    return re.findall(r'\w+', x)

My output:

tokenize("Hi there. What's going on? first-class")
['hi', 'there', 'what', 's', 'going', 'on', 'first', 'class']

Expected output:

['hi', 'there', "what's", 'going', 'on', 'first-class']

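Basically, I want words containing apostrophes or hyphens to stay as single words in the list, along with the double quotes. How can I change my function to get the desired output?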

Answer 1

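\w+ matches one or more word characters (letters, digits, and the underscore); this does not include apostrophes or hyphens.

You need to use a character set here to tell Python exactly what you want to match: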
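A quick way to see this is to run \w+ over a short string. The sample text below is made up for illustration and is not from the question:

```python
import re

sample = "what's first-class file_1"
# \w covers letters, digits and the underscore, so "file_1" survives intact,
# while the apostrophe and the hyphen act as separators.
words = re.findall(r"\w+", sample)
print(words)
```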

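>>> import re
>>> def tokenize(document):
...     return re.findall("[A-Za-z'-]+", document)
...
>>> tokenize("Hi there. What's going on? first-class")
['Hi', 'there', "What's", 'going', 'on', 'first-class']
>>>

You'll also notice that I removed the x = document.lower() line. It is no longer needed for matching, because adding A-Z to the character set matches uppercase characters as well. Note that the tokens then keep their original case; to get the lowercase output shown in the question, pass document.lower() to re.findall instead.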
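One possible refinement, offered as a sketch and not as part of the original answer: the character set above also accepts a lone apostrophe or hyphen as a token and leaves case untouched. The version below lowercases the input first and only allows an apostrophe or hyphen between letters. The name TOKEN_RE and the second test string are hypothetical, chosen for illustration.

```python
import re

# Hypothetical stricter pattern: runs of letters, optionally joined by
# single apostrophes or hyphens, so stray punctuation never becomes a token.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def tokenize(document):
    # Lowercase first so the tokens match the lowercase output in the question.
    return TOKEN_RE.findall(document.lower())

print(tokenize("Hi there. What's going on? first-class"))
print(tokenize("well - 'quoted' text"))
```

This assumes the text is plain ASCII English; accented letters and digits would need to be added to the character classes.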
