Regex excluding newline

Question

I have a simple word counter that works with one exception: it also splits on the \n character.

The small sample text file is:

'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''

Line 1 has 10 words and line 2 has 11. Total word count = 21.

This code produces a count of 22, because it includes the \n character at the end of line 1:

import re


testfile = r"d:\python\workbook\words2.txt"

number_of_words = 0

with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(r",|\s", line))

print(number_of_words)

If I change the regex to: number_of_words += len(re.split(",|^\n|\s", line)) the word count (22) stays the same.

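My question is: why did excluding the newline with [^\n] fail? More broadly, what is the correct way to write my regex so that it excludes the trailing \n and the code above arrives at the correct total of 21 words?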

Answer 1

You can simply use:

number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.findall(r'\w+', line))
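
As for why the original attempt gives 22, here is a minimal sketch. It assumes the sample text sits in a string (the hypothetical variable `text`) instead of being read from words2.txt, and that the last line has no trailing newline, which is what a count of 22 suggests. It compares the question's pattern, the `^\n` variant, `re.findall`, and a split that drops empty strings:

```python
import re

# Sample text from the question, kept in a string for illustration only.
# Assumption: the file has no newline after the last line.
text = (
    "A tree is a woody perennial plant,typically with branches.\n"
    "I added this second line,just to add eleven more words."
)

split_total = 0
caret_total = 0
findall_total = 0
filtered_total = 0

for line in text.splitlines(keepends=True):
    # The trailing "\n" matches \s, so re.split returns an extra empty
    # string after it, and len() counts that empty string as a word.
    split_total += len(re.split(r",|\s", line))

    # Outside a character class, ^ anchors to the start of the string,
    # so ^\n only matches a newline at position 0; nothing changes here.
    caret_total += len(re.split(r",|^\n|\s", line))

    # \w+ matches runs of word characters only, so no empty items appear.
    findall_total += len(re.findall(r"\w+", line))

    # Alternative: keep re.split but discard the empty pieces.
    filtered_total += len([p for p in re.split(r",|\s", line) if p])

print(split_total, caret_total, findall_total, filtered_total)
```

Note that `^\n` in the alternation is not the same thing as `[^\n]`. The caret negates only inside square brackets, and using `[^\n]` as a split separator would not help either, since it would treat every non-newline character as a delimiter. The extra item comes from `re.split` returning an empty string whenever the input ends with a separator, so either match the words directly with `re.findall` or filter out the empty strings.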

regex

python-3.6