How can I remove numbers that may occur at the end of words in a text

Question

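I have text data that I want to clean with regular expressions. However, some words in the text are immediately followed by numbers that I want to remove.

For example, one line of the text is: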

Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32

The first word of the text above should be 'Preface' rather than 'Preface2', and so on. This is what I tried:

line = re.sub(r"[A-Za-z]+(\d+)", "", line)

However, this deletes the words along with the numbers, as seen here:

Pes Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and

How can I capture only the numbers that immediately follow a word?

Answer 1

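You can use a lookbehind assertion to check for a word character before the digits, and a word boundary (\b) at the end to force the regex to match only digits at the end of a word. Python's re module only accepts fixed-width lookbehind patterns (a quantifier such as \w+ inside the lookbehind raises an error), so the assertion checks a single character:

re.sub(r'(?<=\w)\d+\b', '', line)

Hope it helps.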

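EDIT: Sorry about the glitch mentioned in the comments, where digits that are not preceded by a word get matched too. That happens because (sorry again) \w matches alphanumeric characters, not only alphabetic ones. Depending on what you want to remove, you can use the positive version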

re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)

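which only checks for English alphabetic characters before the digits (you can add characters to the [a-zA-Z] list), or the negative version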

re.sub(r'(?<![\d\s])\d+\b', '', line)

which accepts anything before the digits that is not \d (a digit) or \s (whitespace). Note that this also matches digits that follow punctuation.
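To make the difference between the two variants concrete, here is a small sketch. The sample string and the names `sample`, `positive` and `negative` are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample: glued numbers, standalone numbers, and a number after punctuation.
sample = "Vietnam16 Park 24 Chapter 5 (see p.7) References32"

# Positive lookbehind: the digits must be preceded by an ASCII letter.
positive = re.sub(r'(?<=[a-zA-Z])\d+\b', '', sample)

# Negative lookbehind: the digits must not be preceded by a digit or whitespace.
negative = re.sub(r'(?<![\d\s])\d+\b', '', sample)

print(positive)  # Vietnam Park 24 Chapter 5 (see p.7) References
print(negative)  # Vietnam Park 24 Chapter 5 (see p.) References
```

Both leave the standalone 24 and 5 alone; only the negative version strips the 7 after the period.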

Answer 2

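You can also use a character range for the digits. Note that this removes every digit in the line, including standalone numbers such as 24 and 5: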

re.sub(r"[0-9]", "", line)
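A quick check of what the digit range does to standalone numbers may help when deciding whether this approach fits. The sample string and the name `stripped` are assumptions for illustration only.

```python
import re

# Hypothetical sample with two standalone numbers and one glued to a word.
sample = "Park 24 Chapter 5 Vietnam28"

# Every digit is deleted, wherever it appears.
stripped = re.sub(r"[0-9]", "", sample)

# repr() makes the leftover double spaces visible.
print(repr(stripped))  # 'Park  Chapter  Vietnam'
```

The standalone numbers disappear too, leaving double spaces behind, so this only suits text where no number needs to be kept.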

Answer 3

You can capture the text part and replace the whole match with the captured part. It is simply written:

re.sub(r"([A-Za-z]+)\d+", r"\1", line)
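As a rough illustration of why the capture group matters, the sketch below runs the question's attempt and the capture-group version side by side. The shortened `line` and the names `deleted` and `cleaned` are made up for this example.

```python
import re

# Hypothetical shortened version of the line from the question.
line = "Preface2 Contributors4 Park 24 Chapter 5 References32"

# The question's attempt: the whole match (letters and digits) is replaced by nothing.
deleted = re.sub(r"[A-Za-z]+(\d+)", "", line)

# Capture the letters and put them back with \1, so only the digits are dropped.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", line)

print(repr(deleted))  # '  Park 24 Chapter 5 '
print(cleaned)        # Preface Contributors Park 24 Chapter 5 References
```

Numbers separated from a word by a space, such as 24 and 5, are not touched because the pattern requires the digits to directly follow the letters.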

Answer 4

Try one of these:

line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line)  # just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line)  # just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line)  # same as the first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line)  # same as the second one

\\1 refers to the word and \\2 to the number. See: How to use python regex to replace using captured group?
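The backslash handling is easy to get wrong, so here is a small sketch of how the different spellings behave. The sample `line` and the names `words`, `numbers` and `broken` are assumptions for illustration.

```python
import re

# Hypothetical sample line.
line = "Preface2 Contributors4"
pattern = r"([A-Za-z]+)(\d+)"

# An escaped backslash and a raw string spell the same two characters.
assert "\\1" == r"\1"
# Without escaping or the r prefix, "\1" is an octal escape: the control character U+0001.
assert "\1" == "\x01"

words = re.sub(pattern, r"\1", line)    # keep group 1 (the word)
numbers = re.sub(pattern, r"\2", line)  # keep group 2 (the number)
broken = re.sub(pattern, "\1", line)    # inserts a control character instead of the word

print(words)         # Preface Contributors
print(numbers)       # 2 4
print(repr(broken))  # '\x01 \x01'
```

Using raw strings for both the pattern and the replacement avoids having to think about which backslashes Python consumes before the regex engine sees them.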

Answer 5

Below is a working code sample that may solve your problem.

Here is the snippet:

import re

# I will write a function that takes the test data as input and returns the
# desired result as stated in your question.

def transform(data):
    """Strip trailing numbers from words in the given text."""
    # First, let's construct a pattern matching the words we're looking for.
    pattern1 = r"([A-Za-z]+\d+)"

    # Let's construct another pattern that matches the trailing digits to
    # remove from each of those words.
    pattern2 = r"\d+$"

    # Find all matching words.
    matches = re.findall(pattern1, data)

    # Build the replacement for each word by stripping its trailing digits.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))

    # Intermediate variable pairing each word with its replacement, for
    # use with the string method 'replace'.
    changers = zip(matches, replacements)

    # Apply each replacement in turn; str.replace returns a new string,
    # so the result has to be reassigned.
    output = data
    for changer in changers:
        output = output.replace(*changer)

    # The work is done, we can return the result.
    return output

For testing purposes, we run the function above on your test data:

data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons     
learnt from the RUPES project12 Payment for environmental service and it potential and 
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams 
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter 
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao 
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang 
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""

result = transform(data)

print(result)

The result looks like this:

Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from 
the RUPES project Payment for environmental service and it potential and example in 
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and 
programmes Chapter Creating incentive for Tri An watershed protection Chapter 
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building 
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong 
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay 
Marine Protected Area Vietnam Synthesis and Recommendations References

python

regex

regex-group