Replacing HTML but saving the word sticking at the end

Question

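I am working with text data and want to remove any HTML code enclosed in "<" and ">". For example:

< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing

So I am using the code below: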

import re

def remove_html(s):
    s = re.sub(r'[^\S]*<[^\S]*', "", s)
    s = re.sub(r'[^\S]*>[^\S]*', "", s)
    return s

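Running the code gives the following result:

Solutions Australia LSA is a national labour hire and sourcing

I don't want the word "Labour" to be removed, but it is removed because it sticks to the ">". Is there any way to save it? Please advise.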

Answer 1

import re
def remove_html(data):
    return re.sub('<[^>]+>', '', data).strip()

test_case = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA) is a national labour hire and sourcing'
print(remove_html(test_case))

Output:

Labour Solutions Australia (LSA) is a national labour hire and sourcing
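
To see why the word next to ">" survives, it can help to look at what the pattern actually matches. The following is only a small sketch built on the same regex as above; the names TAG_PATTERN, sample, matched_tags and cleaned are made up for illustration, and the sample string is a shortened version of the test case.

```python
import re

# Same pattern as in the answer: "<", then one or more characters
# that are not ">", then the closing ">".
TAG_PATTERN = re.compile(r'<[^>]+>')

# Shortened version of the test case (illustrative only).
sample = '< HTML > < p style="text-align:justify" >Labour Solutions Australia (LSA)'

# The spans that will be deleted. Each match ends at ">",
# so "Labour" is outside both of them.
matched_tags = TAG_PATTERN.findall(sample)
print(matched_tags)

# Delete only the matched spans, then trim the leftover outer whitespace.
cleaned = TAG_PATTERN.sub('', sample).strip()
print(cleaned)
```

Because each match stops at the first ">", only the text between the brackets (and the brackets themselves) is removed, and whatever touches the ">" from the outside is left alone. Note that this is a plain regex approach, so it assumes every "<" in the text opens a tag that is closed by a ">".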

python

dataframe

data-cleaning

python-re