Extract text from HTML tags using regex

Question

My HTML text looks like the sample below. I want to extract only the plain text from it in Python using a regex (without using an HTML parser).

&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;

How do I find the exact regex that returns just the plain text?

Answer 1

You can do this in JavaScript with a simple selector method and then read the .innerHTML property.

// Select the elements with the class you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
// Take the first element of the HTMLCollection returned by the selector method and get its inner HTML
let text = div[0].innerHTML;

This selects the element whose HTML you want to retrieve and then extracts its inner HTML, assuming you only want the content between the tags and not the tags themselves.

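No regex is needed for this. You would have to run the regex from JS or some backend anyway, and as long as you can insert a JS script into your project, you can get the inner HTML.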

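If you are scraping data, then whatever language you use, your library most likely has selectors or methods for retrieving the text of the HTML easily, without regex.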
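Since the question is about Python, here is a minimal sketch of that idea using only the standard library's html.parser module. The class name TextCollector, the function extract_text, and the shortened sample string are made up for illustration; it also assumes the input is entity-escaped like the markup in the question, so it is decoded first.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(escaped_markup):
    # The markup in the question is entity-escaped, so decode it before parsing.
    parser = TextCollector()
    parser.feed(html.unescape(escaped_markup))
    parser.close()
    return " ".join(parser.chunks)


# Shortened, made-up sample in the same shape as the question's markup.
sample = "&lt;p&gt;&lt;span&gt;\nHello, world.\n&lt;/span&gt;&lt;/p&gt;"
print(extract_text(sample))
```

Third-party scraping libraries usually offer something shorter, but this version needs no installation and tolerates markup that is not well-formed XML.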

Answer 2

You are better off using a parser here:

import html, xml.etree.ElementTree as ET

# decode
string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

# construct the dom
root = ET.fromstring(html.unescape(string))

# search it
for p in root.findall("*"):
    print(p.text)

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

Obviously, you might want to change the XPath expression, so have a look at the possibilities.
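As a rough illustration of what changing the path can look like, the sketch below uses a descendant path and itertext() on a made-up, shortened string (not the markup from the question). Note that ET.fromstring only accepts well-formed XML, so this assumes the decoded HTML happens to be well-formed.

```python
import html
import xml.etree.ElementTree as ET

# Made-up sample with text before, inside, and after a nested element.
escaped = "&lt;p&gt;Intro &lt;span&gt;nested text&lt;/span&gt; tail.&lt;/p&gt;"
root = ET.fromstring(html.unescape(escaped))

# ".//span" matches every <span> at any depth below the root,
# whereas "*" only matches the root's direct children.
span_texts = [span.text for span in root.findall(".//span")]
print(span_texts)

# itertext() walks the whole subtree and yields every text and tail
# string in document order, so joining it gives all the plain text.
all_text = "".join(root.itertext())
print(all_text)
```

itertext() is often the more convenient choice when the text is spread over several nested elements, because findall() with .text only returns the text that comes directly after each opening tag.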

Addendum:

You can use a regex here, but this approach is really error-prone and not advisable:

import re

string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

The idea is to look for an uppercase letter and then match word characters, whitespace, and commas up to a dot. See a demo on regex101.com.
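The pattern above depends on the sentence starting with a capital letter and containing only word characters, whitespace, and commas. A somewhat more general regex-only sketch, closer to what the question asks for, is to decode the entities and then delete anything that looks like a tag. The names TAG_RE and strip_tags and the shortened sample are made up for illustration, and the approach is still fragile: a ">" inside an attribute value, comments, or script and style contents will all trip it up.

```python
import html
import re

# Anything from "<" up to the next ">" is treated as a tag (an assumption
# that real-world HTML does not always satisfy).
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(escaped_markup):
    # Decode &lt;, &gt;, &quot; and friends so the tags become visible.
    decoded = html.unescape(escaped_markup)
    # Replace each tag with a space so neighbouring words do not fuse.
    without_tags = TAG_RE.sub(" ", decoded)
    # Collapse the leftover runs of whitespace and newlines.
    return " ".join(without_tags.split())


# Shortened, made-up sample in the same shape as the question's markup.
sample = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt;\n"
    "Some plain text.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)
print(strip_tags(sample))
```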
