Extract text from HTML tags using regex
My HTML text looks like this. I want to extract only the plain text from it in Python using a regex (without using an HTML parser).
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
How can I find a regular expression that returns exactly the plain text?
You can do this in JavaScript with a simple selector method and then read the .innerHTML property.
//select the class for which you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
//select the first element of NodeList returned from selector method and get the inner HTML
let text = div[0].innerHTML;
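This selects the element whose HTML you want to retrieve and then extracts its inner HTML, assuming you only want the content between the tags and not the tags themselves.
No regex is needed for this. You would have to run the regex in JS or on some backend anyway, so as long as you can insert a JS script into your project, you can get the inner HTML.
If you are scraping data, then whatever language you use, your library most likely has selectors or methods that retrieve the HTML text easily, without a regex.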
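Since the question is about Python, here is a minimal sketch of that idea using only the standard library's html.parser module. The names TextExtractor and extract_text are made up for this illustration, and the sample string is a shortened version of the markup in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # HTMLParser calls this for every run of text between tags.
        self.chunks.append(data)


def extract_text(markup):
    """Return the text content of markup with surrounding whitespace removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


# Shortened sample, assumed for illustration only.
sample = (
    '<p style="text-align: justify;"><span style="font-size: small;">\n'
    "Irrespective of the kind of small business you own.\n"
    "</span></p>"
)
print(extract_text(sample))
# Irrespective of the kind of small business you own.
```

Unlike a regex, this keeps working when attributes contain awkward characters or when tags are nested more deeply.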
You are better off using a parser here:
import html, xml.etree.ElementTree as ET
# decode
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
# construct the dom
root = ET.fromstring(html.unescape(string))
# search it
for p in root.findall("*"):
    print(p.text)
This yields
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
Obviously, you may want to change the XPath expression, so have a look at the possibilities.
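As a rough sketch of what changing the path can look like: the snippet below selects every p element at any depth and joins all text inside it with itertext(), so nested inline tags do not cut the text short. The markup is invented for this example, and ElementTree only accepts well-formed XML, so real-world HTML may not parse this way.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup used only for this illustration.
markup = "<div><p>First <b>bold</b> part.</p><p>Second part.</p></div>"
root = ET.fromstring(markup)

# ".//p" matches <p> elements at any depth below the root;
# itertext() yields the text of the element and all of its descendants.
paragraphs = ["".join(p.itertext()).strip() for p in root.findall(".//p")]
print(paragraphs)
# ['First bold part.', 'Second part.']
```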
Addendum:
You can use a regular expression here, but this approach is really error-prone and not advisable:
import re
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')
print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']
The idea is to look for an uppercase letter and then match word characters, whitespace and commas up to a dot. See a demo on regex101.com.
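If a regex really is the only option, a different and somewhat more general pattern is to strip anything that looks like a tag instead of guessing what a sentence looks like. This is only a sketch, not the pattern from the answer above: strip_tags and TAG_RE are names chosen for this example, and the second call shows one of the ways it breaks.

```python
import html
import re

# Naive tag pattern: "<", then anything up to the next ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove tag-like spans, decode entities and collapse whitespace."""
    without_tags = TAG_RE.sub("", markup)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
print(strip_tags(string))
# Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

# A ">" inside an attribute value already defeats the pattern:
print(strip_tags('<a title="a > b">link</a>'))
# b">link
```

Comments, script blocks and adjacent block elements (whose text gets glued together) are further cases this does not handle, which is why a parser remains the safer choice.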