How can I match single and double HTML class attributes in a regular expression?
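I have an HTML element whose class attribute can contain two different classes, but in some cases it has only one. When there are two classes, they are separated by a space:
"rating-inbtn hide-if-zero-113"
or
"rating-inbtn"
How can I match both patterns with a regular expression?
For reference, here is the markup from my earlier post: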
<span class="vote-actions">
<a class="btn btn-default vote-action-good">
<span class="icon thumb-up black black-hover"> </span>
<span class="rating-inbtn">215</span>
</a>
<a class="btn btn-default vote-action-bad">
<span class="icon thumb-down grey black-hover"> </span>
<span class="rating-inbtn">82</span>
</a>
</span>
I use this regular expression to extract the ratings:
a = re.findall('rating-inbtn">(.*?)</span>', webpage)
like_count = a[0]
dislike_count = a[1]
But sometimes the span's class attribute has an additional class, "hide-if-zero-113". How should I handle the pattern in that case?
Thanks
It depends on the boundaries you want to add to the expression. For example, we could start with:
\s*([a-z0-9-]+)(?:\s+)?([a-z0-9-]+)?\s*
The expression is explained in the top-right panel of this demo, if you wish to explore further or modify it, and in this link you can watch step by step how it matches against some sample inputs, if you like.
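As a quick check of that starting expression, here is a minimal sketch that applies it to the two class values from the question. The use of `re.fullmatch` and the variable names are assumptions made for illustration; the answer itself only gives the pattern.

```python
import re

# The starting expression from the answer: one class name, then an optional second one.
class_pattern = re.compile(r"\s*([a-z0-9-]+)(?:\s+)?([a-z0-9-]+)?\s*")

values = ["rating-inbtn hide-if-zero-113", "rating-inbtn"]

# fullmatch requires the whole class value to fit the pattern.
results = [class_pattern.fullmatch(value).groups() for value in values]

for value, groups in zip(values, results):
    print(value, "->", groups)
# rating-inbtn hide-if-zero-113 -> ('rating-inbtn', 'hide-if-zero-113')
# rating-inbtn -> ('rating-inbtn', None)
```

The second group is simply `None` when only one class is present, so both forms are accepted by the same expression.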
Edit:
To get those ratings, this expression might be enough:
rating-inbtn[^>]+>\s*([^\s<]+)\s*<\/
Test with re.findall
import re
regex = r"rating-inbtn[^>]+>\s*([^\s<]+)\s*<\/"
test_str = ("<span class=\"vote-actions\">\n"
" <a class=\"btn btn-default vote-action-good\">\n"
" <span class=\"icon thumb-up black black-hover\"> </span>\n"
" <span class=\"rating-inbtn\">215</span>\n"
" </a>\n"
" <a class=\"btn btn-default vote-action-bad\">\n"
" <span class=\"icon thumb-down grey black-hover\"> </span>\n"
" <span class=\"rating-inbtn\">82</span>\n"
"<span class=\"rating-inbtn\"> 74 </span>\n"
"<span class=\"rating-inbtn hide-if-zero-113\"> 99 </span>\n"
" </a>\n"
"</span>")
print(re.findall(regex, test_str))
Output
['215', '82', '74', '99']
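To see why the pattern from the question misses the two-class case, the following sketch runs both expressions side by side on a reduced sample. The shortened `html` string is an assumption made for brevity; the two patterns are taken verbatim from the question and from this answer.

```python
import re

html = (
    '<span class="rating-inbtn">215</span>\n'
    '<span class="rating-inbtn hide-if-zero-113"> 99 </span>\n'
)

# Pattern from the question: expects the closing quote right after the class name.
original = re.findall(r'rating-inbtn">(.*?)</span>', html)

# Pattern from this answer: [^>]+ consumes whatever remains of the opening tag.
relaxed = re.findall(r"rating-inbtn[^>]+>\s*([^\s<]+)\s*<\/", html)

print(original)  # ['215']
print(relaxed)   # ['215', '99']
```

The relaxed pattern also trims the whitespace around the number, because `\s*` sits outside the capture group.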
Test with re.finditer
import re
regex = r"rating-inbtn[^>]+>\s*([^\s<]+)\s*<\/"
test_str = ("<span class=\"vote-actions\">\n"
" <a class=\"btn btn-default vote-action-good\">\n"
" <span class=\"icon thumb-up black black-hover\"> </span>\n"
" <span class=\"rating-inbtn\">215</span>\n"
" </a>\n"
" <a class=\"btn btn-default vote-action-bad\">\n"
" <span class=\"icon thumb-down grey black-hover\"> </span>\n"
" <span class=\"rating-inbtn\">82</span>\n"
"<span class=\"rating-inbtn\"> 74 </span>\n"
"<span class=\"rating-inbtn hide-if-zero-113\"> 99 </span>\n"
" </a>\n"
"</span>")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # Report the position and text of the whole match.
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Capture groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
Maybe I'm missing something, but you don't need a regular expression to extract the numbers from the markup:
data = '''<span class="vote-actions">
<a class="btn btn-default vote-action-good">
<span class="icon thumb-up black black-hover"> </span>
<span class="rating-inbtn">215</span>
</a>
<a class="btn btn-default vote-action-bad">
<span class="icon thumb-down grey black-hover"> </span>
<span class="rating-inbtn">82</span>
</a>
</span>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print([span.text for span in soup.select('span.rating-inbtn')])
Prints:
['215', '82']
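The sample above only contains single-class spans. As a hedged sketch, the same selector can be tried on the two-class case from the question; the reduced `html` string, the `'html.parser'` backend (chosen so lxml is not required), and `get_text(strip=True)` are choices made here for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = '''<span class="rating-inbtn">215</span>
<span class="rating-inbtn hide-if-zero-113"> 99 </span>'''

# 'html.parser' is the standard-library backend, so lxml is not needed here.
soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) trims the whitespace around each number.
ratings = [span.get_text(strip=True) for span in soup.select('span.rating-inbtn')]
print(ratings)  # ['215', '99']
```

The selector matches on the class name itself, so the extra "hide-if-zero-113" class does not affect the result.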
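I'll elaborate on one of the other answers given. In the example, you are looking at two elements that share the same class, so that class alone should be enough to match both. Your topmost example shows a compound class (multiple class names on one element), but it again shares the same class, rating-inbtn.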
soup.select('.rating-inbtn')
where "." is a CSS class selector.
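To make concrete what a class selector does, here is a rough standard-library sketch: it treats the class attribute as a whitespace-separated list and checks whether "rating-inbtn" is one of the entries. The `RatingParser` class is hypothetical and written only for illustration; it does not handle nested spans and is not how BeautifulSoup is implemented.

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collects the text of <span> elements whose class list contains 'rating-inbtn'."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside a matching <span>
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        # Split the class attribute on whitespace, as a class selector would.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'rating-inbtn' in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.ratings.append(data.strip())


parser = RatingParser()
parser.feed('<span class="rating-inbtn">215</span>'
            '<span class="rating-inbtn hide-if-zero-113"> 99 </span>'
            '<span class="icon thumb-up"> </span>')
parser.close()
print(parser.ratings)  # ['215', '99']
```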
Expanding on the other answers:
In future you can pass a comma-separated list to match multiple classes (really, multiple selectors), e.g.
soup.select('.rating-inbtn, .otherClass')
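A small sketch of that selector list in use follows. The sample markup is made up for illustration, ".otherClass" is just the placeholder name from the line above, and `'html.parser'` is chosen here so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '''<span class="rating-inbtn">215</span>
<span class="otherClass">7</span>
<span class="icon">x</span>'''

soup = BeautifulSoup(html, 'html.parser')

# An element is selected if it matches either selector in the comma-separated list.
texts = [el.get_text(strip=True) for el in soup.select('.rating-inbtn, .otherClass')]
print(texts)  # ['215', '7']
```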