BeautifulSoup: extract the string between BR tags, but include <B>string</B>

Question

<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>

I am trying to get each whole line, like:

Soccer : <b>11</b>

So far I have tried this code:

for br in body.findAll('br'):
    following = br.nextSibling
    print(following.strip())

But it only produces:

Soccer:
Volley Ball:
Basketball:
Tennis:

Answer 1

You can solve this with an approach similar to the one you already started, or with a regular expression.

Option #1

from bs4 import BeautifulSoup


html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

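body = BeautifulSoup(html, 'lxml')

between_br = []
for br in body.find_all('br'):
    # The node right after each <br> is the text holding the sport name.
    following = br.next_sibling

    # Skip a <br> followed by nothing or only whitespace (the last one).
    if following is None or not str(following).strip():
        continue

    sport = following.strip()
    # The element after that text node is the <b> tag holding the score.
    score = str(following.next_element)

    combined = ' '.join((sport, score))
    between_br.append(combined)

print('\n'.join(between_br))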
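
The loop above relies on the <b> tag being the very next node after the text. As a hedged variant (not part of the original answer), the sketch below walks every sibling after each <br> until the next <br>, so it keeps whatever text and tags sit in between. It assumes the <br> tags are all siblings at the same level, uses the built-in 'html.parser' instead of lxml, and the names `soup`, `parts` and `lines` are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

# Assumption: 'html.parser' keeps these nodes as flat siblings.
soup = BeautifulSoup(html, 'html.parser')

lines = []
for br in soup.find_all('br'):
    parts = []
    for node in br.next_siblings:
        # Stop when the next <br> starts a new segment.
        if getattr(node, 'name', None) == 'br':
            break
        # str() keeps the markup of tags such as <b>11</b>.
        parts.append(str(node).strip())
    line = ' '.join(p for p in parts if p)
    if line:  # the last <br> has nothing after it
        lines.append(line)

print('\n'.join(lines))
```

The original question only looked at `nextSibling`, which is the text node alone; the <b> element is a separate sibling, which is why it was missing from the output.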

Option #2

import re


html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

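sports_regex = re.compile(r"""
 (?!<br>)  # Do not start a match at a <br> tag
 (.*       # Sport name: any characters up to the colon
 :\s       # A colon followed by one whitespace character (here, the newline)
 .*)       # The rest of the next line, i.e. the <b> tag with the score
""", re.VERBOSE)

# findall returns the captured group for each match, e.g. 'Soccer:\n<b>11</b>'.
sports = sports_regex.findall(html)
print('\n'.join([s.replace('\n', ' ') for s in sports]))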

Both options print:

Soccer: <b>11</b>
Volley Ball: <b>5</b>
Basketball: <b>5</b>
Tennis: <b>2</b>
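
If the markup is as flat and regular as this sample, a simpler standard-library sketch may also be enough: split the text at each <br> and collapse the whitespace inside every piece. This is an assumption-laden alternative, not taken from the original answer; it will not cope with nested or irregular HTML, and the names `chunks` and `lines` are illustrative only.

```python
import re

html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

# Split at every <br>, also tolerating <br/> and <br />.
chunks = re.split(r'<br\s*/?>', html, flags=re.IGNORECASE)

# Collapse newlines and repeated spaces inside each piece, then drop empties.
lines = [' '.join(chunk.split()) for chunk in chunks]
lines = [line for line in lines if line]

print('\n'.join(lines))
```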
