BeautifulSoup: extract the string between <br> tags, including <b>string</b>
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
I am trying to get the whole line, like:
Soccer : <b>11</b>
So far I have been trying this code:
for br in body.findAll('br'):
    following = br.nextSibling
    print(following.strip())
But it only produces:
Soccer:
Volley Ball:
Basketball:
Tennis:
You can solve this with an approach similar to the one you already started, or with a regular expression.
Option #1
from bs4 import BeautifulSoup
html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""
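# The 'lxml' parser requires the lxml package to be installed
body = BeautifulSoup(html, 'lxml')
between_br = []
for br in body.find_all('br'):
    # The text node right after <br>, e.g. "\nSoccer:\n"
    following = br.next_sibling
    # Skip the last <br>, which is followed only by whitespace (or nothing)
    if following is None or not following.strip():
        continue
    sport = following.strip()
    # next_element is the <b> tag that comes after the text node
    score = str(following.next_element)
    combined = ' '.join((sport, score))
    between_br.append(combined)
print('\n'.join(between_br))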
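As a variation on Option #1, the sketch below walks every sibling after the first <br> and starts a new line at each following <br>, so it does not assume exactly one text node and one <b> tag per entry. The function name lines_between_br and the use of the built-in 'html.parser' are my own assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag

html = "<br>\nSoccer:\n<b>11</b>\n<br>\nVolley Ball:\n<b>5</b>\n<br>\n"

def lines_between_br(markup):
    """Join everything between consecutive <br> tags into one string each."""
    soup = BeautifulSoup(markup, "html.parser")
    first_br = soup.find("br")
    if first_br is None:
        return []
    lines, parts = [], []
    for node in first_br.next_siblings:
        if isinstance(node, Tag) and node.name == "br":
            # A <br> closes the current line
            if parts:
                lines.append(" ".join(parts))
            parts = []
            continue
        # str() keeps the markup of tags such as <b>11</b>
        text = str(node).strip()
        if text:
            parts.append(text)
    if parts:
        # Content after the last <br>, if any
        lines.append(" ".join(parts))
    return lines

print("\n".join(lines_between_br(html)))
```

Because it only looks at siblings of the first <br>, this assumes all the <br> tags share the same parent, as they do in the question's snippet.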
Option #2
import re
html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""
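sports_regex = re.compile(r"""
    (?!<br>)   # Do not start a match at a <br> tag
    (.*        # Sport name: any characters except a newline
    :\s        # A colon followed by one whitespace character (here, the newline)
    .*)        # The rest of the next line, i.e. the <b>...</b> tag
""", re.VERBOSE)
sports = sports_regex.findall(html)
# Each match spans two lines; join them with a space
print('\n'.join(s.replace('\n', ' ') for s in sports))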
Both approaches will print:
Soccer: <b>11</b>
Volley Ball: <b>5</b>
Basketball: <b>5</b>
Tennis: <b>2</b>
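If the markup is as regular as in the question, a shorter regex variant is to split on the <br> tags themselves and normalize the whitespace in each chunk. This is only a minimal sketch: the helper name split_on_br is hypothetical, and like any regex over HTML it assumes simple, well-formed input.

```python
import re

def split_on_br(markup):
    """Split markup on <br> tags and collapse whitespace inside each piece."""
    # Accept <br>, <br/> and <br /> in any letter case
    chunks = re.split(r"<br\s*/?>", markup, flags=re.IGNORECASE)
    # str.split() with no arguments drops newlines and repeated spaces
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

html = "<br>\nSoccer:\n<b>11</b>\n<br>\nTennis:\n<b>2</b>\n<br>\n"
print("\n".join(split_on_br(html)))
```

Unlike the pattern in Option #2, this does not depend on each entry containing a colon, but it would also split on any <br> that appears inside another tag.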