BeautifulSoup: get links between strings
I am using BS4 to get the following out of a website:
<div>Some TEXT with <a href="some Link">some LINK</a>
and some continuing TEXT with following <a href="some Link">some LINK</a> inside.</div>
What I need to get is:
"Some TEXT with some LINK ("https// - actual Link") and some continuing TEXT with following some LINK ("https//- next Link") inside."
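I've been struggling with this for a while and can't figure out how to get there. I have tried various slicing approaches ([:] before, after, and between the pieces) to put everything together.
I hope someone can help me, since I am new to Python. Thanks in advance.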
You can use str.join while iterating over the contents of the div (soup.div.contents):
import bs4
html = '''<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
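For readers new to Python, the one-liner above may be easier to follow as an explicit loop. The following is only a sketch of the same idea: the function name `div_text_with_links` and the example.com URL are made up for illustration, and the handling of `<a>` tags without an `href` is an added assumption that the original answer does not cover.

```python
import bs4


def div_text_with_links(html):
    """Return the first <div>'s text with each link written as 'text (href)'."""
    div = bs4.BeautifulSoup(html, "html.parser").div
    parts = []
    for node in div.contents:
        if isinstance(node, bs4.element.NavigableString):
            # Plain text between tags is kept unchanged.
            parts.append(str(node))
        elif node.name == "a":
            # .get() returns None instead of raising KeyError when href is absent.
            href = node.get("href")
            text = node.get_text()
            parts.append(f"{text} ({href})" if href else text)
        else:
            # Any other direct child tag contributes only its text.
            parts.append(node.get_text())
    return "".join(parts)


html = '<div>Some TEXT with <a href="https://example.com/a">some LINK</a> and <a>no href</a> inside.</div>'
print(div_text_with_links(html))
```

Like the one-liner, this only looks at the direct children of the div, so a link nested inside another tag would lose its URL.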
Edit: to ignore br tags:
html = '''<div>Some TEXT <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents \
if getattr(i, 'name', None) != 'br')
Edit 2: a recursive solution:
def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)
html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.
Additionally, whitespace can become a problem because of br tags and similar elements. To fix this, you can use re.sub:
import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
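The recursive walk and the whitespace cleanup can be wrapped into one helper. This is a minimal sketch, not the answer's exact code: the names `iter_text` and `text_with_links` and the example.com URL are hypothetical, and treating an `<a>` without `href` as an ordinary tag is an added assumption.

```python
import re

import bs4


def iter_text(node):
    """Yield text pieces in document order; links become 'text (href)'."""
    if isinstance(node, bs4.element.NavigableString):
        yield str(node)
    elif node.name == "a" and node.get("href"):
        yield f"{node.get_text(strip=True)} ({node['href']})"
    else:
        # Any other tag (including <a> without href) is walked recursively.
        # Void tags such as <br> have no children and yield nothing.
        for child in node.contents:
            yield from iter_text(child)


def text_with_links(html):
    """Flatten HTML to one line of text, collapsing runs of whitespace."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return re.sub(r"\s+", " ", " ".join(iter_text(soup))).strip()


sample = '<div>Some TEXT <i>other text in </i> <br> with <a href="https://example.com/a">some LINK</a> inside.</div>'
print(text_with_links(sample))
```

Because the pieces are joined with a space, a link followed directly by punctuation ends up with a space before the punctuation mark. HTML comments are also NavigableString instances, so they would show up in the result unless filtered out.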