BeautifulSoup: get links between strings

I'm using BS4 to scrape the following from a website:

<div>Some TEXT with <a href="some Link">some LINK</a>
and some continuing TEXT with following <a href="some Link">some LINK</a> inside.</div>

What I need to get is:

"Some TEXT with some LINK ("https// - actual Link") and some continuing TEXT with following some LINK ("https//- next Link") inside."

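I've been struggling with this for a while and can't figure out how to get there. I've tried various slicing approaches with [:] (before, after, between) to stitch everything together.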

I hope someone can help, as I'm new to Python. Thanks in advance.

You can iterate over the div's contents and combine the pieces with str.join:

import bs4
html = '''<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})' for i in bs4.BeautifulSoup(html, 'html.parser').div.contents)

Output:

'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
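
For readability, the one-liner above can be unpacked into an explicit loop. This is only a sketch of the same idea: the helper name render_children is hypothetical, and falling back to plain text for tags without an href is an assumption added here, not part of the original answer.

```python
import bs4
from bs4.element import NavigableString


def render_children(parent):
    """Join the direct children of `parent`, appending each tag's href in parentheses."""
    parts = []
    for child in parent.contents:
        if isinstance(child, NavigableString):
            # Plain text between tags is kept unchanged.
            parts.append(str(child))
        else:
            # Assumption: tags without an href contribute only their text.
            href = child.get("href")
            text = child.get_text()
            parts.append(f"{text} ({href})" if href else text)
    return "".join(parts)


html = '<div>Some TEXT with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with following <a href="https//- next Link">some LINK</a> inside.</div>'
soup = bs4.BeautifulSoup(html, "html.parser")
result = render_children(soup.div)
print(result)
```

Like the one-liner, this only looks at the direct children of the div, so links nested inside other tags would lose their href.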

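Edit: to ignore br tags (they are Tag objects without an href attribute, so the expression above would raise a KeyError on them), filter them out: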

html = '''<div>Some TEXT <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
result = ''.join(
    i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})'
    for i in bs4.BeautifulSoup(html, 'html.parser').div.contents
    if getattr(i, 'name', None) != 'br'  # skip <br> tags
)

Edit 2: a recursive solution:

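def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        # Plain text node: emit it unchanged.
        yield s
    elif s.name == 'a':
        # Link: emit its text followed by the href in parentheses.
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        # Any other tag: recurse into its children (br has none).
        for i in getattr(s, 'contents', []):
            yield from form_text(i)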

html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))

Output:

Some TEXT  other text in     with  some LINK (https// - actual Link)  and some continuing TEXT with   following  some LINK (https//- next Link)  inside.

Whitespace can also become a problem because of the br tags and similar elements. To fix this, you can use re.sub:

import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))

Output:

'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
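
As an alternative to walking the tree by hand, a possible variation (not part of the original answer) is to replace each link in place with its "text (href)" string and then let get_text flatten the div. The sketch below assumes it is acceptable to modify the parsed tree; the variable name flat is illustrative.

```python
import re

import bs4

html = '''<div>Some TEXT <i>other text in </i> <br> with <a href="https// - actual Link">some LINK</a> and some continuing TEXT with <br> following <a href="https//- next Link">some LINK</a> inside.</div>'''
soup = bs4.BeautifulSoup(html, "html.parser")

# Swap every <a> that has an href for a plain string "text (href)".
# find_all returns a list, so replacing nodes while looping is safe.
for a in soup.find_all("a", href=True):
    a.replace_with(f"{a.get_text(strip=True)} ({a['href']})")

# get_text concatenates all remaining text; collapse runs of whitespace.
flat = re.sub(r"\s+", " ", soup.div.get_text()).strip()
print(flat)
```

This handles links at any nesting depth, at the cost of mutating soup; parse the HTML again if the original tree is still needed.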