Given an HTML paragraph and a link, is there a way to retrieve the text before and the text after the link inside the paragraph in Python?

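I am using urllib3 to fetch the HTML of some pages.

I want to extract the text from the paragraph that contains the link, storing the text before and the text after the link separately.

For example: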

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('get', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            print(link_text)
            print(link_para)

Paragraph

<p>The message quoted above about Michael Novenche, a two-year-old boy 
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with 
all the changes in his condition proved a challenge.  The message quoted above 
stated that Michael had a large tumor in his brain, was operated upon to 
remove part of the tumor, and needed prayers to help him through chemotherapy 
to a full recovery.  An <nobr>October 2000</nobr> article in <a 
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general" 
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany 
Weekly</i></a> didn’t mention anything about little Michael’s medical 
condition but said that his family was “in need of funds to help pay for the
 transportation to the hospital and other costs not covered by their 
insurance.”  A June 2000 message posted to the <a 
href="http://www.ecunet.org/whatisecupage.html" 
onmouseout="window.status='';return true" 
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a> 
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old, 
mentioned that his tumor appeared to be shrinking, and provided a mailing 
address for him:</p>

Link

<a href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The 
Local Albany Weekly';return true" target="_blank"><i>The Local Albany 
Weekly</i></a>

Text to retrieve (2 parts)

The message quoted above about Michael Novenche, a two-year-old boy 
undergoing chemotherapy ... was operated upon to 
remove part of the tumor, and needed prayers to help him through chemotherapy 
to a full recovery.  An October 2000 article in
didn’t mention anything about little Michael’s medical 
condition but said that his family was ... turned 3 years old, 
mentioned that his tumor appeared to be shrinking, and provided a mailing 
address for him:

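I can't simply call get_text() and then use split, as the link text might be repeated.

I thought I might just add a counter to see how many times the link text is repeated, use split(), and then use a loop to get the parts I want.

However, I was hoping for a better, cleaner way.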
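To make the repeated-text problem concrete, here is a minimal sketch with a made-up paragraph. The HTML, the example.com URL, and the use of 'html.parser' are assumptions for illustration only and do not come from the page above:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph in which the link text also occurs as plain text.
html = '<p>Read Ecunet news on <a href="http://example.com/">Ecunet</a> today.</p>'
para = BeautifulSoup(html, 'html.parser').p
link_text = para.a.get_text()

# Splitting the flattened text on the link text cuts at every occurrence,
# so three pieces come back instead of the two that are wanted.
pieces = para.get_text().split(link_text)
print(pieces)
```

With this input the split yields three pieces, and nothing in the flattened text says which occurrence was the actual link.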

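You can iterate over the contents of the a tag's parent and check whether each item is our a tag. When it is, one part is complete and we start building the next: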

data = '''<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge.  The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery.  An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
 transportation to the hospital and other costs not covered by their
insurance.”  A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

link_url='http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general'
a = soup.find('a', href=link_url)

s, parts = '', []
for t in a.parent.contents:
    if t is a:  # identity check: this exact tag, not just one that compares equal
        parts += [s]
        s = ''
        continue
    s += str(t)
parts += [s]

for part in parts:
    print(BeautifulSoup(part, 'lxml').body.text.strip())
    print('*' * 80)

Prints:

The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge.  The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery.  An October 2000 article in
********************************************************************************
didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
 transportation to the hospital and other costs not covered by their
insurance.”  A June 2000 message posted to the Ecunet
mailing list indicated that Michael had just turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
********************************************************************************
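When the link is a direct child of the paragraph, the same split can probably also be done without re-parsing, by walking the link's siblings. This is only a sketch under that assumption. The sample HTML, the example.com URL, the helper name node_text, and the 'html.parser' choice are all made up for illustration:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical paragraph; the link is a direct child of <p>.
html = ('<p>An <nobr>October 2000</nobr> article in '
        '<a href="http://example.com/x"><i>The Weekly</i></a>'
        ' said <b>nothing</b> new.</p>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href='http://example.com/x')

def node_text(node):
    # Tags are flattened with get_text(); string nodes are used as-is.
    return node.get_text() if isinstance(node, Tag) else str(node)

# previous_siblings yields the nearest sibling first, so reverse it
# to restore document order.
before = ''.join(node_text(n) for n in reversed(list(a.previous_siblings)))
after = ''.join(node_text(n) for n in a.next_siblings)

print(before.strip())
print(after.strip())
```

Because only siblings of the link are visited, this does not cover a link nested inside another tag within the paragraph.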

Can you clarify what you mean by:

I cant simply get_text() then use split as the link text might be repeated

When I run:

import urllib3
from bs4 import BeautifulSoup
import certifi

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            print(link_para.get_text())

I get:

The message quoted above about Michael Novenche, a two-year-old boy undergoing chemotherapy to treat a brain tumor, was real, but keeping up with all the changes in his condition proved a challenge. The message quoted above stated that Michael had a large tumor in his brain, was operated upon to remove part of the tumor, and needed prayers to help him through chemotherapy to a full recovery. An October 2000 article in The Local Albany Weekly didn’t mention anything about little Michael’s medical condition but said that his family was “in need of funds to help pay for the transportation to the hospital and other costs not covered by their insurance.” A June 2000 message posted to the Ecunet mailing list indicated that Michael had just turned 3 years old, mentioned that his tumor appeared to be shrinking, and provided a mailing address for him:

The text is divided by 'The Local Albany Weekly', which is the name of the link. So why not get the link name and split on that?

import urllib3
import certifi
from bs4 import BeautifulSoup

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.find_all('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_para = a.find_parent("p")
            # change the link's text to something unique, so the split cannot
            # hit a repeated phrase elsewhere in the paragraph
            # (.string is None if the link has more than one child node)
            a.string.replace_with('ooqieri')
            name_link = a.get_text()
            full_text = link_para.get_text().split(name_link)
            print(full_text)

Gives:

['The message quoted above about Michael Novenche, a two-year-old boy undergoing chemotherapy to treat a brain tumor, was real, but keeping up with all the changes in his condition proved a challenge. The message quoted above stated that Michael had a large tumor in his brain, was operated upon to remove part of the tumor, and needed prayers to help him through chemotherapy to a full recovery. An October 2000 article in ', ' didn’t mention anything about little Michael’s medical condition but said that his family was “in need of funds to help pay for the transportation to the hospital and other costs not covered by their insurance.” A June 2000 message posted to the Ecunet mailing list indicated that Michael had just turned 3 years old, mentioned that his tumor appeared to be shrinking, and provided a mailing address for him:']
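A variation on the same placeholder idea is to swap the whole link for a random marker and split exactly once. This is a sketch, not a tested recipe for the live page. The HTML and URL are hypothetical, 'html.parser' is used only to avoid the lxml dependency, and replace_with changes the parsed tree:

```python
import uuid
from bs4 import BeautifulSoup, NavigableString

# Hypothetical paragraph where the link text ("Ecunet") is repeated.
html = ('<p>Ask Ecunet: the <a href="http://example.com/ecunet">Ecunet</a> '
        'list has details.</p>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href='http://example.com/ecunet')
para = a.find_parent('p')

# A random hex marker is very unlikely to occur in real page text.
marker = uuid.uuid4().hex
a.replace_with(NavigableString(marker))  # note: mutates the soup

# partition() splits on the first occurrence only and always returns 3 items.
before, _, after = para.get_text().partition(marker)
print(repr(before), repr(after))
```

Using partition rather than split means an empty side simply comes back as an empty string.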

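You can do this easily with bs4 4.7.1. Use :has and an attribute = value selector to get the parent p tag, then split its HTML on the a tag's HTML. Then re-parse each piece with bs and take the text of its p tag. This gets around the potential repeated-phrase problem. A problem would arise only if the a tag's entire HTML could be repeated within the block, which seems unlikely.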

import requests
from bs4 import BeautifulSoup as bs

url = 'http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general'
r = requests.get('https://www.snopes.com/fact-check/michael-novenche/')
soup = bs(r.content, 'lxml')
# the <p> that has the link as a direct child, and the link itself
para = soup.select_one(f'p:has(>[href="{url}"])')
link = soup.select_one(f'[href="{url}"]')
# split the paragraph's inner HTML on the link's inner HTML, then re-parse each piece
data = para.encode_contents().split(link.encode_contents())
items = [bs(i, 'lxml').select_one('p').text for i in data]
print(items)
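The same selector idea can be tried offline on a small fragment. The sketch below is an assumption-laden variant, not the answer's exact method: it splits on the link's full serialized HTML via str(link), uses partition instead of split, and parses with 'html.parser'. The HTML and URL are invented:

```python
from bs4 import BeautifulSoup

url = 'http://example.com/weekly'  # hypothetical link target
html = (f'<div><p>Intro.</p><p>An article in <a href="{url}"><i>The Weekly</i></a>'
        ' said little.</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# :has(> a[...]) keeps only the <p> whose direct child is the wanted link.
para = soup.select_one(f'p:has(> a[href="{url}"])')
link = para.select_one(f'a[href="{url}"]')

# Split the paragraph's inner HTML once, on the link's serialized HTML.
before_html, _, after_html = para.decode_contents().partition(str(link))

# Re-parse each side to strip any remaining markup.
before = BeautifulSoup(before_html, 'html.parser').get_text()
after = BeautifulSoup(after_html, 'html.parser').get_text()
print(repr(before), repr(after))
```

This relies on the link serializing identically on its own and inside its parent, which appears to hold when both come from the same parsed tree.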

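I arrived at a solution based on @Andrej Kesely's answer.

It handles two problems:

  1. There is no text before/after the link

  2. The link is not a direct child of the paragraph

Here it is (as a function):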

import urllib3
from bs4 import BeautifulSoup

# connection pool used by get_info below
http = urllib3.PoolManager()

def get_info(page, link):
    r = http.request('GET', page)
    body = r.data
    soup = BeautifulSoup(body, 'lxml')
    a = soup.find('a', href=link)
    s, parts = '', []

    if a.parent.name=="p":
        for t in a.parent.contents:
            if t == a:
                parts += [s]
                s = ''
                continue
            s += str(t)
        parts += [s]
    else:
        prnt = a.find_parents("p")[0]
        for t in prnt.contents:
            if t == a or (str(a) in str(t)):
                parts+=[s]
                s=''
                continue
            s+=str(t)
        parts+=[s]

    # .body can be None when a part is empty (no text on that side of the
    # link), which raises AttributeError; fall back to an empty string
    try:
        text_before_link = BeautifulSoup(parts[0], 'lxml').body.text.strip()
    except AttributeError:
        text_before_link = ""

    try:
        text_after_link = BeautifulSoup(parts[1], 'lxml').body.text.strip()
    except AttributeError:
        text_after_link = ""

    return text_before_link, text_after_link

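This assumes that no paragraph is nested inside another paragraph.

If anyone has ideas about scenarios where this fails, please feel free to raise them.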
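One way that might sidestep both cases without re-parsing is to walk the paragraph's descendants in document order and sort each string into "before" or "after" depending on whether the link has been passed yet. This is a minimal sketch. The function name split_around_link, the sample HTML, and the 'html.parser' choice are assumptions for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

def split_around_link(para, a):
    """Return (text_before, text_after) for link `a` inside paragraph `para`."""
    before, after = [], []
    seen_link = False
    for node in para.descendants:  # document order
        if node is a:
            seen_link = True
            continue
        if isinstance(node, NavigableString):
            # Skip strings that belong to the link itself.
            if any(parent is a for parent in node.parents):
                continue
            (after if seen_link else before).append(str(node))
    return ''.join(before), ''.join(after)

# Hypothetical paragraph where the link is nested inside <b>.
html = ('<p>See <b>the <a href="http://example.com/x"><i>Weekly</i></a> story</b>'
        ' for details.</p>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href='http://example.com/x')
before, after = split_around_link(a.find_parent('p'), a)
print(repr(before), repr(after))
```

An empty side comes back as an empty string, so no exception handling is needed for that case. HTML comments are also string nodes in Beautiful Soup, so this sketch would include their text unless they are filtered out.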