How to parse the string after a <br/> tag (a date value in this scenario) using Python and BeautifulSoup

Currently, I am trying to scrape web page content using Python and BeautifulSoup.

After the first code block runs, I get the following result:

<div class="some class name">
    <div>
        <h3>Situation reports January 2020</h3>
        <p>
            <a target="_blank" href="/docs/default-source/coronaviruse/situation-reports/20200802-covid-19-sitrep-195.pdf?sfvrsn=5e5da0c5_2">
                <strong>Situation report - 1</strong>
            </a>
            <br>Coronavirus&nbsp;disease 2019 (COVID-19)&nbsp;
            <br>21 January 2020
        </p>
    </div>
</div>

After running the step 2 code, the result is as follows:

<p>
    <a href="/docs/default-source/coronaviruse/situation-reports/20200121-sitrep-1-2019-ncov.pdf?sfvrsn=20a99c10_4" target="_blank">
        <strong>Situation report - 1</strong>
    </a>
    <br/>Novel Coronavirus (2019-nCoV)
    <br/>21 January 2020
</p>

I am able to get everything except the date, 21 January 2020, which comes after the last <br/> tag.

The step 2 code is as follows:

all_items = contentpage.find_all('div', attrs = {'class': 'sf-content-block content-block'})

rowarray_list = []

for items in all_items:
#    print(items, end='\n'*10)
    situation_report = items.find("h3")
    if situation_report is not None:
        situation_report = situation_report.text

        more_items = items.find_all('div')
        for single_item in more_items:
#            print(single_item, end='\n'*10)
            child_item = single_item.find_all('p')
#            print(single_item.getText(), end='\n'*2)
#            print(single_item.next_element, end='\n'*2)
            
            for child in child_item:
                print(child.next_sibling, end='\n'*2)

I then wrote the following code:

br_item = child.find_all('br')
for br in br_item:
    temp = br.next_sibling
    print(temp, end='\n'*2)

The output I got was:

What I want is to get only the date value. Please help!

It looks like you only need the last element in each 'p' tag. Try this:

for i in soup.find_all('div', attrs={'class':'sf-content-block content-block'}):
    if i.find('p'):
        print(i.find('p').contents[-1])
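To see why taking the last child works, here is a minimal offline sketch that runs the same idea against the <p> snippet from the question instead of the live page. The names SAMPLE_HTML and last_text_in_paragraph are made up for this illustration, and the extra strip() is an assumption on my part: in pretty-printed markup the last child usually carries trailing whitespace.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, so no network access is needed.
SAMPLE_HTML = """
<p>
    <a href="/docs/default-source/coronaviruse/situation-reports/20200121-sitrep-1-2019-ncov.pdf?sfvrsn=20a99c10_4" target="_blank">
        <strong>Situation report - 1</strong>
    </a>
    <br/>Novel Coronavirus (2019-nCoV)
    <br/>21 January 2020
</p>
"""


def last_text_in_paragraph(p):
    """Hypothetical helper: return the last child of a <p> as stripped text."""
    # contents is the list of direct children (tags and text nodes).
    last = p.contents[-1]
    # str() works for text nodes; for a tag it would return its markup.
    return str(last).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
date_text = last_text_in_paragraph(soup.find("p"))
print(date_text)
```

Note that this relies on the date being the very last node inside the paragraph. If a tag or another line follows it, contents[-1] returns that instead.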

Try:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports')
soup = BeautifulSoup(html.text, 'html.parser')

# Select every <br> that directly follows another <br> inside a report paragraph;
# the text node right after that second <br> is the date.
for br in soup.select('div.sf-content-block.content-block div p br + br'):
    text = br.find_next(string=True)
    print(text.strip())

Prints:

2 August 2020
1 August 2020
31 July 2020
30 July 2020
29 July 2020
28 July 2020
27 July 2020
26 July 2020
25 July 2020
24 July 2020
23 July 2020

...and so on.
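The selector approach can also be checked without hitting the live site. The sketch below is only an illustration: SAMPLE_HTML is assembled from the snippets in the question, the class name is taken from the selector in this answer, and the href is shortened to a placeholder. In CSS, `br + br` matches a <br> whose previous element sibling is also a <br>. Text nodes between them are ignored by the combinator, which is why it picks the second <br> here.

```python
from bs4 import BeautifulSoup

# Markup assembled from the question's snippets; the href is a placeholder.
SAMPLE_HTML = """
<div class="sf-content-block content-block">
    <div>
        <h3>Situation reports January 2020</h3>
        <p>
            <a target="_blank" href="#"><strong>Situation report - 1</strong></a>
            <br>Novel Coronavirus (2019-nCoV)
            <br>21 January 2020
        </p>
    </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

dates = []
# "br + br" matches only a <br> immediately preceded by another <br> element.
for br in soup.select("div.sf-content-block.content-block div p br + br"):
    # find_next(string=True) returns the first text node after this <br>.
    text = br.find_next(string=True)
    dates.append(text.strip())

print(dates)
```

This depends on each report having exactly two <br> tags before the date. A paragraph with one <br> would be skipped, and one with three would yield two matches.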

Another solution:

import requests
from bs4 import BeautifulSoup


url = 'https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for block in soup.select('p:has(>strong, >a)'):
    print(block.get_text(strip=True, separator='|').split('|')[-1])

Prints:

2 August 2020
1 August 2020
31 July 2020
30 July 2020
29 July 2020
...and so on.
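As a variation on this last approach, here is a hedged sketch that uses stripped_strings in place of the separator-and-split trick, so a '|' character in the text cannot break it, and then converts the string into a date object. The names SAMPLE_HTML and last_line_of are invented for the example. The format string "%d %B %Y" is an assumption based on the dates shown above, and it expects English month names (the default unless the locale has been changed).

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Sample markup copied from the question, so no network access is needed.
SAMPLE_HTML = """
<p>
    <a href="/docs/default-source/coronaviruse/situation-reports/20200121-sitrep-1-2019-ncov.pdf?sfvrsn=20a99c10_4" target="_blank">
        <strong>Situation report - 1</strong>
    </a>
    <br/>Novel Coronavirus (2019-nCoV)
    <br/>21 January 2020
</p>
"""


def last_line_of(tag):
    """Hypothetical helper: last non-empty text fragment inside a tag."""
    # stripped_strings yields every text node, stripped, skipping empty ones.
    return list(tag.stripped_strings)[-1]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selector idea as above: paragraphs with a direct <strong> or <a> child.
raw_dates = [last_line_of(p) for p in soup.select("p:has(> strong, > a)")]

# Assumed format: day, full English month name, four-digit year.
parsed_dates = [datetime.strptime(d, "%d %B %Y").date() for d in raw_dates]

print(raw_dates)
print(parsed_dates)
```

Parsing into a date also acts as a sanity check: if the last line of some paragraph is not a date, strptime raises ValueError instead of silently passing bad text along.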