Using BeautifulSoup to get tags and text

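I have been trying for a while now, but I am stuck. My website has the following structure (unfortunately I only have a screenshot; for some reason I can't copy and paste the code...)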

Edit: sorry, of course, here is one of the URLs:

https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system

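I find the div with class="field field--text_default field--body" and so on. I want to store everything inside 'strong' or "h4" as dataframe column names (got that part), together with the corresponding text. I partly succeeded: I only lose the content of the second <p> tag under "Project Objective", and I am completely lost with the text between "Partners" and the <br> tags. This is what I did: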

content = soup.find_all('div', class_='field field--text_default field--body')

# For the headers:
headers = content[0].find_all(["strong","h4"])
col_names = []
for header in headers:
    col_names.append(header.text)

# and for the content:
con = []
divs = content[0].find_all(["strong", "h4"])
for el in divs:
    con.append(el.next_sibling)
# filter the list built above (the original referenced an undefined name here)
con = [el.string for el in con if el is not None]

Following furas's suggestion and working with children, I found the following partial solution:

import bs4

headers, inhalt = [], []
tags = content[0].find_all(["p", "h4"])
for tag in tags:
    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            # a <strong> inside the paragraph is a header
            if child.name == "strong":
                headers.append(child.get_text().strip(": "))
        elif isinstance(child, bs4.element.NavigableString):
            # the <h4> headings show up as plain strings
            if child in ("Project Objective", "Project Impact", "Contacts"):
                headers.append(child)
            else:
                inhalt.append(child)

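Unfortunately, I have to put three children under one header in one case and two children in another. The three always start with "--", so that shouldn't be too hard, but how do I put two separate <p> into one cell?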

This is a modification of @Sebastian's version.

I put everything in one list, data, as pairs (header, text), but I don't add to this list directly.

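When I find a header, I keep it in a separate variable, header. When I find text, I also keep it in a separate list, text. Only when I find the next header do I add the previous header, text pair to data. At the end I have to add the last header, text pair to data. I also use header = None to recognize whether I have already found the first header, so that I don't add an empty header, text pair.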

Because I keep all text as a list, I can decide later whether to display it on one line or on separate lines (for example, the -- items in Partners).
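The bookkeeping described above can be sketched without any BeautifulSoup details. This is only an illustration: group_pairs and the ("header", ...) / ("text", ...) event tuples are hypothetical names introduced here, standing in for what the parsing loop finds, and the sample values are made up.

```python
def group_pairs(events):
    """Group a flat stream of ("header", s) / ("text", s) events
    into [header, [texts]] pairs."""
    data = []
    header = None   # no header found yet
    text = []       # all text found after the last header

    for kind, value in events:
        if kind == "header":
            if header is not None:   # nothing to store before the first header
                data.append([header, text])
            header, text = value, []
        elif header is not None:     # text before the first header is dropped
            text.append(value)

    if header is not None:           # store the last pair
        data.append([header, text])
    return data


# Hypothetical events, shaped like what the parsing loop would produce.
events = [
    ("header", "Lead Performer"), ("text", "Some Center"),
    ("header", "Partners"), ("text", "-- Partner A"), ("text", "-- Partner B"),
]
data = group_pairs(events)

# Because each text is still a list, the joining is decided at display time.
one_line = " ".join(data[1][1])
separate = "\n".join(data[1][1])
print(one_line)
print(separate)
```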

I also added code for <a> to get the email address. I was thinking about adding code for <br>.
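A minimal sketch of what such <br> handling might look like. It runs on a made-up inline snippet that imitates the Partners paragraph, not on the live page. The helper name text_with_breaks is hypothetical, and the real markup may differ from what is assumed here.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet imitating the "Partners" paragraph; not copied from the page.
html = "<p><strong>Partners:</strong><br>-- Partner A<br>-- Partner B</p>"


def text_with_breaks(p):
    """Collect the text of one <p>, skipping <strong> and starting
    a new entry at every <br>."""
    lines = [""]
    for child in p.children:
        if isinstance(child, Tag):
            if child.name == "br":
                lines.append("")              # <br> starts a new entry
            elif child.name != "strong":
                lines[-1] += child.get_text() # e.g. the text of an <a>
        elif isinstance(child, NavigableString):
            lines[-1] += str(child)
    # drop entries that are empty after stripping whitespace
    return [line.strip() for line in lines if line.strip()]


soup = BeautifulSoup(html, "html.parser")
partners = text_with_breaks(soup.p)
print(partners)
```

The full script follows, with the <br> branch still commented out.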
import requests
import bs4
from bs4 import BeautifulSoup as BS

url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

content = soup.find_all('div', class_='field field--text_default field--body')
#print(content)

data = []   # list for pairs `(header, text)`

header = None  # last found `header`
text = []      # all text found after last `header`


all_tags = content[0].find_all(["p","h4"])

for tag in all_tags:

    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            # compare with `==`; `in "strong"` would be a substring test
            if child.name == "strong":
                # store the previous `header + text`
                if header is not None:  # nothing to store before the first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.get_text().strip(": ")
                text = []

            #if child.name == "br":
            #    text.append('\n')

            if child.name == "a":
                text.append(child.get_text().strip())

        if isinstance(child, bs4.element.NavigableString):
            if child in ("Project Objective", "Project Impact", "Contacts"):
                # store the previous `header + text`
                if header is not None:  # nothing to store before the first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.strip()
                text = []
            else:
                # remember `text`
                text.append(child.strip())

# add the last `header + text`
if header is not None:  # nothing to add if no header was ever found
    data.append( [header, text] )

# --- display ---

print('len(data):', len(data), '\n')

for header, text in data:
    print('header:', header)
    print('--- text ---')
    #print(' '.join(text).strip('\n'))
    if header == 'Partners':
        print('\n'.join(text))
    else:        
        print(' '.join(text))
    print('====================================')

Result:

Only the header Contacts is empty, because its elements end up under the headers DOE Technology Manager and Lead Performer.

len(data): 11 

header: Lead Performer
--- text ---
Cold Climate Housing Research Center – Fairbanks, AK
====================================
header: Partners
--- text ---
-- Panasonic Corp. – Newark, NJ
-- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
-- National Renewable Energy Laboratory, Golden, CO
====================================
header: DOE Total Funding
--- text ---
5,161
====================================
header: Cost Share
--- text ---
,293
====================================
header: Project Term
--- text ---
July 2020 – May 2022
====================================
header: Funding Type
--- text ---
Advanced Building Construction FOA Award
====================================
header: Project Objective
--- text ---
Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
====================================
header: Project Impact
--- text ---
The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
====================================
header: Contacts
--- text ---

====================================
header: DOE Technology Manager
--- text ---
Marc LaFrance, Marc.Lafrance@ee.doe.gov 
====================================
header: Lead Performer
--- text ---
Bruno Grunau, Cold Climate Housing Research Center
====================================
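Since the question asks for the headers as dataframe column names, here is one possible way to turn the pairs into a single row. It is a sketch under assumptions. The helper pairs_to_row is a hypothetical name, and the sample values are shortened or made up. Suffixing a repeated header such as Lead Performer with a counter is just one choice for keeping the keys unique.

```python
def pairs_to_row(data, sep=" "):
    """Turn [header, [texts]] pairs into one dict, suffixing repeated
    headers with a counter so that no key is overwritten."""
    row = {}
    seen = {}
    for header, text in data:
        seen[header] = seen.get(header, 0) + 1
        key = header if seen[header] == 1 else f"{header} ({seen[header]})"
        row[key] = sep.join(text).strip()
    return row


# Shortened, partly made-up sample in the same shape as `data` above.
sample = [
    ["Lead Performer", ["Some Center"]],
    ["Partners", ["-- Partner A", "-- Partner B"]],
    ["Contacts", []],
    ["Lead Performer", ["Some Person, Some Center"]],
]
row = pairs_to_row(sample)
print(list(row))
```

If pandas is installed, pandas.DataFrame([row]) should then give a one-row frame with these keys as column names. Passing sep="\n" keeps the Partners entries on separate lines inside the cell.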