Specifying an HTML nested structure in Python

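I'm trying to scrape data from the following site: https://www.ecfr.gov/on/2022-04-08/title-21/chapter-I/subchapter-E/part-556/subpart-B/section-556.50

Note that there is a nested structure (Tolerances -> Cattle -> Liver and muscle). This is also just one of many sections of the legislation.

There is a "Developer Tools" option, but I can't preserve the nested structure with it: https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50

I'd like to convert this HTML into a pandas DataFrame while keeping the nested structure. For example: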

h4 Indent-2 Indent-3
Amprolium (1) Cattle (i) Liver, kidney, and muscle: 0.5 ppm.

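The problem is that class "indent-3" should be nested inside "indent-2", and "indent-2" should be nested inside "h4". I can build the data I want by specifying each class name, but if I want to loop over the sections, I don't want to have to specify every class name.

Is there a more general way (without specifying class names) to generate the DataFrame? Here is my code so far.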

import requests
import pandas as pd  # needed for pd.DataFrame below
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content,"xml")
df =pd.DataFrame()

title = soup.find("h4").text
id2 = soup.find("div", attrs = {"id":"p-556.50(b)(1)"}).find(attrs = {"class":"indent-2"}).text
id3 = soup.find("div", attrs = {"id":"p-556.50(b)(1)(i)"}).find(attrs = {"class":"indent-3"}).text

df = pd.DataFrame(data = {"h4":[title],
                          "indent-2":[id2],
                          "indent-3":[id3]})

Are you looking for something like this?

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content,"xml")

title = soup.find("h4").text
indents = soup.find_all(attrs = {'class':re.compile("^indent-")})

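# One list of texts per indent level, in document order
row = {'h4': [title]}

for indent in indents:
    # With the "xml" parser, indent['class'] is a plain string such as "indent-2"
    key = 'indent-' + indent['class'].split('-')[-1]
    row.setdefault(key, []).append(indent.text.strip())

# The columns have different lengths, so build one frame per column and align them on the index
df = pd.concat([pd.DataFrame({k: v}) for k, v in row.items()], axis=1)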

Output:

print(df.to_string())
                    h4                                                                             indent-1                   indent-2                                 indent-3                indent-4
0  § 556.50 Amprolium.                                                                       (a) [Reserved]                (1) Cattle.  (i) Liver, kidney, and muscle: 0.5 ppm.   (A) Egg yolks: 8 ppm.
1                  NaN                                   (b) Tolerances.  The tolerances for amprolium are:  (2) Chickens and turkeys.                       (ii) Fat: 2.0 ppm.  (B) Whole eggs: 4 ppm.
2                  NaN  (c) Related conditions of use.  See §§ 520.100, 558.55, and 558.58 of this chapter.             (3) Pheasants.             (i) Liver and kidney: 1 ppm.                     NaN
3                  NaN                                                                                  NaN                        NaN                    (ii) Muscle: 0.5 ppm.                     NaN
4                  NaN                                                                                  NaN                        NaN                              (iii) Eggs:                     NaN
5                  NaN                                                                                  NaN                        NaN                        (i) Liver: 1 ppm.                     NaN
6                  NaN                                                                                  NaN                        NaN                    (ii) Muscle: 0.5 ppm.                     NaN
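
Note that in the output above each column is filled independently, so a row does not tie an indent-3 entry to its indent-2 parent. If that link matters, one possible approach is to walk the matched elements in document order and remember the latest text seen at each level. The following is only a sketch: `rows_from_levels` is a hypothetical helper, not part of pandas or BeautifulSoup, and the sample list is hand-written from the values shown above rather than fetched from the site.

```python
def rows_from_levels(items):
    """Turn (level, text) pairs in document order into one row per leaf.

    A leaf is an item that is not followed by a deeper item. Each row maps
    level -> text for the leaf and for all of its current ancestors.
    """
    rows = []
    current = {}  # level -> most recent text seen at that level
    for pos, (level, text) in enumerate(items):
        # A new item at this level discards anything at the same or deeper levels.
        current = {lvl: txt for lvl, txt in current.items() if lvl < level}
        current[level] = text
        next_level = items[pos + 1][0] if pos + 1 < len(items) else 0
        if next_level <= level:  # nothing nested below this item: emit a row
            rows.append(dict(current))
    return rows


# Hand-written sample (an assumption), mirroring part of section 556.50(b)
sample = [
    (1, "(b) Tolerances."),
    (2, "(1) Cattle."),
    (3, "(i) Liver, kidney, and muscle: 0.5 ppm."),
    (3, "(ii) Fat: 2.0 ppm."),
    (2, "(2) Chickens and turkeys."),
    (3, "(iii) Eggs:"),
    (4, "(A) Egg yolks: 8 ppm."),
]
rows = rows_from_levels(sample)
```

With the code above, the pairs could presumably be built as `(int(indent['class'].split('-')[-1]), indent.text.strip())` for each element in `indents`, and `pd.DataFrame(rows)` would then give one column per level, with NaN where a level is absent.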

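To walk through nested divs, the idea is to iterate over each element's children. @chitown88's answer may well solve your problem and looks cleaner, but here is an answer that uses findChildren() and nested loops.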

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
df = pd.DataFrame()

title = soup.find("h4").text
id2 = []
id3 = []

# Use section_id rather than id to avoid shadowing the built-in id()
section_id = soup.find('div', {"class": "section"}).get('id')
divGlobal = soup.find('div', {'id': 'p-' + section_id + "(b)"})

for lvl1 in divGlobal.findChildren("div", recursive=False):  # (1) level

    for lvl2 in lvl1.findChildren("div", recursive=False):  # (i) level

        if len(lvl2.findChildren("div", recursive=False)) > 0:
            for lvl3 in lvl2.findChildren("div", recursive=False):  # (A) level (eggs in this example)
                id2.append(lvl1.findChildren("p")[0].text)
                id3.append(lvl3.findChildren("p")[0].text)

        else:
            id2.append(lvl1.findChildren("p")[0].text)
            id3.append(lvl2.findChildren("p")[0].text)


df = pd.DataFrame(
    {"h4": [title for i in range(len(id2))],
     "indent-2": id2,
     "indent-3": id3
     }
)

Sorry for the poor variable names; I don't know what your data represents.

Output:

                    h4                    indent-2                                  indent-3
0  § 556.50 Amprolium.                (1) Cattle.   (i) Liver, kidney, and muscle: 0.5 ppm. 
1  § 556.50 Amprolium.                (1) Cattle.                        (ii) Fat: 2.0 ppm. 
2  § 556.50 Amprolium.  (2) Chickens and turkeys.              (i) Liver and kidney: 1 ppm. 
3  § 556.50 Amprolium.  (2) Chickens and turkeys.                     (ii) Muscle: 0.5 ppm. 
4  § 556.50 Amprolium.  (2) Chickens and turkeys.                     (A) Egg yolks: 8 ppm. 
5  § 556.50 Amprolium.  (2) Chickens and turkeys.                    (B) Whole eggs: 4 ppm. 
6  § 556.50 Amprolium.             (3) Pheasants.                         (i) Liver: 1 ppm. 
7  § 556.50 Amprolium.             (3) Pheasants.                     (ii) Muscle: 0.5 ppm.
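
The nested loops above are tied to a fixed depth. A recursive walk might generalize them to any depth without naming classes. The sketch below uses the standard library's `xml.etree.ElementTree` on a hand-written, simplified snippet, so it runs without a network call. The markup in `SAMPLE` and the helper `leaf_paths` are assumptions for illustration; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each <div> holds one <p> plus child <div>s.
SAMPLE = """
<div id="b">
  <p>(b) Tolerances.</p>
  <div><p>(1) Cattle.</p>
    <div><p>(i) Liver, kidney, and muscle: 0.5 ppm.</p></div>
    <div><p>(ii) Fat: 2.0 ppm.</p></div>
  </div>
  <div><p>(2) Chickens and turkeys.</p>
    <div><p>(iii) Eggs:</p>
      <div><p>(A) Egg yolks: 8 ppm.</p></div>
      <div><p>(B) Whole eggs: 4 ppm.</p></div>
    </div>
  </div>
</div>
"""


def leaf_paths(div, ancestors=()):
    """Yield a tuple of paragraph texts (root to leaf) for every leaf div."""
    p = div.find("p")  # direct child <p> only
    text = "".join(p.itertext()).strip() if p is not None else ""
    path = ancestors + (text,)
    children = div.findall("div")  # direct child <div>s only
    if not children:
        yield path  # no nested divs: this is a leaf
    for child in children:
        yield from leaf_paths(child, path)


paths = list(leaf_paths(ET.fromstring(SAMPLE.strip())))
```

Each tuple in `paths` has one entry per nesting level, so `pd.DataFrame(paths)` would pad shorter paths with missing values. With BeautifulSoup, the same shape should be achievable with `find("p", recursive=False)` and `find_all("div", recursive=False)`.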