Specifying an HTML nested structure in Python
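I am trying to scrape data from the following website:
https://www.ecfr.gov/on/2022-04-08/title-21/chapter-I/subchapter-E/part-556/subpart-B/section-556.50
Note that there is a nested structure (Tolerances -> Cattle -> Liver and muscle). This is also just one of many sections in this regulation.
There is a "Developer Tools" option, but I could not preserve the nested structure with it:
https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50
I would like to convert this HTML into a pandas DataFrame while preserving the nested structure. For example: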
h4 | Indent-2 | Indent-3 |
---|---|---|
Amprolium | (1) Cattle | (i) Liver, kidney, and muscle: 0.5 ppm. |
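The problem is that the class "indent-3" should be nested inside "indent-2", and "indent-2" should be nested inside "h4". I can build the data I want by specifying every class name, but if I want to loop over all the sections, I do not want to have to spell out each class name.
Is there a more general way (without specifying class names) to produce the DataFrame? Here is my code so far.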
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")

# Each level is looked up by its hard-coded id and class name.
title = soup.find("h4").text
id2 = soup.find("div", attrs={"id": "p-556.50(b)(1)"}).find(attrs={"class": "indent-2"}).text
id3 = soup.find("div", attrs={"id": "p-556.50(b)(1)(i)"}).find(attrs={"class": "indent-3"}).text
df = pd.DataFrame(data={"h4": [title],
                        "indent-2": [id2],
                        "indent-3": [id3]})
Are you looking for something like this?
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")

title = soup.find("h4").text
# Match every element whose class starts with "indent-", whatever the level.
indents = soup.find_all(attrs={'class': re.compile("^indent-")})
row = {'h4': [title]}
for indent in indents:
    # With the "xml" parser, indent['class'] is a plain string, so split() works.
    key = 'indent-' + indent['class'].split('-')[-1]
    if key not in row:
        row[key] = []
    row[key].append(indent.text.strip())

# One column per indent level; shorter columns are padded with NaN.
df = pd.concat([pd.DataFrame({k: v}) for k, v in row.items()], axis=1)
Output:
print(df.to_string())
h4 indent-1 indent-2 indent-3 indent-4
0 § 556.50 Amprolium. (a) [Reserved] (1) Cattle. (i) Liver, kidney, and muscle: 0.5 ppm. (A) Egg yolks: 8 ppm.
1 NaN (b) Tolerances. The tolerances for amprolium are: (2) Chickens and turkeys. (ii) Fat: 2.0 ppm. (B) Whole eggs: 4 ppm.
2 NaN (c) Related conditions of use. See §§ 520.100, 558.55, and 558.58 of this chapter. (3) Pheasants. (i) Liver and kidney: 1 ppm. NaN
3 NaN NaN NaN (ii) Muscle: 0.5 ppm. NaN
4 NaN NaN NaN (iii) Eggs: NaN
5 NaN NaN NaN (i) Liver: 1 ppm. NaN
6 NaN NaN NaN (ii) Muscle: 0.5 ppm. NaN
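In the output above each indent level is collected into its own column, so a row no longer says which "(1)" a given "(i)" belongs to. One possible way to keep the parent of each item, still without naming any class, is to walk the elements in document order and remember the most recent text seen at each level. The sketch below is only an illustration: `rows_from_levels` is a hypothetical helper, and the `(level, text)` pairs are typed in by hand here; in practice they could be built from `indent['class']` and `indent.text` in the loop above.

```python
def rows_from_levels(pairs):
    """Turn document-ordered (level, text) pairs into one row per leaf item."""
    rows = []
    path = {}  # level -> text of the most recent element seen at that level
    for i, (level, text) in enumerate(pairs):
        # A new element at `level` replaces anything at the same or deeper level.
        for stale in [k for k in path if k >= level]:
            del path[stale]
        path[level] = text
        # An element is a leaf when the next element is not deeper than it.
        next_level = pairs[i + 1][0] if i + 1 < len(pairs) else 0
        if next_level <= level:
            rows.append(dict(path))
    return rows


# Hand-written sample in document order (assumed, for illustration only).
pairs = [
    (1, "(b) Tolerances."),
    (2, "(1) Cattle."),
    (3, "(i) Liver, kidney, and muscle: 0.5 ppm."),
    (3, "(ii) Fat: 2.0 ppm."),
    (2, "(2) Chickens and turkeys."),
    (3, "(iii) Eggs:"),
    (4, "(A) Egg yolks: 8 ppm."),
    (4, "(B) Whole eggs: 4 ppm."),
]
rows = rows_from_levels(pairs)
```

Each entry of `rows` is a dict keyed by level, so passing the list to `pd.DataFrame(rows)` should give one column per level, with NaN where a branch is shallower than the deepest one.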
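To navigate nested divs, the idea is to work with each element's children. @chitown88's answer may well solve your problem and looks cleaner, but here is an answer that uses findChildren() and nested loops.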
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")

title = soup.find("h4").text
id2 = []
id3 = []
# The section id (e.g. "556.50") is read from the page instead of being hard-coded.
section_id = soup.find('div', {"class": "section"}).get('id')
divGlobal = soup.find('div', {'id': 'p-' + section_id + "(b)"})
for lvl1 in divGlobal.findChildren("div", recursive=False):  # (1) level
    for lvl2 in lvl1.findChildren("div", recursive=False):  # (i) level
        if len(lvl2.findChildren("div", recursive=False)) > 0:
            # (A) level (eggs in this example): emit the children instead of the parent.
            for lvl3 in lvl2.findChildren("div", recursive=False):
                id2.append(lvl1.findChildren("p")[0].text)
                id3.append(lvl3.findChildren("p")[0].text)
        else:
            id2.append(lvl1.findChildren("p")[0].text)
            id3.append(lvl2.findChildren("p")[0].text)

df = pd.DataFrame(
    {"h4": [title] * len(id2),
     "indent-2": id2,
     "indent-3": id3
     }
)
Sorry for the poor variable names; I do not know what your data represents.
Output:
h4 indent-2 indent-3
0 § 556.50 Amprolium. (1) Cattle. (i) Liver, kidney, and muscle: 0.5 ppm.
1 § 556.50 Amprolium. (1) Cattle. (ii) Fat: 2.0 ppm.
2 § 556.50 Amprolium. (2) Chickens and turkeys. (i) Liver and kidney: 1 ppm.
3 § 556.50 Amprolium. (2) Chickens and turkeys. (ii) Muscle: 0.5 ppm.
4 § 556.50 Amprolium. (2) Chickens and turkeys. (A) Egg yolks: 8 ppm.
5 § 556.50 Amprolium. (2) Chickens and turkeys. (B) Whole eggs: 4 ppm.
6 § 556.50 Amprolium. (3) Pheasants. (i) Liver: 1 ppm.
7 § 556.50 Amprolium. (3) Pheasants. (ii) Muscle: 0.5 ppm.
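The nested loops above stop at three levels. If other sections nest deeper, the same children-based idea can be written recursively so that the depth does not have to be known in advance. The following is a minimal sketch, not a tested scraper for the real page: `walk` is a hypothetical helper, and the `SAMPLE` fragment is an assumption that each level is a `div` holding one `p` plus its child `div`s. It uses the built-in "html.parser" so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the nesting described in the question.
SAMPLE = """
<div id="p-b"><p class="indent-1">(b) Tolerances.</p>
  <div id="p-b-1"><p class="indent-2">(1) Cattle.</p>
    <div id="p-b-1-i"><p class="indent-3">(i) Liver, kidney, and muscle: 0.5 ppm.</p></div>
    <div id="p-b-1-ii"><p class="indent-3">(ii) Fat: 2.0 ppm.</p></div>
  </div>
  <div id="p-b-2"><p class="indent-2">(2) Chickens and turkeys.</p>
    <div id="p-b-2-iii"><p class="indent-3">(iii) Eggs:</p>
      <div id="p-b-2-iii-A"><p class="indent-4">(A) Egg yolks: 8 ppm.</p></div>
    </div>
  </div>
</div>
"""


def walk(div, path=()):
    """Yield one tuple of texts per leaf div, outermost parent first."""
    p = div.find("p", recursive=False)  # the paragraph owned by this div
    here = path + ((p.get_text(strip=True),) if p else ())
    children = div.find_all("div", recursive=False)  # direct child divs only
    if not children:
        yield here
    for child in children:
        yield from walk(child, here)


soup = BeautifulSoup(SAMPLE, "html.parser")
paths = list(walk(soup.find("div")))
```

Because no class name appears in `walk`, the same function could be tried on other sections. Each tuple in `paths` is one root-to-leaf chain, so `pd.DataFrame(paths)` should produce one column per depth, padded where a branch is shorter.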