BeautifulSoup: trouble getting nested span elements
I'm using Python and BS4. I can get the top entry from the page, but I want to get all of the entries.
cardAttr = soup.find(class_='box_card_attribute').find("span", {"class": False}).text
cardAttr = soup.select_one('span.box_card_attribute >span').text
Both of the above return only the first match, but when I try to use find_all instead I get an AttributeError. Here is a snippet of the HTML:
<div id="card_list" class="list">
  <div class="t_row c_normal">
    <div class="box_card_img">
      <img id="card_image_0_1" alt="Tri-Horned Dragon" title="Tri-Horned Dragon" class="none">
    </div>
    <dl class="flex_1">
      <dd class="box_card_name flex_1 top_set">
        <span class="card_ruby"></span>
        <span class="card_name">Tri-Horned Dragon</span>
      </dd>
      <dd class="icon flex_1 top_set">
        <div class="lr_icon rid rid_5" style="background-color:#e86d6d;color:#e86d6d">
          <p>SE</p>
          <span style="background-color:#fff4f4;border-color:#e86d6d;color:#e86d6d; ">
            Secret Rare
          </span>
        </div>
      </dd>
      <dd class="remove_btn top_set">
        <a href="javascript:void(0);" class="btn hex red" title="Remove this card from the list.">
          <span>X</span>
          <input type="hidden" class="lang" value="">
          <input type="hidden" class="cid" value="4711">
        </a>
      </dd>
      <dd class="box_card_spec flex_1">
        <span class="box_card_attribute">
          <img class="icon_img" src="external/image/parts/attribute/attribute_icon_dark.png" alt="DARK" title="DARK">
          <span>DARK</span>
        </span>
Currently I can grab the 'DARK' text, but I can't seem to get it for every card on the page the way I can with class=card_name.
In case it helps, this is the URL I'm looking at:
https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&pid=11101000&rp=99999
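A note on the AttributeError: it most likely comes from calling .find() or .text directly on the result of find_all(), which is a list-like ResultSet rather than a single tag. Below is a minimal sketch of the failure and two ways around it. SAMPLE_HTML is a made-up two-card fragment that mirrors the snippet above; it is not fetched from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical two-card sample that mirrors the structure of the snippet above.
SAMPLE_HTML = """
<div id="card_list" class="list">
  <div class="t_row c_normal">
    <span class="card_name">Tri-Horned Dragon</span>
    <span class="box_card_attribute">
      <img class="icon_img" alt="DARK" title="DARK">
      <span>DARK</span>
    </span>
  </div>
  <div class="t_row c_normal">
    <span class="card_name">Blue-Eyes White Dragon</span>
    <span class="box_card_attribute">
      <img class="icon_img" alt="LIGHT" title="LIGHT">
      <span>LIGHT</span>
    </span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns a ResultSet; calling .find() on the collection itself
# raises AttributeError, because .find() only exists on individual tags.
try:
    soup.find_all(class_="box_card_attribute").find("span")
    raised = False
except AttributeError:
    raised = True

# Option 1: loop over the outer tags and call .find() on each one.
attrs_loop = [
    outer.find("span", {"class": False}).get_text(strip=True)
    for outer in soup.find_all(class_="box_card_attribute")
]

# Option 2: let a CSS selector match every nested span in one call.
attrs_css = [
    span.get_text(strip=True)
    for span in soup.select("span.box_card_attribute > span")
]

print(raised, attrs_loop, attrs_css)
```

Either option yields one value per card. The CSS form is shorter, while the loop form makes it easier to keep each attribute paired with other fields from the same row.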
To get all card titles plus their attributes and text, you can use the following example:
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&pid=11101000&rp=99999"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

out = []
# Each card lives in its own .t_row container, so iterate row by row.
for t in soup.select(".t_row"):
    title = t.select_one(".card_name").get_text(strip=True)
    # Key each spec by the first class of its span (e.g. box_card_attribute).
    attrs = {
        s["class"][0]: re.sub(r"\s{2,}", "", s.get_text(strip=True))
        for s in t.select(".box_card_spec > span")
    }
    text = t.select_one(".box_card_text").get_text(strip=True)
    out.append({"title": title, **attrs, "text": text})

# Cards that lack a given spec get an empty string instead of NaN.
df = pd.DataFrame(out).fillna("")
print(df.head().to_markdown())  # to_markdown() needs the optional "tabulate" package
df.to_csv("data.csv", index=False)
Prints:
| | title | box_card_attribute | box_card_level_rank | card_info_species_and_other_item | atk_power | def_power | text | box_card_effect |
|---|---|---|---|---|---|---|---|---|
| 0 | Tri-Horned Dragon | DARK | Level 8 | [Dragon/Normal] | ATK 2850 | DEF 2350 | An unworthy dragon with three sharp horns sprouting from its head. | |
| 1 | Blue-Eyes White Dragon | LIGHT | Level 8 | [Dragon/Normal] | ATK 3000 | DEF 2500 | This legendary dragon is a powerful engine of destruction. Virtually invincible, very few have faced this awesome creature and lived to tell the tale. | |
| 2 | Hitotsu-Me Giant | EARTH | Level 4 | [Beast-Warrior/Normal] | ATK 1200 | DEF 1000 | A one-eyed behemoth with thick, powerful arms made for delivering punishing blows. | |
| 3 | Flame Swordsman | FIRE | Level 5 | [Warrior/Fusion] | ATK 1800 | DEF 1600 | "Flame Manipulator" + "Masaki the Legendary Swordsman" | |
| 4 | Skull Servant | DARK | Level 1 | [Zombie/Normal] | ATK 300 | DEF 200 | A skeletal ghost that isn't strong but can mean trouble in large numbers. | |
The script also saves the results to data.csv (the original answer included a LibreOffice screenshot of the file).
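One caveat with the row-by-row approach: select_one() returns None when nothing matches, so if any row lacks a .card_name or .box_card_text element, calling .get_text() on the result raises AttributeError. Whether such rows exist on the real page is not confirmed here. The following is only a sketch of a None-safe variant; text_or_empty and parse_rows are hypothetical helper names, and SAMPLE_HTML is an invented fragment rather than real page output.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row deliberately has no .box_card_text element.
SAMPLE_HTML = """
<div class="t_row c_normal">
  <span class="card_name">Tri-Horned Dragon</span>
  <dl>
    <dd class="box_card_spec flex_1">
      <span class="box_card_attribute"><span>DARK</span></span>
      <span class="box_card_level_rank"><span>Level 8</span></span>
    </dd>
    <dd class="box_card_text">An unworthy dragon.</dd>
  </dl>
</div>
<div class="t_row c_normal">
  <span class="card_name">Skull Servant</span>
  <dl>
    <dd class="box_card_spec flex_1">
      <span class="box_card_attribute"><span>DARK</span></span>
    </dd>
  </dl>
</div>
"""


def text_or_empty(parent, selector):
    """Return the stripped text of the first match, or "" if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else ""


def parse_rows(html):
    """Build one dict per .t_row, tolerating missing child elements."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".t_row"):
        record = {"title": text_or_empty(row, ".card_name")}
        for span in row.select(".box_card_spec > span"):
            # Fall back to a placeholder key if a spec span has no class.
            classes = span.get("class") or ["unknown"]
            record[classes[0]] = span.get_text(strip=True)
        record["text"] = text_or_empty(row, ".box_card_text")
        rows.append(record)
    return rows


rows = parse_rows(SAMPLE_HTML)
for record in rows:
    print(record)
```

The resulting list of dicts can be passed to pd.DataFrame(rows).fillna("") exactly as in the answer's code, so the rest of the pipeline stays the same.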