BeautifulSoup: trouble getting nested span elements

Question

I'm using Python and BS4. I can get the top entry from the page, but I want to get all of the entries.

cardAttr = soup.find(class_='box_card_attribute').find("span", {"class": False}).text

cardAttr = soup.select_one('span.box_card_attribute >span').text

Both of the above give me only the first match, and trying to use find_all instead gives me an AttributeError. Here is a snippet of the HTML:

    <div id="card_list" class="list">
                    <div class="t_row c_normal">
                        <div class="box_card_img">
                            <img id="card_image_0_1" alt="Tri-Horned Dragon" title="Tri-Horned Dragon" class="none">
                        </div>
                        <dl class="flex_1">
                            <dd class="box_card_name flex_1 top_set">
                                <span class="card_ruby"></span>
                                <span class="card_name">Tri-Horned Dragon</span>
                            </dd>
                            <dd class="icon flex_1 top_set">
                                <div class="lr_icon rid rid_5" style="background-color:#e86d6d;color:#e86d6d">
                                    <p>SE</p>
                                    <span style="background-color:#fff4f4;border-color:#e86d6d;color:#e86d6d; ">
                                            Secret Rare
                                    </span>
                                </div>
                            </dd>
                            <dd class="remove_btn top_set">
                                <a href="javascript:void(0);" class="btn hex red"  title="Remove this card from the list.">
                                    <span>X</span>
                                    <input type="hidden" class="lang" value="">
                                    <input type="hidden" class="cid" value="4711">
                                </a>
                            </dd>
                            <dd class="box_card_spec flex_1">
    
                                <span class="box_card_attribute">
                                    <img class="icon_img" src="external/image/parts/attribute/attribute_icon_dark.png" alt="DARK" title="DARK">
                                    <span>DARK</span>
                                </span>

Currently I can grab the 'DARK' text, but I can't seem to get it to run across the whole page the way I can with class=card_name.

Here is the URL I'm looking at, if needed.

https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&pid=11101000&rp=99999

Answer 1

To get all card titles plus their attributes and text, you can use the following example:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&pid=11101000&rp=99999"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

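out = []
# Each card on the page lives in its own .t_row container.
for t in soup.select(".t_row"):
    title = t.select_one(".card_name").get_text(strip=True)
    # One column per direct child <span> of .box_card_spec, keyed by its first class name.
    attrs = {
        s["class"][0]: re.sub(r"\s{2,}", "", s.get_text(strip=True))
        for s in t.select(".box_card_spec > span")
    }
    text = t.select_one(".box_card_text").get_text(strip=True)
    out.append({"title": title, **attrs, "text": text})

# Rows that lack a given attribute get an empty string instead of NaN.
df = pd.DataFrame(out).fillna("")
# to_markdown() requires the optional "tabulate" package.
print(df.head().to_markdown())
df.to_csv("data.csv", index=False)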

Prints:

	title	box_card_attribute	box_card_level_rank	card_info_species_and_other_item	atk_power	def_power	text
0	Tri-Horned Dragon	DARK	Level 8	[Dragon/Normal]	ATK 2850	DEF 2350	An unworthy dragon with three sharp horns sprouting from its head.
1	Blue-Eyes White Dragon	LIGHT	Level 8	[Dragon/Normal]	ATK 3000	DEF 2500	This legendary dragon is a powerful engine of destruction. Virtually invincible, very few have faced this awesome creature and lived to tell the tale.
2	Hitotsu-Me Giant	EARTH	Level 4	[Beast-Warrior/Normal]	ATK 1200	DEF 1000	A one-eyed behemoth with thick, powerful arms made for delivering punishing blows.
3	Flame Swordsman	FIRE	Level 5	[Warrior/Fusion]	ATK 1800	DEF 1600	"Flame Manipulator" + "Masaki the Legendary Swordsman"
4	Skull Servant	DARK	Level 1	[Zombie/Normal]	ATK 300	DEF 200	A skeletal ghost that isn't strong but can mean trouble in large numbers.

The script also saves the result to data.csv.
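
As for the AttributeError mentioned in the question: find_all returns a ResultSet (a list of tags) rather than a single tag, so chaining .find(...) or .text directly onto it fails. Below is a minimal sketch of two ways around that. It assumes a trimmed-down, hypothetical copy of the HTML (two rows, the second one made up for illustration) instead of the live page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page: two card rows.
html = """
<div id="card_list" class="list">
  <div class="t_row c_normal">
    <span class="card_name">Tri-Horned Dragon</span>
    <dd class="box_card_spec flex_1">
      <span class="box_card_attribute">
        <img class="icon_img" alt="DARK" title="DARK">
        <span>DARK</span>
      </span>
    </dd>
  </div>
  <div class="t_row c_normal">
    <span class="card_name">Blue-Eyes White Dragon</span>
    <dd class="box_card_spec flex_1">
      <span class="box_card_attribute">
        <img class="icon_img" alt="LIGHT" title="LIGHT">
        <span>LIGHT</span>
      </span>
    </dd>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, so calling .find() on the result itself fails.
error_name = None
try:
    soup.find_all(class_="box_card_attribute").find("span")
except AttributeError as exc:
    error_name = type(exc).__name__

# Option 1: loop over the matches and query each tag individually.
# box.find("span") searches only the descendants of box, i.e. the inner <span>.
attrs_loop = [
    box.find("span").get_text(strip=True)
    for box in soup.find_all(class_="box_card_attribute")
]

# Option 2: select() is the "all matches" counterpart of select_one().
attrs_css = [
    s.get_text(strip=True)
    for s in soup.select("span.box_card_attribute > span")
]

print(error_name)
print(attrs_loop)
print(attrs_css)
```

Both lists come out in document order, so they can be paired with the card names collected via class=card_name as long as every row actually has an attribute span. Iterating per .t_row, as in the answer above, is the safer choice when some rows may lack one.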

python

beautifulsoup

web-scraping