Scraping dropdown menu values with Python BeautifulSoup

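I have looked through most of the existing posts but could not find an answer to my small problem.

This is the dropdown menu I want to scrape: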

<div class="input-box">
    <select name="super_attribute[138]" id="attribute138" class="required-entry super-attribute select form-control" onchange="notifyMe(this.value, this.options[this.selectedIndex].innerHTML);">
        <option value="">Choose an Option...</option>
        <option value="17" price="0">M (in stock) </option>
        <option value="18" price="0">L (out of stock) </option>
        <option value="15" price="0">XL (in stock) </option>
        <option value="52" price="0">XXL (in stock) </option>
    </select>
</div>

My Python code is:

items = soup.select('option[value]')
values = [item.get('value') for item in items]
textvalues = [item.text for item in items]

print(textvalues)

The output is: ['select', '(In-Stock)', '(Out-Stock)', '(In-Stock)', '(In-Stock)']

What I need are the other values as well (SizeValue & SizeName): 17 & M / 18 & L / 15 & XL / 52 & XXL

If I remove .text, I get this output:

   <option value="">select</option>, <option value="200@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="201@#-(Out-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(Out-Stock)</option>, <option value="202@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="203@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>

Thanks in advance for your help.

It's simple: add a + and call item.text inside your list comprehension.

Instead of:

values = [item.get('value') for item in items]

Use:

values = [item.get('value') + item.get_text(strip=True) for item in items[1:]]
print(values)
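
If the value and the size name are needed as separate fields rather than one concatenated string, a minimal sketch along these lines may help. It assumes the markup looks like the snippet in the question (trimmed here) and that the size name is whatever precedes " (" in the option text; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question; the live page may differ.
html = """
<div class="input-box">
    <select name="super_attribute[138]" id="attribute138">
        <option value="">Choose an Option...</option>
        <option value="17" price="0">M (in stock) </option>
        <option value="18" price="0">L (out of stock) </option>
        <option value="15" price="0">XL (in stock) </option>
        <option value="52" price="0">XXL (in stock) </option>
    </select>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the placeholder option, whose value attribute is empty.
options = [
    opt for opt in soup.select("select#attribute138 option") if opt.get("value")
]

# Build (SizeValue, SizeName) pairs; assumes the name ends before " (".
sizes = [
    (opt["value"], opt.get_text(strip=True).split(" (")[0])
    for opt in options
]
print(sizes)
```

Filtering on the value attribute avoids relying on the placeholder always being the first option, which is what items[1:] assumes.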

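Edit: The options are loaded dynamically by JavaScript, so they are not present as option elements in the HTML that requests returns. However, the data is embedded in the page source as JSON. You can extract it with a regular expression using the re module: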

import json
import re
import requests


url = "https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html"
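# Use the decoded text; str() on the raw bytes would add a b'...' wrapper and escape sequences
response = requests.get(url).text

# The product options are passed as a JSON argument to Product.Config(...)
regex_pattern = re.compile(r"Product\.Config\(({.*?})\);")
data = json.loads(regex_pattern.search(response).group(1))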

print(
    [
        product["id"] + product["label"]
        for product in data["attributes"]["138"]["options"]
    ]
)

Output:

['17M (in stock) ', '18L (out of stock) ', '15XL (in stock) ', '52XXL (in stock) ']
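
To try the regex step without hitting the site, and to get the id and size name as separate fields, here is a small offline sketch. The page_source string is a hypothetical, shortened stand-in for the real page source (the "var spConfig = new" prefix and the reduced JSON are assumptions); only the Product.Config(...) wrapper and the attributes/138/options path come from the code above.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source; the real JSON has more fields.
page_source = (
    'var spConfig = new Product.Config({"attributes": {"138": {"options": ['
    '{"id": "17", "label": "M (in stock) "}, '
    '{"id": "18", "label": "L (out of stock) "}'
    ']}}});'
)

# Capture the JSON object passed to Product.Config(...).
match = re.search(r"Product\.Config\((\{.*?\})\);", page_source)
if match is None:
    raise ValueError("Product.Config(...) not found in page source")
data = json.loads(match.group(1))

# Build (id, size name) pairs; assumes the name ends before " (".
sizes = [
    (option["id"], option["label"].split(" (")[0].strip())
    for option in data["attributes"]["138"]["options"]
]
print(sizes)
```

Checking for a missing match before calling group(1) gives a clearer error than an AttributeError on None if the page layout changes.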