Scraping dropdown menu values with Python BeautifulSoup
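I have looked through most of the existing posts but could not find an answer to my small problem.
This is the dropdown menu I want to scrape: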
<div class="input-box">
<select name="super_attribute[138]" id="attribute138" class="required-entry super-attribute select form-control" onchange="notifyMe(this.value, this.options[this.selectedIndex].innerHTML);">
<option value="">Choose an Option...</option>
<option value="17" price="0">M (in stock) </option>
<option value="18" price="0">L (out of stock) </option>
<option value="15" price="0">XL (in stock) </option>
<option value="52" price="0">XXL (in stock) </option>
</select>
</div>
My Python code is:
items = soup.select('option[value]')
values = [item.get('value') for item in items]
textvalues = [item.text for item in items]
print(textvalues)
The output is:
['select', '(In-Stock)', '(Out-Stock)', '(In-Stock)', '(In-Stock)']
What I also need are the other values (SizeValue & SizeName):
17 & M / 18 & L / 15 & XL / 52 & XXL
If I remove .text, I get this output:
<option value="">select</option>, <option value="200@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="201@#-(Out-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(Out-Stock)</option>, <option value="202@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>, <option value="203@#-(In-Stock)@#-https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html@#-">(In-Stock)</option>
Thanks in advance for your help.
It's simple: in your list comprehension, use + to concatenate the option's value with its text.
Instead of:
values = [item.get('value') for item in items]
Use:
values = [item.get('value') + item.get_text(strip=True) for item in items[1:]]
print(values)
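If the value and the size name are needed separately rather than concatenated, something like the following may help. It is only a sketch: it assumes the static HTML shown in the question is what BeautifulSoup actually receives, and the `html` string and the `sizes` list are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the markup quoted in the question.
html = """
<div class="input-box">
<select name="super_attribute[138]" id="attribute138">
<option value="">Choose an Option...</option>
<option value="17" price="0">M (in stock) </option>
<option value="18" price="0">L (out of stock) </option>
<option value="15" price="0">XL (in stock) </option>
<option value="52" price="0">XXL (in stock) </option>
</select>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sizes = []
for option in soup.select("select#attribute138 option"):
    value = option.get("value")
    if not value:
        # Skip the "Choose an Option..." placeholder, whose value is empty.
        continue
    label = option.get_text(strip=True)          # e.g. "M (in stock)"
    size_name = label.split("(", 1)[0].strip()   # keep the part before "("
    sizes.append((value, size_name))

print(sizes)
# [('17', 'M'), ('18', 'L'), ('15', 'XL'), ('52', 'XXL')]
```

Filtering on an empty value is a little more robust than slicing with items[1:], since it does not depend on the placeholder being the first option.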
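Edit: the dropdown data is loaded dynamically, so the HTML returned by requests does not contain the populated options. However, the data is available in JSON format in the page source. You can extract it with a regular expression using the re module: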
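import json
import re

import requests

url = "https://store.alsabihmarine.com/index.php/diving-equipments/wetsuits/camouflage-hooded-suits-220.html"

# Use the decoded text; str() on the raw bytes would add b'...' and escape sequences
html = requests.get(url).text

# The option data is embedded in the page as a JSON argument to Product.Config(...)
regex_pattern = re.compile(r"Product\.Config\(({.*?})\);")
data = json.loads(regex_pattern.search(html).group(1))

# Concatenate each option's id and label for attribute 138 (the dropdown above)
print(
    [
        product["id"] + product["label"]
        for product in data["attributes"]["138"]["options"]
    ]
)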
Output:
['17M (in stock) ', '18L (out of stock) ', '15XL (in stock) ', '52XXL (in stock) ']
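To try the regex step without hitting the site, here is a minimal offline sketch. The `page_source` string is a hypothetical, heavily reduced stand-in for the real page; only the `Product.Config(...)` call and the `attributes` / `138` / `options` keys are taken from the answer above, and the surrounding script text is an assumption.

```python
import json
import re

# Hypothetical stand-in for the page source (assumption, not the real page).
page_source = (
    '<script>var spConfig = new Product.Config('
    '{"attributes": {"138": {"options": ['
    '{"id": "17", "label": "M (in stock) "}, '
    '{"id": "18", "label": "L (out of stock) "}'
    ']}}});</script>'
)

# Non-greedy match up to the first "});" that closes the call.
# re.DOTALL lets the JSON span several lines if it does on the real page.
pattern = re.compile(r"Product\.Config\((\{.*?\})\);", re.DOTALL)

match = pattern.search(page_source)
if match is None:
    # Fail with a clear message instead of an AttributeError on None.
    raise ValueError("Product.Config(...) not found in page source")

config = json.loads(match.group(1))

# Keep id and label as separate fields, trimming the trailing space.
pairs = [
    (option["id"], option["label"].strip())
    for option in config["attributes"]["138"]["options"]
]
print(pairs)
# [('17', 'M (in stock)'), ('18', 'L (out of stock)')]
```

Checking for a failed match first makes it easier to notice when the site changes its markup and the pattern stops matching.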