Web Scrape - How do I click options based on select name using contains value?

Question

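I am trying to scrape the select dropdown below to get its text content.

I can't use the full name, because the "_P889O1" prefix changes for every product I will be pulling data from. I thought I could use 'contains' instead, but I get an error saying the script "is not a valid XPath expression".

The total price changes depending on the option value, so I assume a click needs to happen here?

The HTML I am working with: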

<div class="GC75 ProductChoiceName" id="ProductChoiceName-%%" sf:object="ProductChoiceName" style="color: rgb(36, 36, 36);">
<select name="_P889O1Barrel length" onfocus="if(tf.core.idTextOptionBlur){clearTimeout(tf.core.idTextOptionBlur);tf.core.idTextOptionBlur=null;}if(tf.core.onblurcode){eval(tf.core.onblurcode);tf.core.onblurcode='';tf.core.setFocusID='';}" onclick="cancelBuble(event);if(tf.isInSF())return false;" onchange="tf.core.crFFldImager.replace('P889',this.value.split(core.str_sep1)[7]);var c = this.value;tf.core.crFFldOptPrc.updPrc('P889',this.value?this.value.split(core.str_sep1)[7]:'P889O1',crFFldArr,opt);dBasePrice2('P889',c);return false;" size="1">
<option value="">Barrel length&nbsp;*</option><option value="28&quot;~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1" origvalue="28&quot;~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1">28"</option>
<option value="30&quot;~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2" origvalue="30&quot;~|`0~|`0.00~|`50222~|`0.000000~|`0.000~|`~|`P889O1C2">30"</option>
<option value="32&quot;~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3" origvalue="32&quot;~|`0~|`0.00~|`50224~|`0.000000~|`0.000~|`~|`P889O1C3">32"</option>
</select></div>

Full code snippet:

from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib.request as urllib2
import locale
import json
from selenium import webdriver
import lxml.html
import time
from selenium.webdriver.support.ui import Select 
os.environ["PYTHONIOENCODING"] = "utf-8"

#selenium requests
browser = webdriver.Chrome(executable_path='C:/Users/admin/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)

#beautiful soup requests
URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", "GC62 Product")


barrels = soup.find_all('select', attrs={'name': re.compile('length')})
[[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]


for product in products:
#title
    title = product.find("h3") 
    titleText = title.text if title else ''

#manufacturer name
    manufacturer = product.find("div", "GC5 ProductManufacturer")
    manuText = manufacturer.text if manufacturer else ''

 #image location
    img = product.find("div", "ProductImage")
    imglinks = img.find("a") if img else ''
    imglinkhref = imglinks.get('href')  if imglinks else ''
    imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
 
#description
    description = product.find("div", "GC12 ProductDescription")
    descText = description.text if description else ''

#more description
    more = product.find("div", "GC12 ProductDetailedDescription")
    moreText = more.text if more else ''

#price
    spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
    for i in range(0,len(spans),2):
        span = spans[i].text
        i+=1 

        #print(span)
        #print(titleText)
        #print(manuText)
        #print(descText)
        #print(moreText)
        #print(imgurl.replace('..', ''))
        #print("\n")

In both cases I used print(x) as a visual aid to show myself that it is "working".

Answer 1

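You need Selenium because the dropdown is generated by JavaScript. Two suggestions: first, Selenium needs some time to load the page dynamically, so add a time.sleep to allow for that. Second, the XPath syntax needs a small change: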

import time
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
dropd = browser.find_element_by_xpath("//select[contains(@name, 'Barrel')]")

Output of print(dropd.text):

Barrel length *
28"
30"
32"
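To actually click an option rather than only read the text, something along these lines may help. It is a minimal sketch, not a tested recipe for this site: `select_xpath` and `choose_option` are hypothetical helper names, the 10 second timeout is an arbitrary choice, and it uses the `By` locator style because the `find_element_by_*` helpers have been removed in newer Selenium 4 releases. An explicit wait stands in for the fixed `time.sleep(2)`.

```python
def select_xpath(name_fragment):
    """Build an XPath matching any <select> whose name contains the fragment."""
    # Assumes the fragment has no single quotes, since it is embedded in '...'.
    return f"//select[contains(@name, '{name_fragment}')]"


def choose_option(browser, name_fragment, visible_text, timeout=10):
    """Pick an option by visible text in every matching dropdown.

    Returns the number of dropdowns in which the option was selected.
    """
    # Imported here so select_xpath stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    locator = (By.XPATH, select_xpath(name_fragment))
    # Explicit wait: returns as soon as at least one matching dropdown exists.
    dropdowns = WebDriverWait(browser, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )
    chosen = 0
    for dropdown in dropdowns:
        select = Select(dropdown)
        labels = [option.text.strip() for option in select.options]
        # Skip dropdowns that do not offer this choice.
        if visible_text in labels:
            select.select_by_visible_text(visible_text)
            chosen += 1
    return chosen
```

With the `browser` from the question, a call such as `choose_option(browser, "Barrel", '30"')` should select the 30" barrel wherever it is offered. Selecting through `Select` is expected to fire the page's `onchange` handler, after which the price can be re-read with the `div.GC20.ProductPrice span` selector from the question.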

Alternatively, you can combine BeautifulSoup with Selenium:

import time
import re

browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)
soup = BeautifulSoup(browser.page_source, 'html.parser')
barrels = soup.find_all('select', attrs={'name': re.compile('length')})
[[x['origvalue'][:2] for x in i.find_all('option')[1:]] for i in barrels]
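The `[:2]` slice only works while every label is exactly two characters long. A slightly more robust sketch is to split the option value on its field separator. The separator "~|`" is an assumption inferred from the sample option values in the question (the page's own `onchange` handler splits on `core.str_sep1` and reads index 7), and `parse_option_value` is a hypothetical helper name.

```python
FIELD_SEP = "~|`"  # assumed separator, inferred from the sample option values


def parse_option_value(raw):
    """Split one option value into its label, choice code, and raw fields."""
    fields = raw.split(FIELD_SEP)
    # First field is the visible label; last field is the choice code.
    return {"label": fields[0], "code": fields[-1], "fields": fields}


# Sample taken from the first non-empty <option> in the question's HTML.
sample = '28"~|`0~|`0.00~|`50220~|`0.000000~|`0.000~|`~|`P889O1C1'
parsed = parse_option_value(sample)
print(parsed["label"], parsed["code"])  # prints: 28" P889O1C1
```

In the list comprehension above, `parse_option_value(x['origvalue'])["label"]` could then take the place of `x['origvalue'][:2]`.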

Fitting the BeautifulSoup approach into your full code:

browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
time.sleep(2)

soup = BeautifulSoup(browser.page_source, 'html.parser')
products = soup.find_all("div", "GC62 Product")

for product in products:
#barrel lengths
  barrels = product.find('select', attrs={'name': re.compile('length')})
  if barrels:
    barrels_list = [x['origvalue'][:2] for x in barrels.find_all('option')[1:]]
    print(barrels_list)
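Since `re.compile('length')` matches any select whose name merely contains "length", a tighter pattern that tolerates the changing prefix may be useful. The sketch below assumes the naming scheme "_P<product id>O<option index><label>", which is generalized from the single sample name "_P889O1Barrel length" and may not hold for every product; `split_select_name` is a hypothetical helper name.

```python
import re

# Assumed naming scheme, generalized from the one sample "_P889O1Barrel length":
# "_P<product id>O<option index><label>".
NAME_PATTERN = re.compile(r"^_(P\d+)O(\d+)(.+)$")


def split_select_name(name):
    """Return (product_id, option_index, label), or None if the name differs."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    product_id, option_index, label = match.groups()
    return product_id, int(option_index), label


print(split_select_name("_P889O1Barrel length"))  # ('P889', 1, 'Barrel length')
```

Under the same assumption, `re.compile(r'^_P\d+O\d+Barrel length$')` could be passed to `find` in place of `re.compile('length')`. A label that itself starts with a digit would be ambiguous under this pattern.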
