如何通过 Python Selenium BeautifulSoup 从网站提取证券价格作为文本
How to extract the price for the security as text from the website through Python Selenium BeautifulSoup
我想简单地获取 https://investor.vanguard.com/529-plan/profile/4514 中显示的证券价格。我运行这个代码:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'C:\Program_Files_EllieTheGoodDog\Geckodriver\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
当我在selenium打开Firefox中"inspect element"价格时,我清楚地看到了这个:
<span data-ng-if="!data.isLayer" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" class="ng-scope ng-binding arrange">.91</span >
但是这些数据不在我的汤里。如果我打印我的汤,html 与网站上显示的确实有很大不同。我试过了,但完全失败了:
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': '{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}', 'class': 'ng-scope ng-binding arrange'})
我完全被难住了。如果有人能指出我正确的方向,我将不胜感激。我觉得我完全错过了一些东西,可能有几件事......
您将 data_*
属性和值用于 select 跨度的方式没有任何问题。事实上它是 documentation 中提到的正确方法。有 4 个 span 标签匹配所有属性。 find_all
将 return 所有这些标签。第二个对应价格
您错过的是 span 需要一些时间才能加载,并且页面源在此之前 returned。您可以 explicitly wait 该跨度,然后获取页面源。这里我使用 Xpath 来等待元素。您可以通过转到 inspect tool -> right click element -> copy -> copy xpath
获取 xpath
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[3]/div[1]/div/div/table/tbody/tr[1]/td[2]/div/span[1]')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': '{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}', 'class': 'ng-scope ng-binding arrange'})
print(myspan)
print(myspan[1].text)
输出
[<span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Unit price as of 02/15/2019</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">.91</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Change</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer"><span class="number-positive">[=11=].47</span> <span class="number-positive">1.11%</span></span>]
.91
Selenium 单独就足以提取所需的文本。您需要为 visibility_of_element_located
引入 WebDriverWait,您可以使用以下解决方案:
代码块:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//tr[@class='ng-scope']//td[@class='ng-scope right']//span[@class='ng-scope ng-binding arrange' and @data-ng-bind-html]"))).get_attribute("innerHTML"))
控制台输出:
.91
我想简单地获取 https://investor.vanguard.com/529-plan/profile/4514 中显示的证券价格。我运行这个代码:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'C:\Program_Files_EllieTheGoodDog\Geckodriver\geckodriver.exe')
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
当我在selenium打开Firefox中"inspect element"价格时,我清楚地看到了这个:
<span data-ng-if="!data.isLayer" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" class="ng-scope ng-binding arrange">.91</span >
但是这些数据不在我的汤里。如果我打印我的汤,html 与网站上显示的确实有很大不同。我试过了,但完全失败了:
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': '{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}', 'class': 'ng-scope ng-binding arrange'})
我完全被难住了。如果有人能指出我正确的方向,我将不胜感激。我觉得我完全错过了一些东西,可能有几件事......
您将 data_*
属性和值用于 select 跨度的方式没有任何问题。事实上它是 documentation 中提到的正确方法。有 4 个 span 标签匹配所有属性。 find_all
将 return 所有这些标签。第二个对应价格
您错过的是 span 需要一些时间才能加载,并且页面源在此之前 returned。您可以 explicitly wait 该跨度,然后获取页面源。这里我使用 Xpath 来等待元素。您可以通过转到 inspect tool -> right click element -> copy -> copy xpath
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get('https://investor.vanguard.com/529-plan/profile/4514')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'/html/body/div[1]/div[3]/div[3]/div[1]/div/div[1]/div/div/div/div[2]/div/div[3]/div[1]/div/div/table/tbody/tr[1]/td[2]/div/span[1]')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
myspan = soup.find_all('span', attrs={'data-ng-if': '!data.isLayer', 'data-ng-bind-html': 'data.value', 'data-ng-class': '{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}', 'class': 'ng-scope ng-binding arrange'})
print(myspan)
print(myspan[1].text)
输出
[<span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Unit price as of 02/15/2019</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">.91</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer">Change</span>, <span class="ng-scope ng-binding arrange" data-ng-bind-html="data.value" data-ng-class="{sceIsLayer : isETF, arrange : isMutualFund, arrangeSec : isETF}" data-ng-if="!data.isLayer"><span class="number-positive">[=11=].47</span> <span class="number-positive">1.11%</span></span>]
.91
Selenium 单独就足以提取所需的文本。您需要为 visibility_of_element_located
引入 WebDriverWait,您可以使用以下解决方案:
代码块:
from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe') driver.get('https://investor.vanguard.com/529-plan/profile/4514') print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//tr[@class='ng-scope']//td[@class='ng-scope right']//span[@class='ng-scope ng-binding arrange' and @data-ng-bind-html]"))).get_attribute("innerHTML"))
控制台输出:
.91