Is there a way to not repeat the stock tickers?

Question

import json
from io import StringIO
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import time
from selenium import webdriver
import requests
import pandas as pd


PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not read as escapes
driver = webdriver.Chrome(PATH)

url = "https://thestockmarketwatch.com/markets/after-hours/trading.aspx"
driver.minimize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
Afterhours = soup.find_all('a', {'class': 'symbol'})

for a in Afterhours:
    print(a.text)
    print("")


driver.quit()

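Hi everyone, I'm writing this After-Hours Gapper scraper and I've run into some trouble. The stock tickers are repeating themselves. How can I get only the tickers the site actually displays, without the repeats?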

Answer 1

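That's because if you open the Inspect Element tab and search for the class name symbol, you get more than 30 results, which means there are more elements with that class than the ones you want.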
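A quick way to confirm this from code rather than from the inspector is to count the matches and see which tickers occur more than once. This is only a minimal sketch: the HTML string below is hypothetical markup written for illustration, not copied from the site, and in the real script you would build `soup` from `driver.page_source` instead.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the site's real HTML): the same
# tickers appear once in a plain table and again inside a styled block.
html = """
<table>
  <tr><td><a class="symbol" href="#">AAA</a></td></tr>
  <tr><td><a class="symbol" href="#">BBB</a></td></tr>
</table>
<div style="display:none">
  <a class="symbol" href="#">AAA</a>
  <a class="symbol" href="#">BBB</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="symbol")

# Count how often each ticker text occurs among the matched links.
counts = Counter(a.get_text(strip=True) for a in links)
repeated = [ticker for ticker, n in counts.items() if n > 1]

print(len(links))   # total number of elements with class "symbol"
print(repeated)     # tickers that were matched more than once
```

If the total is roughly double the number of rows you see on the page, the selector is probably matching a second copy of the list.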

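Let me show you. Take a look at these two images:

The first image contains the data you want, but the second image contains the same data under the same class. So you have to find a way to tell the two apart. There are probably many ways to do that, but I'll share one that I found useful.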

import json
from io import StringIO
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service    # new import (Selenium 4)
from webdriver_manager.chrome import ChromeDriverManager    # new import
import requests
import pandas as pd

# better way to initialize the browser and driver than specifying the path every time:
# webdriver_manager downloads a matching chromedriver and returns its path
option = webdriver.ChromeOptions()
# option.add_argument('--headless') # uncomment to run browser in background when not debugging
option.add_argument("--log-level=3")
option.add_experimental_option('excludeSwitches', ['enable-logging'])
# Selenium 4 takes the driver path through a Service object, not as a positional argument
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)


# PATH = r"C:\Program Files (x86)\chromedriver.exe"
# driver = webdriver.Chrome(PATH)

url = "https://thestockmarketwatch.com/markets/after-hours/trading.aspx"
driver.minimize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
Afterhours = soup.find_all('a', {'class': 'symbol'})
removed_duplicates = []

for a in Afterhours:
    # the difference between the two duplicates: the wanted link follows a tag
    # with no inline style, while its copy follows a tag that has one
    if a.find_previous().get('style') is None:
        removed_duplicates.append(a)

for i in removed_duplicates:
    print(i.text)
    print()     # just an empty print() would print a new line
driver.quit()

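As you may have noticed by now, the first tag has no inline style, but the second one has some. The great thing about BeautifulSoup is that it makes traversal smooth, so you can move up and down from an element to find whatever you need.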
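To try the inline-style check without launching a browser, here is a small self-contained sketch on a static snippet. The markup is an assumption made up for illustration (the real page's structure may differ); it only mimics the situation described above, where each ticker appears once after an unstyled tag and once after a tag carrying a `style` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the site: each ticker appears once in
# a plain cell and once in a cell that carries an inline style.
html = (
    '<table>'
    '<tr><td><a class="symbol" href="#">AAA</a></td></tr>'
    '<tr><td><a class="symbol" href="#">BBB</a></td></tr>'
    '<tr><td style="width:60px"><a class="symbol" href="#">AAA</a></td></tr>'
    '<tr><td style="width:60px"><a class="symbol" href="#">BBB</a></td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")

kept = []
for a in soup.find_all("a", class_="symbol"):
    # find_previous() with no arguments returns the nearest tag that comes
    # before this <a> in document order (here, the enclosing <td>).
    previous_tag = a.find_previous()
    if previous_tag.get("style") is None:
        kept.append(a.get_text(strip=True))

print(kept)
```

Note that the check depends entirely on how the page happens to be marked up, so it is worth re-inspecting the HTML if the site changes its layout.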

I've also added some improvements and suggestions to your code; feel free to ignore them if you already know them. Good question!
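As one of the other possible approaches, and a different technique from the inline-style check, you could drop repeats by ticker text while keeping the order in which they first appear. This is a rough sketch, not part of the original answer: `unique_in_order` is a hypothetical helper name and the list stands in for the values of `a.text` collected by the scraper.

```python
def unique_in_order(tickers):
    """Return the tickers with later repeats dropped, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(tickers))


# Hypothetical values standing in for [a.text for a in Afterhours].
scraped = ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"]
print(unique_in_order(scraped))
```

The trade-off is that this also collapses a ticker the site legitimately lists twice, for example once among the gainers and once in another table, so it fits best when each symbol should appear only once.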

python

beautifulsoup

repeat

web-scraping