Using Selenium to click page and scrape Info from routed page

I'm working on a project analyzing the SuperCluster Astronaut Database. I'm trying to scrape the data for each astronaut into a nice, clean pandas DataFrame. There is plenty of descriptive information about every astronaut available to scrape. However, when you click on an astronaut, more information is displayed - you get a few paragraphs of their biography. I'd like to scrape that too, but I need to automate clicking the link and then scraping the data from the page I'm routed to.

Here is my attempt so far:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time



data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)

bio_data = []

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')

for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
        
    data.append([name,bio_data])



cols=['name','bio']
df = pd.DataFrame(data,columns=cols)

print(df)

I'm getting an error message:

InvalidSessionIdException: Message: invalid session id

I'm not sure how to fix this. Can someone point me in the right direction? Any help would be greatly appreciated!

InvalidSessionIdException

InvalidSessionIdException occurs when the given session ID is not in the list of active sessions, meaning the session either does not exist or is not active.


This use case

Possibly the Selenium-driven, ChromeDriver-initiated browsing context (the Chrome browser) is getting closed and the session is getting terminated. In your code you call driver.close() right after reading driver.page_source, and then call driver.find_element_by_css_selector() inside the loop; by that point the session no longer exists, which raises the error.
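
A minimal sketch of that failure mode, assuming (as is typical) that chromedriver ends the session once its last window is closed, mirroring the order of calls in the question:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.supercluster.com/astronauts')
html = driver.page_source
driver.close()  # the only window is closed, so the session is gone

# Any further command on the same driver object is sent to a dead session
# and raises "invalid session id"
driver.find_element_by_css_selector('a.astronaut_cell.x')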


References

You can find a couple of relevant detailed discussions in:

  • selenium.common.exceptions.WebDriverException: Message: invalid session id using Selenium with ChromeDriver and Chrome through Python

Each link leads to a separate page containing the bio data. So instead of clicking, you have to collect each URL and send another request to gather the bio data from each individual page.

Script:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time


url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

# Load the JavaScript-rendered listing page in a headless Chrome session
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)
time.sleep(5)

Name = []
bio = []
soup = BeautifulSoup(driver.page_source, 'lxml')

# Pull every astronaut name from the listing page
for name in soup.select('.bau.astronaut_cell__title.bold.mr05'):
    name = name.text
    Name.append(name)
    #print(name)
    # Collect the card links and fetch each detail page with a fresh headless driver
    urls = soup.select('a[class="astronaut_cell x"]')
    for url in urls:
        abs_url = 'https://www.supercluster.com' + url.get('href')
        print(abs_url)
        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
        driver.maximize_window()
        driver.get(abs_url)
        time.sleep(5)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.close()

        # Grab the first few bio fields from the detail page
        for astro in soup.select('div.h4')[0:8]:
            astro = astro.text
            bio.append(astro)


df = pd.DataFrame(data=list(zip(Name, bio)), columns=['name', 'bio'])
print(df)

Output:

      name                                                    bio
0        Nield, George                                    b. Jul 31, 1950
1         Kitchen, Jim                                              Human
2            Lai, Gary                                               Male
3          Hagle, Marc            President Commercial Space Technologies
4        Hagle, Sharon                                    b. Jul 31, 1950
..                 ...                                                ...
295  Wilcutt, Terrence                           Lead Operations Engineer
296    Linenger, Jerry                                     b. Oct 1, 1975
297      Mukai, Chiaki                                              Human
298     Thomas, Donald                                               Male
299       Chiao, Leroy  People's Liberation Army Air Force Data Missin...

[300 rows x 2 columns]
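
A note on the design choice above: launching a fresh headless Chrome for every detail page is slow, and zipping a flat bio list against the name list can drift out of alignment. A leaner sketch, assuming the same card and div.h4 selectors as the answer above, reuses one driver and pairs each card's name with its own link:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get(url)
time.sleep(10)  # give the JS-rendered listing time to load

listing = BeautifulSoup(driver.page_source, 'lxml')
rows = []
for card in listing.select('a[class="astronaut_cell x"]'):
    name = card.select_one('.bau.astronaut_cell__title.bold.mr05').get_text(strip=True)
    # Navigate the same driver to this astronaut's detail page
    driver.get('https://www.supercluster.com' + card.get('href'))
    time.sleep(5)  # give the detail page time to render
    detail = BeautifulSoup(driver.page_source, 'lxml')
    # Join the first few fact fields into one bio string for this astronaut
    bio = ' | '.join(tag.get_text(strip=True) for tag in detail.select('div.h4')[:8])
    rows.append([name, bio])

driver.quit()
df = pd.DataFrame(rows, columns=['name', 'bio'])
print(df)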