Using Selenium to click page and scrape Info from routed page
I'm working on a project analyzing the SuperCluster Astronaut Database. I'm trying to scrape the data for each astronaut into a nice, clean pandas DataFrame. There is plenty of descriptive information about each astronaut available to scrape. However, when you click on an astronaut, more information is shown - you get a few paragraphs of their biography. I'd like to scrape that as well, but I need to automate clicking each link and then scraping the data from the page I get routed to.
Here is my attempt so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)
bio_data = []
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
    data.append([name,bio_data])
cols=['name','bio']
df = pd.DataFrame(data,columns=cols)
print(df)
I'm getting an error message:
InvalidSessionIdException: Message: invalid session id
Not sure how to get past this. Could someone point me in the right direction? Any help would be greatly appreciated!
InvalidSessionIdException
InvalidSessionIdException occurs when the given session ID is not in the list of active sessions, which means the session either does not exist or is no longer active.
This use case
Possibly the Selenium-driven, ChromeDriver-initiated google-chrome headless browsing context is being detected as a bot and the session is being terminated.
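Note also that in the posted code driver.close() is called before driver.find_element_by_css_selector(), which by itself ends the session and produces the same exception. Below is a minimal sketch, not part of the original post, that reuses only the URL and the a.astronaut_cell.x link selector from the question; the rest is illustrative. It shows that commands sent after the session has ended raise InvalidSessionIdException, so the driver is kept open until the last command:

# Illustrative sketch (assumptions: question's URL and link selector; Selenium 3 style API as in the post)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get('https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order')
time.sleep(10)

# driver.close()                                               # closing here would end the session ...
# driver.find_element_by_css_selector('a.astronaut_cell.x')    # ... and this call would then raise InvalidSessionIdException

links = driver.find_elements_by_css_selector('a.astronaut_cell.x')  # still valid: the session is alive
print(len(links))

driver.quit()  # end the session only after the last command has been sent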
References
You can find a couple of relevant detailed discussions in:
- selenium.common.exceptions.WebDriverException: Message: invalid session id using Selenium with ChromeDriver and Chrome through Python
Each link leads to a separate page that contains the bio data. So instead of clicking, you have to collect each URL and send another request to collect the bio data from each individual page.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.maximize_window()
driver.get(url)
time.sleep(5)
Name=[]
bio=[]
soup = BeautifulSoup(driver.page_source, 'lxml')
for name in soup.select('.bau.astronaut_cell__title.bold.mr05'):
    name = name.text
    Name.append(name)
    #print(name)

urls = soup.select('a[class="astronaut_cell x"]')
for url in urls:
    abs_url = 'https://www.supercluster.com' + url.get('href')
    print(abs_url)
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    driver.maximize_window()
    driver.get(abs_url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    for astro in soup.select('div.h4')[0:8]:
        astro = astro.text
        bio.append(astro)
df = pd.DataFrame(data=list(zip(Name,bio)),columns=['name','bio'])
print(df)
Output:
name bio
0 Nield, George b. Jul 31, 1950
1 Kitchen, Jim Human
2 Lai, Gary Male
3 Hagle, Marc President Commercial Space Technologies
4 Hagle, Sharon b. Jul 31, 1950
.. ... ...
295 Wilcutt, Terrence Lead Operations Engineer
296 Linenger, Jerry b. Oct 1, 1975
297 Mukai, Chiaki Human
298 Thomas, Donald Male
299 Chiao, Leroy People's Liberation Army Air Force Data Missin...
[300 rows x 2 columns]
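As a follow-up note, the script above starts a fresh headless Chrome for every astronaut URL, which is slow for 300 entries. A leaner sketch under the same assumptions about the page structure (the a[class="astronaut_cell x"] links, the .bau.astronaut_cell__title.bold.mr05 names and the div.h4 detail fields, all taken from the script above) reuses one driver and joins the detail fields into a single bio string per page:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get(url)
time.sleep(5)

# collect names and detail-page URLs from the listing page in one pass
soup = BeautifulSoup(driver.page_source, 'lxml')
names = [n.text for n in soup.select('.bau.astronaut_cell__title.bold.mr05')]
links = ['https://www.supercluster.com' + a.get('href') for a in soup.select('a[class="astronaut_cell x"]')]

bios = []
for link in links:
    driver.get(link)      # reuse the same session instead of launching a new browser
    time.sleep(5)
    detail = BeautifulSoup(driver.page_source, 'html.parser')
    # join the first few detail fields into one bio string per astronaut
    bios.append(' | '.join(d.text for d in detail.select('div.h4')[0:8]))

driver.quit()             # end the session once, after the last page

df = pd.DataFrame(list(zip(names, bios)), columns=['name', 'bio'])
print(df)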