Web scraping website with BeautifulSoup and Selenium won't detect table elements in webpage
I am trying to retrieve the table of bids from the following website:
https://wbgeconsult2.worldbank.org/wbgec/index.html#$h=1582042296662
(After clicking the link, you need to click 'Business Opportunities' in the top-right corner to reach the table.)
I tried pandas read_html, Selenium, and BeautifulSoup, and all of them failed (they did not detect the table element at all).
I also tried to find the link in the Network tab of the developer tools, but none of the requests there seemed to work.
Is this possible? What am I doing wrong?
Here is my code:
from selenium import webdriver
from selenium.webdriver import ActionChains
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests
from requests_html import HTMLSession
session = HTMLSession()
import re
URL='https://wbgeconsult2.worldbank.org/wbgec/index.html#$h=1582042296662'
#Enter Gecko driver path
driver=webdriver.Firefox(executable_path ='/Users/****/geckodriver')
driver.get(URL)
# driver.minimize_window()
opp_path='//*[@id="menu_publicads"]/a'
list_ch=driver.find_element_by_xpath(opp_path)
ActionChains(driver).click(list_ch).perform()
time.sleep(5)
sort_xpath='//*[@id="jqgh_selection_notification.publication_date"]'
list_ch=driver.find_element_by_xpath(sort_xpath)
ActionChains(driver).click(list_ch).perform()
time.sleep(5)
sort_xpath='//*[@id="jqgh_selection_notification.publication_date"]'
list_ch=driver.find_element_by_xpath(sort_xpath)
ActionChains(driver).click(list_ch).perform()
time.sleep(5)
response = requests.get(URL)  # note: don't name this 're' -- that shadows the imported re module
soup = BeautifulSoup(response.content, 'lxml')
rows = soup.findAll('td')
print(rows)
ti = driver.find_elements_by_xpath('//tr')
for t in ti:
    print(t.text)
The data is loaded from an external URL via an XML (SOAP) request. You can use this example to load and parse the data into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://wbgeconsult2.worldbank.org/wbgect/gwproxy"
data = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><GetCurrentPublicNotifications xmlns="http://cordys.com/WBGEC/DBT_Selection_Notification/1.0"><NotifTypeId3 xmlns="undefined">3</NotifTypeId3><DS type="dsort"><selection_notification.eoi_deadline order="asc"></selection_notification.eoi_deadline></DS></GetCurrentPublicNotifications></soapenv:Body></soapenv:Envelope>"""
soup = BeautifulSoup(requests.post(url, data=data).content, "xml")
# uncomment this to print all data:
# print(soup.prettify())
data = []
for sn in soup.select("SELECTION_NOTIFICATION"):
    d = {}
    for tag in sn.find_all(recursive=False):
        d[tag.name] = tag.get_text(strip=True)
    data.append(d)
df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
ID PUBLICATION_DATE EOI_DEADLINE LANGUAGE_OF_NOTICE ADVERTISE_UNTIL TITLE SELECTION_TYPE_NAME SELECTION_TYPE_ID SELECTION_NUMBER SOLICITATION_OR_FRAMEWORK SELECTION_STATUS_ID SELECTION_SUB_STATUS_ID
0 148625 2021-04-16T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 Zanzibar PPP Diagnostic and Pipeline Firm 2 1274225 2 8
1 148536 2021-04-14T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 Assessment of Institutional Capacity for Imple... Firm 2 1274123 2 8
2 148310 2021-04-12T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 Albania Digital Jobs Pilot Firm 2 1273851 2 8
3 148399 2021-04-12T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 EaP - Green Financing for Transport Infrastruc... Firm 2 1273953 2 8
4 148448 2021-04-12T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 Surveying LGBTI people in North Macedonia and ... Firm 2 1274001 2 8
5 148277 2021-04-14T00:00:00.0 2021-04-26T23:59:59.900000000 English 2021-04-26T23:59:59.0 SME FINANCE FORUM 2021 WEBSITES REVAMP Firm 2 1273810 2 8
...
and saves data.csv (the original answer included a screenshot from LibreOffice here).
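As a follow-up, the date columns in the resulting DataFrame are plain strings; if you want to filter or sort by deadline, pandas can parse them into datetimes. A minimal sketch (the sample values below are copied from the output above, not refetched):

```python
import pandas as pd

# Sample values mirroring the PUBLICATION_DATE / EOI_DEADLINE strings above.
df = pd.DataFrame({
    "PUBLICATION_DATE": ["2021-04-16T00:00:00.0", "2021-04-12T00:00:00.0"],
    "EOI_DEADLINE": ["2021-04-26T23:59:59.900000000"] * 2,
})

# Convert the string columns to datetime64 so comparisons and sorting work.
for col in ("PUBLICATION_DATE", "EOI_DEADLINE"):
    df[col] = pd.to_datetime(df[col])

# e.g. keep only notifications whose deadline has not passed yet:
# upcoming = df[df["EOI_DEADLINE"] > pd.Timestamp.now()]
print(df.dtypes)
```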
Try this; the code below waits for the elements to appear and then scrapes their text. Adapt it to your needs.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 30)
driver.get('https://wbgeconsult2.worldbank.org/wbgec/index.html#$h=1582042296662')
BusinessOpportunity = wait.until(
    EC.visibility_of_element_located((By.XPATH, "//a[text()=\"Business Opportunities\"]"))).click()
TableRow = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//table[@id=\"notificationsGrid\"]/descendant::tr")))
for row in TableRow:
    print(row.text)
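As an alternative to printing whole-row text, once the grid has rendered you could hand the page source to pandas.read_html and get a DataFrame directly. A sketch with a canned HTML fragment standing in for driver.page_source (the table id "notificationsGrid" matches the XPath above; the row data here is illustrative, not real):

```python
import io
import pandas as pd

# Stand-in for the rendered page; with Selenium you would instead use:
#   html = driver.page_source
html = """
<table id="notificationsGrid">
  <tr><th>TITLE</th><th>PUBLICATION_DATE</th></tr>
  <tr><td>Albania Digital Jobs Pilot</td><td>2021-04-12</td></tr>
</table>
"""

# read_html returns a list of DataFrames; attrs narrows it to the grid table.
tables = pd.read_html(io.StringIO(html), attrs={"id": "notificationsGrid"})
df = tables[0]
print(df)
```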