Scraping JavaScript data within a grid of a webpage using Selenium and Python
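My problem is that I need all the data within the grid, including subdomains, from the website https://applipedia.paloaltonetworks.com (the data contains NAME, CATEGORY, SUBCATEGORY, RISK, TECHNOLOGY). What I need is, for example: in row 5, 2ch has two subdomains, 2ch-base and 2ch-posting. I only want to get a list of all applications that have subdomains, like this one.
Right now, whenever I try to add anything to this line:
table = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'tbody#bodyScrollingTable tr')))
I get a timeout error.
Below is the script I have so far. It fetches all the data from the grid, but I only need the applications that have subdomains, together with those subdomains [example: 2ch, 2ch-base, 2ch-posting]. By inspecting the elements I found a pattern: all applications without subdomains have ( ), or we could go by the ( ) field, which is common to all applications that have subdomains. Any help with this would be greatly appreciated.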
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=r'/Users/am/Downloads/chromedriver')
driver.maximize_window()
driver.get("https://applipedia.paloaltonetworks.com/")
wait = WebDriverWait(driver, 30)
# Wait until every row of the grid body is present in the DOM
table = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'tbody#bodyScrollingTable tr')))
# Print the visible text of each row
for tab in table:
    print(tab.text)
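For the URL https://applipedia.paloaltonetworks.com/, to get the list of all applications that have subdomains you need to induce WebDriverWait for the desired elements to be visible. You can use the following solution:
Code block: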
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
# Selenium 4 style: the driver path goes through Service, and options= replaces chrome_options=
service = Service(r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://applipedia.paloaltonetworks.com/')
# Keep only the rows whose ottawagroup attribute is neither '0' nor '2', then take the link in each row
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='btmTable' and @id='dataTable']//tbody[@id='bodyScrollingTable']//tr[not(@ottawagroup='0') and not(@ottawagroup='2')]/td/a")))
for element in elements:
    print(element.get_attribute("innerHTML"))
Console output:
DevTools listening on ws://127.0.0.1:12927/devtools/browser/d4a5d576-a4b0-4a3d-959b-9d37aff36fc6
2ch
51.com
adobe-connect
adobe-connectnow
adobe-creative-cloud
aim
aim-express
ali-wangwang
amazon-cloud-drive
amazon-music
ameba-now
assembla
autodesk360
avaya-webalive
bacnet
baidu-hi
bebo
bitbucket
boxnet
buddybuddy
chinaren
cisco-spark
cloudapp
cloudforge
cloudinary
concur
confluence
convo
cyph
daum
dcinside
diameter
dnp3
dochub
docstoc
docusign
draw.io
dropbox
egnyte
evernote
facebook
fetion
filestack
flickr
flixwagon
fuze-meeting
gatherplace
genesys
git
github
gitlab
glassdoor
globalmeet
gmail
google-calendar
google-cloud-storage
google-docs
google-hangouts
google-plus
google-spaces
google-talk
google-translate
google-video
gotomypc
gotowebinar
gtp
hadoop
hightail
hipchat
hootsuite
huddle
hulu
hyves
iccp
icloud
iec-60870-5-104
imeet
imgur
instagram
instan-t
ip-messenger
ipsec
irc
issuu
itunes
jira
join-me
jumpshare
kaixin
kaixin001
kakaotalk
laiwang
landesk
linkedin
live-mesh
lotus-notes
lotuslive
lucidpress
mail.ru
mail.ru-agent
maytech
meebo
meetup
mega
mendeley
mercurial
mixi
modbus
ms-ds-smb
ms-lync
ms-office365
ms-onedrive
msn
myspace
nateon-im
netease-webdisk
netflix
ning
noteworthy
now-tv
odnoklassniki
onehub
owncloud
paltalk
pastebin
pcanywhere
pinterest
pivotaltracker
powow
prezi
proofhub
qik
qliksense-cloud
qq
quip
quora
rally-software
readytalk
reddit
rediffbol
renren
rtp
salesforce
sap-jam
screencast
scribd
second-life
secure-data-space
sendthisfile
service-now
sharefile
sharepoint
sharevault
showmax
siemens-s7
signiant
sina-uc
sina-weibo
skydrive
slack
slideshare
smartsheet
snmp
softros-messenger
solarwinds
soundcloud
sourceforge
spark-im
ss7-map
stocktwits
storify
subversion
surveymonkey
syncplicity
tableau
teamdrive
teamup-calendar
teamviewer
thwapr
torch-browser
trello
tumblr
twitter
uc-yun
viber
vimeo
vine
virustotal
vkontakte
vnc
watchdox
webex
wechat
weiyun
whatsapp
windows-azure
windows-defender-atp
workday
yahoo-im
yammer
youku
yousendit
youtube
yunpan360
yy-voice
zalo
zendesk
zenefits
zettahost
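The XPath above keeps only the rows whose ottawagroup attribute is neither '0' nor '2'. As a minimal offline sketch of that filter, the snippet below applies the same rule to a small, hypothetical table fragment using only the standard library. The markup is a simplified assumption and not a copy of the real page, and example-app is a made-up name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's grid markup (an assumption,
# not the real HTML). Only 2ch and its two subdomains come from the question.
MOCK_TABLE = """
<table id="dataTable" class="btmTable">
  <tbody id="bodyScrollingTable">
    <tr ottawagroup="0"><td><a>example-app</a></td></tr>
    <tr ottawagroup="1"><td><a>2ch</a></td></tr>
    <tr ottawagroup="2"><td><a>2ch-base</a></td></tr>
    <tr ottawagroup="2"><td><a>2ch-posting</a></td></tr>
  </tbody>
</table>
"""


def names_excluding(table_xml, excluded=("0", "2")):
    """Return link texts of rows whose ottawagroup is not in `excluded`."""
    root = ET.fromstring(table_xml.strip())
    rows = root.findall(".//tbody[@id='bodyScrollingTable']/tr")
    return [
        row.find("td/a").text.strip()
        for row in rows
        if row.get("ottawagroup") not in excluded
    ]


selected = names_excluding(MOCK_TABLE)
print(selected)
```

On this mock fragment only the 2ch row survives the filter, which matches the first entry of the console output above.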
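With the code below you can get the list of domains that contain subdomains quickly and cleanly:
# Wait until at least one link inside an ottawagroup='1' row is visible
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[ottawagroup='1'] a")))
# Collect the trimmed text of every such link in a single JavaScript call
domains = driver.execute_script("return [...document.querySelectorAll(\"[ottawagroup='1'] a\")].map(e=>e.textContent.trim())")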
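The question also asks for the subdomains themselves (2ch, 2ch-base, 2ch-posting), not only the parent names. The sketch below shows one possible way to group them, under an assumption that is not confirmed by the page: that a row with ottawagroup '1' is a parent, that rows with ottawagroup '2' are its subdomains and follow it directly, and that '0' marks an application without subdomains. The function name group_by_parent and the sample rows are hypothetical.

```python
def group_by_parent(rows):
    """Group subdomain names under their parent application.

    `rows` is an iterable of (ottawagroup, name) pairs in page order.
    Assumed meaning of the values: '1' = parent with subdomains,
    '2' = subdomain of the most recent parent, anything else = standalone.
    """
    grouped = {}
    current = None
    for group, name in rows:
        if group == "1":
            current = name
            grouped[current] = []
        elif group == "2" and current is not None:
            grouped[current].append(name)
        else:
            # A standalone row ends the current parent's run of subdomains.
            current = None
    return grouped


# Hypothetical sample; only the 2ch entries come from the question.
sample_rows = [
    ("0", "example-app"),
    ("1", "2ch"),
    ("2", "2ch-base"),
    ("2", "2ch-posting"),
    ("0", "another-app"),
]
result = group_by_parent(sample_rows)
print(result)
```

With Selenium, the pairs could be built from each row of tbody#bodyScrollingTable by reading row.get_attribute("ottawagroup") and the text of the link in the row. If the subdomain rows are hidden until their parent is expanded, .text may come back empty for them, and get_attribute("textContent") is an alternative worth trying.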