How to Scrape Scheduled WebEx Meetings in Python
How can I list my scheduled WebEx meetings?
Here is the web element:
<div class="meeting_list"><div role="region" aria-label="My Webex Meetings list" class="meeting_list_content" style=""><!----> <!----> <!----> <div role="grid" aria-relevant="additions removals" class="m_list" style=""><div class="m_list_item m_list_item_0"><div class="col col_1 col_0_1"><span class="avatar_img avatar_small" style="background-image: url("https://mirantis.webex.com/avatarservice/v1/users/520902917/avatars/390b91c0-571c-4889-a708-57c891b44315?siteurl=mirantis&size=64");"><img src="https://mirantis.webex.com/avatarservice/v1/users/520902917/avatars/390b91c0-571c-4889-a708-57c891b44315?siteurl=mirantis&size=64" alt="Avatar Picture"></span></div> <div class="col col_2 col_0_2"><div><div class="list_t">
11:00 AM - 12:00 PM
</div> <div class="list_st">Wed, Apr 8</div></div></div> <div class="col col_3 col_0_3"><div class="list_t"><div class="meeting_topic meetings"><a href="javascript:void(0)" title="#3446655 Instance hangs on migration, virsh commands timedout" class="">
#3446655 Instance hangs on migration, virsh commands timedout
</a></div> <div class="back meeting_topic_column"><!----> <!----> <!----> <!----> <!----> <span><!----></span></div></div> <div class="list_st">Mirantis Operations </div></div> <div class="col col_4 col_0_4"><span class="list_btn"><button type="button" class="el-button el-button--success" aria-label="Press enter to Start the meeting."><!----><!----><span>Start</span></button></span></div></div><div class="m_list_item m_list_item_1"><div class="col col_1 col_1_1"><span class="avatar_img avatar_small" style="background-image: url("https://mirantis.webex.com/avatarservice/v1/users/520902917/avatars/390b91c0-571c-4889-a708-57c891b44315?siteurl=mirantis&size=64");"><img src="https://mirantis.webex.com/avatarservice/v1/users/520902917/avatars/390b91c0-571c-4889-a708-57c891b44315?siteurl=mirantis&size=64" alt="Avatar Picture"></span></div> <div class="col col_2 col_1_2"><div><div class="list_t">
12:00 PM - 1:00 PM
</div> <div class="list_st">Wed, Apr 8</div></div></div> <div class="col col_3 col_1_3"><div class="list_t"><div class="meeting_topic meetings"><a href="javascript:void(0)" title="00122550 EMEA Scanner is not scanning properly" class="">
00122550 EMEA Scanner is not scanning properly
</a></div> <div class="back meeting_topic_column"><!----> <!----> <!----> <!----> <!----> <span><!----></span></div></div> <div class="list_st">Mirantis Operations </div></div> <div class="col col_4 col_1_4"><span class="list_btn"><button type="button" class="el-button el-button--success" aria-label="Press enter to Start the meeting."><!----><!----><span>Start</span></button></span></div></div> <div class="infinite-loading-container"><div style="display: none;"><i class="loading-spiral"></i></div> <div class="infinite-status-prompt" style="display: none;"><span></span></div> <div class="infinite-status-prompt" style=""><span></span></div></div></div> <div><!----> <div><!----></div></div><div class="el-loading-mask" style="display: none;"><div class="el-loading-spinner"><svg viewBox="25 25 50 50" class="circular"><circle cx="50" cy="50" r="20" stroke-width="2" stroke="#D1D3D7" fill="none"></circle><circle cx="50" cy="50" r="20" fill="none" class="path"></circle></svg><!----></div></div></div></div>
As you can see, this HTML lists the available meetings.
However, I only want a clean list.
Here is the XPath for the list:
//*[@id="main_content"]/div[1]/div/div/div[2]/div
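Is there a way to scrape this information in Python?
My requests succeed, but I can find very little documentation on using GET requests with WebEx.
Here is my code, which at least gets me through authentication.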
#!/usr/bin/env python
import json

import pandas as pd
import requests
from lxml import html
from requests.auth import HTTPBasicAuth

# Load credentials; secrets.json is assumed to hold "username" and "password" keys.
with open('secrets.json', 'r') as f:
    config = json.load(f)

# HTTPBasicAuth takes the username and the password as two separate arguments.
requests.get('https://mirantis.webex.com',
             auth=HTTPBasicAuth(config['username'], config['password']))

# GET Page Source
page = requests.get('https://mirantis.webex.com/webappng/sites/mirantis/meeting/home')
tree = html.fromstring(page.content)

# GET Meetings
meetings = tree.xpath('//*[@id="main_content"]/div[1]/div/div/div[2]/div/div')
meetings
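I would like to list these in a DataFrame, possibly with pandas.
However, I get nothing back; the result is simply empty.
I would normally expect to see results, but there is nothing: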
>>> page = requests.get('https://mirantis.webex.com/webappng/sites/mirantis/dashboard?siteurl=mirantis')
>>> tree = html.fromstring(page.content)
>>> meetings = tree.xpath('//*[@id="main_content"]/div/div[1]/div[2]/div/div')
>>> meetings
[]
When I look at page.content, I do get an HTML body. Am I scraping this the wrong way?
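A quick way to narrow this down is to check whether the meeting rows are present in the raw response at all. The sketch below assumes the rows carry the `m_list_item` class seen in the HTML above; `looks_rendered` is a hypothetical helper name, not part of any library. If it returns False for `page.content`, the list is probably filled in by JavaScript after the page loads (an assumption, not something confirmed here), and no XPath will find it in the plain response.

```python
def looks_rendered(page_html, marker="m_list_item"):
    """Return True if the raw HTML already contains meeting rows.

    page_html may be bytes (requests' page.content) or str (page.text).
    marker is the row class name taken from the HTML in the question.
    """
    if isinstance(page_html, bytes):
        page_html = page_html.decode("utf-8", errors="replace")
    return marker in page_html


# An empty application shell, as a plain GET might return it.
print(looks_rendered(b'<div id="main_content"></div>'))
# A fragment of the rendered list, as seen in the browser.
print(looks_rendered('<div class="m_list_item m_list_item_0"></div>'))
```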
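As I mentioned in the comments, try using Selenium to extract this data.
Install it with pip:
pip install selenium
You will also need to download ChromeDriver (if Chrome is your browser of choice) from
https://chromedriver.chromium.org/downloads
and adjust the snippet below to point to the correct ChromeDriver binary.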
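from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = "https://mirantis.webex.com/webappng/sites/mirantis/dashboard?siteurl=mirantis"
chrome_driver_path = "<path_to_chrome_driver>"
xpath_pattern = '//*[@id="main_content"]/div/div[1]/div[2]/div/div'

def find_meetings(driver, pattern):
    # Current Selenium 4 releases no longer have find_elements_by_xpath.
    meetings = driver.find_elements(By.XPATH, pattern)
    # do something here
    return meetings

# Current Selenium 4 releases take the driver path through a Service object.
driver = Chrome(service=Service(chrome_driver_path))
try:
    driver.get(url)
    find_meetings(driver, xpath_pattern)
finally:
    driver.quit()  # close the browser and end the driver session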
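Once the page has rendered, the rows still need to be turned into the clean list the question asks for. The following is a minimal sketch, assuming you pass it the rendered HTML as a string (for example Selenium's `driver.page_source`) and that the class names match the HTML shown in the question. `FIELDS`, `MeetingListParser`, `parse_meetings` and `SAMPLE` are hypothetical names used for illustration, and the sample is a trimmed copy of the first row. It uses the standard library's `html.parser` instead of lxml only so that it runs without extra dependencies.

```python
from html.parser import HTMLParser

# Maps (column class, cell class) to the output field name.
FIELDS = {
    ("col_2", "list_t"): "time",
    ("col_2", "list_st"): "date",
    ("col_3", "list_st"): "host",
}


class MeetingListParser(HTMLParser):
    """Collects one dict per div.m_list_item row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._col = None    # column of the row currently being read
        self._field = None  # field currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "m_list_item" in classes:
            self.rows.append({})  # a new meeting row starts here
            self._col = self._field = None
        elif not self.rows:
            return  # ignore everything before the first row
        elif tag == "div":
            for name in classes:
                if name in ("col_1", "col_2", "col_3", "col_4"):
                    self._col = name
                if (self._col, name) in FIELDS:
                    self._field = FIELDS[(self._col, name)]
        elif tag == "a" and self._col == "col_3" and attrs.get("title"):
            # The topic is read from the link's title attribute.
            self.rows[-1]["topic"] = attrs["title"]

    def handle_data(self, data):
        if self._field:
            row = self.rows[-1]
            row[self._field] = row.get(self._field, "") + data

    def handle_endtag(self, tag):
        # The captured cells contain only text, so the next </div> ends them.
        if tag == "div" and self._field:
            row = self.rows[-1]
            row[self._field] = row.get(self._field, "").strip()
            self._field = None


def parse_meetings(page_source):
    """Return a list of dicts with time, date, topic and host per meeting."""
    parser = MeetingListParser()
    parser.feed(page_source)
    parser.close()
    return parser.rows


SAMPLE = """
<div class="m_list"><div class="m_list_item m_list_item_0">
<div class="col col_2 col_0_2"><div><div class="list_t">
11:00 AM - 12:00 PM
</div> <div class="list_st">Wed, Apr 8</div></div></div>
<div class="col col_3 col_0_3"><div class="list_t"><div class="meeting_topic meetings">
<a href="javascript:void(0)" title="#3446655 Instance hangs on migration, virsh commands timedout">
#3446655 Instance hangs on migration, virsh commands timedout
</a></div></div> <div class="list_st">Mirantis Operations </div></div>
</div></div>
"""

rows = parse_meetings(SAMPLE)
print(rows)
```

If a DataFrame is wanted, a list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)`.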