How to scrape first-innings commentary for every match from the cricinfo website by changing the innings filter, using Selenium and Python

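Hi all, I have been trying to scrape some data from the cricinfo website to get the commentary for every match. I can get the complete data for the second innings, but not for the first innings: when I inspect the page source, the dropdown does not seem to have options or anything like a select class. It would be great if someone could suggest some ways to do this. This is the URL of the page: https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019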

The data is loaded dynamically via JavaScript. You can use the requests/json modules to load the data into Python:

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019'
api_url = 'https://hsapi.espncricinfo.com/v1/pages/match/comments?lang=en&leagueId={leagueId}&eventId={eventId}&liveTest=false&filter=full&page={page}'

leagueId, eventId = re.findall(r'(\d+)/commentary/(\d+)', url)[0]

page = 1
while True:
    data = requests.get(api_url.format(page=page, leagueId=leagueId, eventId=eventId)).json()

    # uncomment next line to see all data:
    # print(json.dumps(data, indent=4))

    # print some data to screen:
    for comment in data['comments']:
        soup1 = BeautifulSoup(comment['preText'], 'html.parser')
        soup2 = BeautifulSoup(comment['text'], 'html.parser')
        soup3 = BeautifulSoup(comment['postText'], 'html.parser')

        print(soup1.get_text(strip=True, separator='\n'))
        print(soup2.get_text(strip=True, separator='\n'))
        print(soup3.get_text(strip=True, separator='\n'))

        print('-' * 80)

    page += 1

    if page > data['pagination']['pageCount']:
        break

Prints:

...

final ball. Can Mumbai cross 150? Pollard needs a six.
slower ball, full outside off, and
that's been smoked!
Drilled through the covers and
Chennai Super Kings 150 to win IPL 2019!
9.16pm
Another ravishing innings from Pollard against CSK in an IPL final. But will 150 be enough on this ground? Mumbai's innings was a stop-start one, with regular wickets ensuring they could never really accelerate. Deepak Chahar was excellent in his final three overs too, but Mumbai have two epic fast bowlers as well. Which team will win their fourth IPL title? We'll find out with Shashank Kishore when the second innings gets underway in a few minutes.
Shardul Thakur:
"Final game, best two teams in the IPL. We knew some hard cricket was going to happen. I feel Powerplay is where you can attack and take wicket. If you bowl defensively in the Powerplay, you will still get hit for fours and sixes. In the last game, I wanted to get early wickets but there was some good cricket played by Dhawan. But tonight, ball was swinging a bit. Rohit did hit me for a six, but idea wasn't to go away from my plan."
Raja: "@Vignesh That team did not have Dhoni as CAPTAIN"
Vignesh: "@Husen well , MI defended an even more low total in the same ground in 2017 finals against a team that had Dhoni ;)"
Satyam: "Think MI are 20-25 runs short here. At least 15 more would have been more defendable."
Divya: "Last 12 balls : 3 fours 3 wickets 6 dots 1 singles"
Mustafa Moudi: "If anyone feels this is a below-par score then let me remind everyone that MI defended 137 on this same ground and that too by a massive 40 runs and defeated the Home Team in this season !!"
Husen: "@Moustafa - That team did not have a Dhoni "


...
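
The paging loop above can also be wrapped in a small generator, so that the ID parsing and the page counting are reusable for other matches. This is only a sketch under assumptions: `parse_ids`, `iter_comments` and the `fetch_json` parameter are hypothetical names introduced here, and the only JSON keys relied on are the ones the code above already uses (`comments` and `pagination.pageCount`).

```python
import re

# Same endpoint as in the answer above, split over several lines for readability.
API_URL = (
    'https://hsapi.espncricinfo.com/v1/pages/match/comments'
    '?lang=en&leagueId={leagueId}&eventId={eventId}'
    '&liveTest=false&filter=full&page={page}'
)


def parse_ids(url):
    """Return (leagueId, eventId) taken from a commentary page URL."""
    match = re.search(r'(\d+)/commentary/(\d+)', url)
    if match is None:
        raise ValueError('not a commentary URL: ' + url)
    return match.group(1), match.group(2)


def iter_comments(url, fetch_json):
    """Yield every comment dict, following the pagination.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON document (hypothetical parameter, injected so the paging logic
    can be exercised without network access).
    """
    league_id, event_id = parse_ids(url)
    page = 1
    while True:
        data = fetch_json(API_URL.format(
            leagueId=league_id, eventId=event_id, page=page))
        yield from data['comments']
        if page >= data['pagination']['pageCount']:
            break
        page += 1
```

With requests installed, it could be called as `iter_comments(url, lambda u: requests.get(u).json())`, and each yielded comment can then be passed through BeautifulSoup as shown above.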

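To scrape the first-innings commentary of a match from the cricinfo website by changing the filter with Selenium, you need to induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategy: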

  • Using XPATH:

    # -*- coding: utf-8 -*-
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    # Pass both switches in one call: a second "excludeSwitches" call would replace the first.
    options.add_experimental_option("excludeSwitches", ["enable-logging", "enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # executable_path is the Selenium 3 style; Selenium 4 takes a Service object instead.
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.espncricinfo.com/series/8048/commentary/1181768/mumbai-indians-vs-chennai-super-kings-final-indian-premier-league-2019')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='comment-container-head']/div/div/div/div"))).click()
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class, 'ci-dd__menu')]/div[contains(@class, 'ci-dd__menu-list')]/div[contains(@class, 'ci-dd__option') and text()='MI Innings']"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='match-comment-long-text match-comment-padder']/span"))).text)
    
  • Console output:

    9.16pm Another ravishing innings from Pollard against CSK in an IPL final. But will 150 be enough on this ground? Mumbai's innings was a stop-start one, with regular wickets ensuring they could never really accelerate. Deepak Chahar was excellent in his final three overs too, but Mumbai have two epic fast bowlers as well. Which team will win their fourth IPL title? We'll find out with Shashank Kishore when the second innings gets underway in a few minutes.
    Shardul Thakur: "Final game, best two teams in the IPL. We knew some hard cricket was going to happen. I feel Powerplay is where you can attack and take wicket. If you bowl defensively in the Powerplay, you will still get hit for fours and sixes. In the last game, I wanted to get early wickets but there was some good cricket played by Dhawan. But tonight, ball was swinging a bit. Rohit did hit me for a six, but idea wasn't to go away from my plan."
    Raja: "@Vignesh That team did not have Dhoni as CAPTAIN"
    Vignesh: "@Husen well , MI defended an even more low total in the same ground in 2017 finals against a team that had Dhoni ;)"
    Satyam: "Think MI are 20-25 runs short here. At least 15 more would have been more defendable."
    Divya: "Last 12 balls : 3 fours 3 wickets 6 dots 1 singles"
    Mustafa Moudi: "If anyone feels this is a below-par score then let me remind everyone that MI defended 137 on this same ground and that too by a massive 40 runs and defeated the Home Team in this season !!"
    Husen: "@Moustafa - That team did not have a Dhoni "
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
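
The XPath above hard-codes the option text 'MI Innings', so it only fits this match. A minimal sketch of a parameterised version follows. `innings_option_xpath` is a hypothetical helper, not part of Selenium, and it assumes that the dropdown on other match pages uses the same `ci-dd__*` class names and labels its options in the same way.

```python
OPTION_XPATH = (
    "//div[contains(@class, 'ci-dd__menu')]"
    "/div[contains(@class, 'ci-dd__menu-list')]"
    "/div[contains(@class, 'ci-dd__option') and text()='{label}']"
)


def innings_option_xpath(label):
    """Build the XPath of the dropdown option whose text equals label."""
    # The label is embedded in a single-quoted XPath string literal,
    # so a single quote inside it would break the expression.
    if "'" in label:
        raise ValueError('label must not contain a single quote')
    return OPTION_XPATH.format(label=label)


print(innings_option_xpath('MI Innings'))
```

The returned string can be passed to `EC.element_to_be_clickable((By.XPATH, ...))` in place of the literal used in the answer.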