Beautiful Soup BS4 "data-foo" associated text between tags not displaying

From this tag:

<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000">Sat 13 Aug 2011</div>

I want to extract "Sat 13 Aug 2011" using Beautiful Soup (bs4).

My current code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.premierleague.com/match/7468'
j = requests.get(url)
soup = BeautifulSoup(j.content, "lxml")

containedDateTag_string = soup.find_all('div', class_="matchDate renderMatchDateContainer")
print (containedDateTag_string)

When I run this, the printed output does not contain "Sat 13 Aug 2011"; it is simply stored and printed as:

[<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>]

Is there a way to display this string? I also tried parsing deeper into the tag with ".next_sibling" and ".text", which likewise displayed "[]" instead of the desired string, so I went back to trying just the 'div' to see if I could at least get the text to display.

Fetching the page with selenium/ChromeDriver and parsing .page_source is the way to go here, since the date text is generated by JavaScript:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.premierleague.com/match/7468"
driver = webdriver.Chrome()
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'lxml')

You can then .find the same way you did before:

>>> soup.find('div', {'class':"matchDate renderMatchDateContainer"}).text

'Sat 13 Aug 2011'
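As an aside, the data-kickoff attribute is present even in the un-rendered HTML, and it holds the kickoff time as epoch milliseconds. A stdlib-only sketch can therefore recover the date string without executing any JavaScript (the attribute value below is copied from the tag in the question):

```python
from datetime import datetime, timezone

# Value of the data-kickoff attribute from the static HTML
kickoff_ms = 1313244000000

# Convert epoch milliseconds to an aware UTC datetime
dt = datetime.fromtimestamp(kickoff_ms / 1000, tz=timezone.utc)

# Format to match the displayed text
print(dt.strftime("%a %d %b %Y"))  # Sat 13 Aug 2011
```

With the soup from your original requests attempt, int(tag['data-kickoff']) yields the same number, since attributes survive even when the JS-rendered text does not.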

A batteries-included Selenium solution:

>>> from selenium.webdriver.common.by import By
>>> driver.find_element(By.CSS_SELECTOR, "div.matchDate.renderMatchDateContainer").text
'Sat 13 Aug 2011'

(In Selenium versions before 4.3 you can use the old-style driver.find_element_by_css_selector("div.matchDate.renderMatchDateContainer").text instead; that method has since been removed.)

Without Selenium, using requests and the site's own API, it looks like this (you will of course get a bunch of other data about each match, but here is just the date part):

import requests
from time import sleep

def scraper(match_id):
    headers = {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id
    }

    api_endpoint = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d" % match_id
    r = requests.get(api_endpoint, headers=headers)
    if r.status_code != 200:
        return None
    else:
        data = r.json()
        # this will return something like this:
        # {'broadcasters': [],
        #  'fixture': {'attendance': 25700,
        #              'clock': {'label': "90 +4'00", 'secs': 5640},
        #              'gameweek': {'gameweek': 1, 'id': 744},
        #              'ground': {'city': 'London', 'id': 16, 'name': 'Craven Cottage'},
        #              'id': 7468,
        #              'kickoff': {'completeness': 3,
        #                          'gmtOffset': 1.0,
        #                          'label': 'Sat 13 Aug 2011, 15:00 BST',
        #                          'millis': 1313244000000},
        #              'neutralGround': False,
        #              'outcome': 'D',
        #              'phase': 'F',
        #              'replay': False,
        #              'status': 'C',
        #              'teams': [{'score': 0,
        #                         'team': {'club': {'abbr': 'FUL',
        #                                           'id': 34,
        #                                           'name': 'Fulham'},
        #                                  'id': 34,
        #                                  'name': 'Fulham',
        #                                  'shortName': 'Fulham',
        #                                  'teamType': 'FIRST'}},
        #                        {'score': 0,
        #                         'team': {'club': {'abbr': 'AVL',
        #                                           'id': 2,
        #                                           'name': 'Aston Villa'},
        #                                  'id': 2,
        #                                  'name': 'Aston Villa',
        #                                  'shortName': 'Aston Villa',
        #                                  'teamType': 'FIRST'}}]}}

        return data

match_id = 7468
json_blob = scraper(match_id)
if json_blob is not None:
    date = json_blob['fixture']['kickoff']['label']
    print(date)
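Since the function returns the parsed JSON, pulling out other fields is plain dictionary navigation; a small sketch against an abbreviated copy of the response shown in the comment above:

```python
# Abbreviated copy of the JSON shown above (assumed shape of the API response)
data = {
    "fixture": {
        "kickoff": {"label": "Sat 13 Aug 2011, 15:00 BST", "millis": 1313244000000},
        "teams": [
            {"score": 0, "team": {"name": "Fulham"}},
            {"score": 0, "team": {"name": "Aston Villa"}},
        ],
    }
}

fixture = data["fixture"]
date = fixture["kickoff"]["label"]

# Build a scoreline from the two team entries
scoreline = " v ".join(
    "%s %d" % (t["team"]["name"], t["score"]) for t in fixture["teams"]
)

print(date)       # Sat 13 Aug 2011, 15:00 BST
print(scoreline)  # Fulham 0 v Aston Villa 0
```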

You need the headers with those two fields to get the data. So if you have a bunch of match_ids, you can loop over them with this function:

for match_id in range(7000, 8000):
    json_blob = scraper(match_id)
    if json_blob is not None:
        date = json_blob['fixture']['kickoff']['label']
        print(date)
        sleep(1)