Beautiful Soup BS4 "data-foo" associated text between tags not displaying
From this tag:
<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000">Sat 13 Aug 2011</div>
I want to extract "Sat 13 Aug 2011" using bs4 / Beautiful Soup.
My current code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.premierleague.com/match/7468'
j = requests.get(url)
soup = BeautifulSoup(j.content, "lxml")
containedDateTag_string = soup.find_all('div', class_="matchDate renderMatchDateContainer")
print (containedDateTag_string)
When I run this, the printed output does not contain "Sat 13 Aug 2011"; it is simply stored and printed as:
[<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>]
Is there a way to display this string? I also tried parsing further through the tag with ".next_sibling" and ".text", both of which displayed "[]" instead of the desired string, which is why I reverted to trying just the 'div' to see if I could at least display the text.
Using selenium/ChromeDriver and grabbing the rendered page with .page_source is the way to go here, since the date text is generated by JavaScript:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.premierleague.com/match/7468"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
You can then .find just as you were doing before:
>>> soup.find('div', {'class':"matchDate renderMatchDateContainer"}).text
'Sat 13 Aug 2011'
A batteries-included Selenium solution:
>>> driver.find_element_by_css_selector("div.matchDate.renderMatchDateContainer").text
'Sat 13 Aug 2011'
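Side note: the printed output in the question shows that the data-kickoff attribute does come back in the static HTML; only the text between the tags is filled in by JavaScript. Since that attribute is an epoch timestamp in milliseconds, you could also derive the date from it with plain requests, no browser needed. A minimal sketch (note the site's label uses local UK time, so a late-evening kickoff could land on the next UTC day):

```python
from datetime import datetime, timezone

millis = 1313244000000  # int(tag['data-kickoff']) from the div above
kickoff = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(kickoff.strftime('%a %d %b %Y'))  # -> Sat 13 Aug 2011
```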
Without Selenium - but using requests and the site's own API - it would look something like this (sure, you'd fetch a bunch of other data about each game, but here's just the code for the date part):
import requests
from time import sleep
def scraper(match_id):
    headers = {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id
    }
    api_endpoint = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d" % match_id
    r = requests.get(api_endpoint, headers=headers)
    if not r.status_code == 200:
        return None
    else:
        data = r.json()
        # this will return something like this:
        # {'broadcasters': [],
        #  'fixture': {'attendance': 25700,
        #              'clock': {'label': "90 +4'00", 'secs': 5640},
        #              'gameweek': {'gameweek': 1, 'id': 744},
        #              'ground': {'city': 'London', 'id': 16, 'name': 'Craven Cottage'},
        #              'id': 7468,
        #              'kickoff': {'completeness': 3,
        #                          'gmtOffset': 1.0,
        #                          'label': 'Sat 13 Aug 2011, 15:00 BST',
        #                          'millis': 1313244000000},
        #              'neutralGround': False,
        #              'outcome': 'D',
        #              'phase': 'F',
        #              'replay': False,
        #              'status': 'C',
        #              'teams': [{'score': 0,
        #                         'team': {'club': {'abbr': 'FUL',
        #                                           'id': 34,
        #                                           'name': 'Fulham'},
        #                                  'id': 34,
        #                                  'name': 'Fulham',
        #                                  'shortName': 'Fulham',
        #                                  'teamType': 'FIRST'}},
        #                        {'score': 0,
        #                         'team': {'club': {'abbr': 'AVL',
        #                                           'id': 2,
        #                                           'name': 'Aston Villa'},
        #                                  'id': 2,
        #                                  'name': 'Aston Villa',
        #                                  'shortName': 'Aston Villa',
        #                                  'teamType': 'FIRST'}}]}}
        return data
match_id = 7468
json_blob = scraper(match_id)
if json_blob is not None:
    date = json_blob['fixture']['kickoff']['label']
    print(date)
You need the headers with those two parameters to get the data. So if you have a bunch of match_id's, you can loop through them with this function:
for match_id in range(7000, 8000, 1):
    json_blob = scraper(match_id)
    if json_blob is not None:
        date = json_blob['fixture']['kickoff']['label']
        print(date)
    sleep(1)
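If you do loop over a thousand ids like that, a requests.Session lets you reuse one connection rather than opening a new one per request. A sketch of the same loop under that assumption (make_headers and scrape_dates are made-up names for illustration, not part of the API):

```python
import requests
from time import sleep

API = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d"

def make_headers(match_id):
    # The API wants these two headers on every request.
    return {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id,
    }

def scrape_dates(start, stop, delay=1):
    # Collect the kickoff label for every match id that returns a 200.
    dates = []
    with requests.Session() as session:  # connection reused across requests
        for match_id in range(start, stop):
            r = session.get(API % match_id, headers=make_headers(match_id))
            if r.status_code == 200:
                dates.append(r.json()['fixture']['kickoff']['label'])
            sleep(delay)  # be polite to the server
    return dates
```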