BeautifulSoup - Scraping a comment when the ID field changes
I am collecting baseball game data across multiple seasons. Here is a sample of the data:
https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml
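For this question, I am specifically trying to pull out the comment that contains the umpire and game data. Note that these HTML files are now stored locally, so I am trying to loop through a folder. In the page source it looks like this: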
<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>
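As you can see, it is inside a comment. The real challenge is that the ID value changes with the venue and the season, and I am parsing 10 years of data. Can anyone tell me how to extract the comment text when the ID keeps changing?
Here is my code: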
# import libraries and files
from bs4 import BeautifulSoup, Comment
import csv
import os

print()
# Setup Games list for append
games = []
path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"
for filename in os.listdir(path):
    if filename.endswith(".html"):
        fullpath = os.path.join(path, filename)
        print('Processing {}...'.format(fullpath))
        # Get Page, Make Soup
        with open(fullpath) as page:
            soup = BeautifulSoup(page, 'lxml')
        # Setting up game object to append to list
        game = {}
        # Get Description
        # Note: Skip every other child because of 'Navigable Strings' from BS.
        divs = soup.find_all('div', class_='scorebox_meta')
        for div in divs:
            for idx, child in enumerate(div.children):
                if idx == 1:
                    game['date'] = child.text
                elif idx == 3:
                    game['start_time'] = child.text.split(':', 1)[1].strip()
                elif idx == 7:
                    game['venue'] = child.text.split(':', 1)[1].strip()
                elif idx == 9:
                    game['duration'] = child.text.split(':', 1)[1].strip()
        # Get Player Data from tables
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            data = BeautifulSoup(comment, "lxml")
            for items in data.select("table tr"):
                player_data = [' '.join(item.text.split()) for item in items.select("th,td")]
                print(player_data)
            print('=======================================================')
        # Get Umpire Data
        # Append game data to full list
        games.append(game)
print()
print('Results')
print('*' * 80)
# Print the games harvested to the console
for idx, game in enumerate(games):
    print(str(idx) + ': ' + str(game))
# Write to CSV: one row per game, one column per key collected above
csvfile = "C:/Users/Benny/Desktop/anatest.csv"
fields = ['date', 'start_time', 'venue', 'duration']
with open(csvfile, "w", newline='') as output:
    writer = csv.DictWriter(output, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    writer.writerows(games)
Thanks a lot,
Benny
I used the re module to extract the commented-out section:
from bs4 import BeautifulSoup
import re
data = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>"""
soup = BeautifulSoup(re.search(r'(?<=<!--)(.*?)(?=-->)', data, flags=re.DOTALL)[0], 'lxml')
umpires, time_of_game, attendance, start_time_weather = soup.select('div.section_content > div')
print('ID: ', soup.find('div', class_="section_content")['id'])
print('umpires: ', umpires.text)
print('time of game: ', time_of_game.text)
print('attendance: ', attendance.text)
print('start_time_weather: ', start_time_weather.text)
Output:
ID: div_342042674
umpires: Umpires: HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
time of game: Time of Game: 3:21.
attendance: Attendance: 33,809.
start_time_weather: Start Time Weather: 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
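Since the question is about IDs that differ from page to page, it may help to avoid the ID altogether. The following is a minimal sketch of one possible variation, not the method of the answer above: it assumes every page marks the section with data-label="Other Info" as in the sample, and it uses bs4's Comment type instead of a regex. The function name extract_other_info and the trimmed SAMPLE string are made up for illustration, and html.parser is used only to avoid the lxml dependency ("lxml" should work the same way here).

```python
from bs4 import BeautifulSoup, Comment

# Trimmed copy of the markup from the question (hypothetical test input).
SAMPLE = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
</div>
-->
</div>"""


def extract_other_info(html):
    """Return {label: value} from the commented-out 'Other Info' section."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the section by its label, which does not depend on the numeric ID.
    anchor = soup.find("span", attrs={"data-label": "Other Info"})
    if anchor is None:
        return {}
    wrapper = anchor.find_parent("div", class_="section_wrapper")
    if wrapper is None:
        return {}
    # The real content is an HTML comment somewhere inside the wrapper.
    comment = wrapper.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        return {}
    inner = BeautifulSoup(comment, "html.parser")
    info = {}
    for strong in inner.find_all("strong"):
        label = strong.get_text(strip=True).rstrip(":")
        value = strong.next_sibling  # text node right after <strong>
        info[label] = value.strip() if isinstance(value, str) else ""
    return info


if __name__ == "__main__":
    for label, value in extract_other_info(SAMPLE).items():
        print(label, "->", value)
```

Because the lookup goes through the label and the enclosing wrapper, the same call should work for any page that keeps this layout, whatever the number in id="div_...".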
If you strip the <!-- and --> markers out of the HTML, the content becomes easy to access. Here is one way to do it:
import requests
from bs4 import BeautifulSoup
url = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
res = requests.get(url)
content = res.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(content,"lxml")
umpire, gametime, attendance, weather = soup.find_all(class_="section_content")[2]("strong")
print(f'{umpire.next_sibling}\n{gametime.next_sibling}\n{attendance.next_sibling}\n{weather.next_sibling}\n')
Output:
HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
3:21.
33,809.
70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
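The question mentions that the pages are saved locally, and a positional index such as [2] may not hold for every season. Here is a rough sketch of the same strip-the-comment-markers idea applied to a folder of files, picking the section by its "Other Info" heading instead of by position. The helper names (other_info_from_html, parse_umpires, scan_folder) are hypothetical, UTF-8 file encoding is an assumption, and html.parser stands in for lxml only to keep the example dependency-light.

```python
import os

from bs4 import BeautifulSoup


def other_info_from_html(html):
    """Return {label: value} for the 'Other Info' section of one page."""
    # Drop the comment markers so the hidden markup is parsed as normal HTML.
    content = html.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(content, "html.parser")
    # Select the section by heading text rather than by index or ID.
    heading = soup.find("h2", string="Other Info")
    if heading is None:
        return {}
    wrapper = heading.find_parent("div", class_="section_wrapper")
    section = wrapper.find("div", class_="section_content") if wrapper else None
    if section is None:
        return {}
    info = {}
    for strong in section.find_all("strong"):
        label = strong.get_text(strip=True).rstrip(":")
        value = strong.next_sibling  # text node right after <strong>
        info[label] = value.strip() if isinstance(value, str) else ""
    return info


def parse_umpires(text):
    """Split 'HP - Greg Gibson, 1B - Jerry Layne.' into {'HP': 'Greg Gibson', ...}.

    Assumes names contain no commas, which may not always hold.
    """
    umpires = {}
    for part in text.rstrip(".").split(","):
        position, sep, name = part.partition(" - ")
        if sep:
            umpires[position.strip()] = name.strip()
    return umpires


def scan_folder(path):
    """Map each .html/.shtml file name in path to its 'Other Info' dict."""
    results = {}
    for filename in sorted(os.listdir(path)):
        if filename.endswith((".html", ".shtml")):
            # Encoding is assumed to be UTF-8; adjust if the mirror differs.
            with open(os.path.join(path, filename), encoding="utf-8") as fh:
                results[filename] = other_info_from_html(fh.read())
    return results
```

Calling scan_folder on the folder from the question would give one dictionary per game, and parse_umpires(info["Umpires"]) could then be merged into the game dict before writing the CSV.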