Game of Thrones Wikipedia Python scraper
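Hey, I'm working on my second school project, a Python scraper built with the help of BeautifulSoup. My assignment is as follows: I'm supposed to assemble an application that scrapes data from Wikipedia and reports the total number of viewers across all seasons of GoT. As a bonus, the application should show each season's total before the grand total, list the number of viewers episode by episode, and print the grand total at the end.
Like this:
S01E1: 2.22 million
S01E2: 2.20 million
.
.
.
Total views for season 1: xy
Grand total: 398.7 million
Somehow I only managed to get the grand total.
If anyone has done something similar, please help :)
Thanks a lot: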
# Python 2 (urllib.urlopen, print statement) with BeautifulSoup 3
import re
import urllib
from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)
seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
# Links to the individual season articles
seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
views = 0
for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)
    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
        if episode_rows:
            for episode_row in episode_rows:
                # The last cell of each row holds the viewers (in millions)
                episode_views = episode_row.findAll('td')[-1]
                # Strip footnote markers such as [12] with a regex, then convert to float
                views += float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
print 'The total number of views is ' + str(views) + ' millions'
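No changes to the parsing were needed. All I had to do was work out how to print the results in the format you asked for in the question, which is mostly string manipulation.
Code: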
# Python 2 (urllib.urlopen, print statement) with bs4
import re
import urllib
from bs4 import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html, 'html.parser')
seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
# Links to the individual season articles
seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
views = 0
total = 0
season_num = 1
for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html, 'html.parser')
    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
        if episode_rows:
            episode_num = 1
            for episode_row in episode_rows:
                # The last cell of each row holds the viewers (in millions)
                episode_views = episode_row.findAll('td')[-1]
                # Strip footnote markers such as [12] with a regex, then convert to float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                total += views
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1
    season_num += 1
print 'The total number of views is ' + str(total) + ' millions'
Output:
S1E1 : 2.22 Millions
S1E2 : 2.2 Millions
S1E3 : 2.44 Millions
S1E4 : 2.45 Millions
S1E5 : 2.58 Millions
S1E6 : 2.44 Millions
S1E7 : 2.4 Millions
S1E8 : 2.72 Millions
S1E9 : 2.66 Millions
S1E10 : 3.04 Millions
S2E1 : 3.86 Millions
S2E2 : 3.76 Millions
S2E3 : 3.77 Millions
S2E4 : 3.65 Millions
S2E5 : 3.9 Millions
S2E6 : 3.88 Millions
S2E7 : 3.69 Millions
S2E8 : 3.86 Millions
S2E9 : 3.38 Millions
S2E10 : 4.2 Millions
S3E1 : 4.37 Millions
S3E2 : 4.27 Millions
S3E3 : 4.72 Millions
S3E4 : 4.87 Millions
S3E5 : 5.35 Millions
S3E6 : 5.5 Millions
S3E7 : 4.84 Millions
S3E8 : 5.13 Millions
S3E9 : 5.22 Millions
S3E10 : 5.39 Millions
S4E1 : 6.64 Millions
S4E2 : 6.31 Millions
S4E3 : 6.59 Millions
S4E4 : 6.95 Millions
S4E5 : 7.16 Millions
S4E6 : 6.4 Millions
S4E7 : 7.2 Millions
S4E8 : 7.17 Millions
S4E9 : 6.95 Millions
S4E10 : 7.09 Millions
S5E1 : 8.0 Millions
S5E2 : 6.81 Millions
S5E3 : 6.71 Millions
S5E4 : 6.82 Millions
S5E5 : 6.56 Millions
S5E6 : 6.24 Millions
S5E7 : 5.4 Millions
S5E8 : 7.01 Millions
S5E9 : 7.14 Millions
S5E10 : 8.11 Millions
S6E1 : 7.94 Millions
S6E2 : 7.29 Millions
S6E3 : 7.28 Millions
S6E4 : 7.82 Millions
S6E5 : 7.89 Millions
S6E6 : 6.71 Millions
S6E7 : 7.8 Millions
S6E8 : 7.6 Millions
S6E9 : 7.66 Millions
S6E10 : 8.89 Millions
S7E1 : 10.11 Millions
S7E2 : 9.27 Millions
S7E3 : 9.25 Millions
S7E4 : 10.17 Millions
S7E5 : 10.72 Millions
S7E6 : 10.24 Millions
S7E7 : 12.07 Millions
The total number of views is 398.73 millions
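The regex step in the code above removes Wikipedia footnote markers such as [12] from the viewers cell before the text is converted to a float. A minimal sketch of that step on its own, written for Python 3: the helper name parse_views and the sample strings are made up for illustration, and the pattern is slightly stricter than the one in the answer because it always requires the opening bracket.

```python
import re

# Matches a Wikipedia-style footnote marker such as "[12]".
# Assumption: markers always have both brackets; the answer's pattern
# also tolerates a missing opening bracket.
FOOTNOTE = re.compile(r'\[[0-9]+\]')


def parse_views(cell_text):
    """Strip footnote markers and return the viewer count (in millions) as a float."""
    return float(FOOTNOTE.sub('', cell_text).strip())


# Hypothetical cell contents, not scraped from the live page
print(parse_views('2.22[12]'))
print(parse_views(' 10.11[3] '))
```

Keeping the cleanup in one small function makes it easy to check on a few strings without fetching any pages.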
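You can do it the way Ali told you, except that besides adding each episode's views to the grand total and printing them, you also sum them into a separate per-season variable, which in my example is:
totalViewsPerSeason
Working solution: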
# Python 2 (urllib.urlopen, print statement) with BeautifulSoup 3
import re
import urllib
from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)
seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
# Links to the individual season articles
seasons = seasons_table.findAll('a', attrs={'href': re.compile(r'/wiki/Game_of_Thrones_\(season_?[0-9]+\)')})
views = 0
grandTotalViews = 0
season_num = 1
for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)
    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
        if episode_rows:
            episode_num = 1
            totalViewsPerSeason = 0  # reset for every season
            for episode_row in episode_rows:
                # The last cell of each row holds the viewers (in millions)
                episode_views = episode_row.findAll('td')[-1]
                # Strip footnote markers such as [12] with a regex, then convert to float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                grandTotalViews += views
                totalViewsPerSeason += views
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1
            print "Total season " + str(season_num) + " views: " + str(totalViewsPerSeason) + " Millions\n"
    season_num += 1
print 'The total number of views is ' + str(grandTotalViews) + ' millions'
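The two-accumulator idea (one total reset per season, one grand total) can be tried without any scraping. A rough Python 3 sketch under that assumption: build_report is a hypothetical helper, the sample numbers are just the first two episodes of seasons 1 and 2 from the output earlier in the thread, and the S01E1-style label follows the format in the question.

```python
def build_report(seasons):
    """Return report lines for a list of seasons.

    Each season is a list of per-episode viewer counts in millions.
    """
    lines = []
    grand_total = 0.0
    for season_num, episodes in enumerate(seasons, start=1):
        season_total = 0.0  # reset for every season
        for episode_num, views in enumerate(episodes, start=1):
            season_total += views
            lines.append('S{:02d}E{}: {:.2f} million'.format(season_num, episode_num, views))
        grand_total += season_total
        lines.append('Total season {} views: {:.2f} million'.format(season_num, season_total))
    lines.append('Grand total: {:.2f} million'.format(grand_total))
    return lines


# Sample data only: two episodes each from seasons 1 and 2
sample = [[2.22, 2.20], [3.86, 3.76]]
for line in build_report(sample):
    print(line)
```

Formatting with {:.2f} keeps trailing zeros (2.20 rather than 2.2) and hides float rounding noise in the sums.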