Game of Thrones Wikipedia Python scraper

Question

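Hey, I'm working on my second school project, a Python scraper built with the help of BeautifulSoup. My assignment is as follows: I should assemble an application that scrapes the data from Wikipedia and gives the total view count for all seasons of GoT. As a bonus, the application could also list the views episode by episode, show the total for each season, and print the grand total at the end.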

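Like this: S01E1: 2.22 million, S01E2: 2.20 million, ... Total views for season 1: xy

Grand total: 398.7 million

Somehow I only managed the grand total.

If anyone has done something similar, please help :) Thanks a lot: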

import re
import urllib

from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)

seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0

for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)

    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})

    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})

        if episode_rows:
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]

                # strip footnote markers such as "[5]" from the cell text, then convert to float
                views += float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))

print 'The total number of views is ' + str(views) + ' millions'

Answer 1

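The parsing itself needs no extra work. All I had to do was work out how to print the results on screen in the format you asked for in the question, which is more a matter of string manipulation.

Code: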

import re
import urllib
from bs4 import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html, 'html.parser')
seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0
total = 0
season_num = 1
for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html,'html.parser')
    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
        if episode_rows:
            episode_num = 1
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]
                # strip footnote markers such as "[5]" from the cell text, then convert to float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                total += views  # reuse the parsed value instead of running the regex twice
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1
    season_num += 1

print 'The total number of views is ' + str(total) + ' millions'

Output:

S1E1 : 2.22 Millions
S1E2 : 2.2 Millions
S1E3 : 2.44 Millions
S1E4 : 2.45 Millions
S1E5 : 2.58 Millions
S1E6 : 2.44 Millions
S1E7 : 2.4 Millions
S1E8 : 2.72 Millions
S1E9 : 2.66 Millions
S1E10 : 3.04 Millions
S2E1 : 3.86 Millions
S2E2 : 3.76 Millions
S2E3 : 3.77 Millions
S2E4 : 3.65 Millions
S2E5 : 3.9 Millions
S2E6 : 3.88 Millions
S2E7 : 3.69 Millions
S2E8 : 3.86 Millions
S2E9 : 3.38 Millions
S2E10 : 4.2 Millions
S3E1 : 4.37 Millions
S3E2 : 4.27 Millions
S3E3 : 4.72 Millions
S3E4 : 4.87 Millions
S3E5 : 5.35 Millions
S3E6 : 5.5 Millions
S3E7 : 4.84 Millions
S3E8 : 5.13 Millions
S3E9 : 5.22 Millions
S3E10 : 5.39 Millions
S4E1 : 6.64 Millions
S4E2 : 6.31 Millions
S4E3 : 6.59 Millions
S4E4 : 6.95 Millions
S4E5 : 7.16 Millions
S4E6 : 6.4 Millions
S4E7 : 7.2 Millions
S4E8 : 7.17 Millions
S4E9 : 6.95 Millions
S4E10 : 7.09 Millions
S5E1 : 8.0 Millions
S5E2 : 6.81 Millions
S5E3 : 6.71 Millions
S5E4 : 6.82 Millions
S5E5 : 6.56 Millions
S5E6 : 6.24 Millions
S5E7 : 5.4 Millions
S5E8 : 7.01 Millions
S5E9 : 7.14 Millions
S5E10 : 8.11 Millions
S6E1 : 7.94 Millions
S6E2 : 7.29 Millions
S6E3 : 7.28 Millions
S6E4 : 7.82 Millions
S6E5 : 7.89 Millions
S6E6 : 6.71 Millions
S6E7 : 7.8 Millions
S6E8 : 7.6 Millions
S6E9 : 7.66 Millions
S6E10 : 8.89 Millions
S7E1 : 10.11 Millions
S7E2 : 9.27 Millions
S7E3 : 9.25 Millions
S7E4 : 10.17 Millions
S7E5 : 10.72 Millions
S7E6 : 10.24 Millions
S7E7 : 12.07 Millions
The total number of views is 398.73 millions
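Note that str() prints 2.2 where the question shows 2.20, and the season number is not zero-padded. If you want the labels exactly as in the question, something along these lines should do it. This is only a sketch: the helper name format_episode is my own, and the "million" wording is an assumption about the format you want. A print(...) call with a single argument behaves the same in Python 2.7 and Python 3.

```python
def format_episode(season_num, episode_num, views):
    # Hypothetical helper: zero-pad the season to two digits and
    # always show two decimals, so 2.2 is printed as "2.20".
    return 'S{:02d}E{} : {:.2f} million'.format(season_num, episode_num, views)


print(format_episode(1, 1, 2.22))
print(format_episode(1, 2, 2.2))
```

You would then call format_episode(season_num, episode_num, views) inside the inner loop instead of concatenating the pieces with str().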

Answer 2

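You can do it the way Ali showed you, except that you shouldn't only add everything into one sum: print each value and also add it to a separate variable, which in my example is totalViewsPerSeason.

Working solution: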

import re
import urllib

from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)

seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0
grandTotalViews = 0
season_num = 1

for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)

    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})

    # reset for every season, so the print below never sees an undefined or stale value
    totalViewsPerSeason = 0

    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})

        if episode_rows:
            episode_num = 1
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]

                # strip footnote markers such as "[5]" from the cell text, then convert to float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                grandTotalViews += views
                totalViewsPerSeason += views
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1

    print "Total season " + str(season_num) + " views: " + str(totalViewsPerSeason) + " Millions\n"
    season_num += 1

print 'The total number of views is ' + str(grandTotalViews) + ' millions'
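If you want to check the per-season logic without hitting Wikipedia every time, you can run it on a few hard-coded cell texts first. The following is only a rough sketch: sample_cells, parse_views and summarize are names I made up, the footnote numbers in the sample are invented, and the stricter pattern that removes any bracketed marker is my own substitution for the regex used above.

```python
import re

# Made-up sample of the last-cell text per episode, keyed by season number.
# The "[5]"-style suffixes imitate Wikipedia footnote markers (numbers invented).
sample_cells = {
    1: ['2.22[5]', '2.20[6]', '2.44[7]'],
    2: ['3.86[4]', '3.76[5]'],
}


def parse_views(cell_text):
    # Drop every bracketed footnote marker, then convert what is left to float.
    return float(re.sub(r'\[[^\]]*\]', '', cell_text).strip())


def summarize(cells_by_season):
    # Build the output lines: one per episode, one per season total,
    # and a final line with the grand total.
    lines = []
    grand_total = 0.0
    for season_num in sorted(cells_by_season):
        season_total = 0.0
        for episode_num, cell_text in enumerate(cells_by_season[season_num], 1):
            views = parse_views(cell_text)
            season_total += views
            lines.append('S{}E{} : {:.2f} million'.format(season_num, episode_num, views))
        lines.append('Total season {} views: {:.2f} million'.format(season_num, season_total))
        grand_total += season_total
    lines.append('The total number of views is {:.2f} million'.format(grand_total))
    return lines


for line in summarize(sample_cells):
    print(line)
```

Keeping the parsing in parse_views and the summing in summarize means the scraping loop only has to collect the cell texts per season and hand them over.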

python

wikipedia

beautifulsoup

web-scraping