Web scraping for IMDB unable to retrieve desired columns

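I tried web scraping the IMDB website. I am looking for the Top 50 Horror Movies, and I want to scrape the movie name, rating, director name, genre, and runtime.

I inspected the element for the movie name.

I inspected the elements for the rating and the director name.

I inspected the elements for the runtime and the genre.

After inspecting the elements for the title, director name, rating, runtime, and genre, I wrote this code: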

import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
#r = requests.get(my_url, headers=headers)#, proxies=proxies)
request=urllib.request.Request(my_url,None,headers)
response = urllib.request.urlopen(request)
page_html = response.read()
page_soup = BeautifulSoup(page_html,"html.parser")
page_soup.h1
page_soup.body.span
containers = page_soup.findAll("div",{"class":"lister-item mode-advanced"})
print(len(containers))

for container in containers:
  title=container.findAll("a",{"class": "lister-item-index unbold-text-primary"})
  rating = container.findAll("div",{"class":"inline-block.ratings-imdb-rating"})
  duration = container.findAll("span",{"class":"runtime"})
  genre = container.findAll("span",{"class":"genre"})
  director = container.findAll("p",{"class":"text-muted"})

print(title)
print(rating)
print(duration)
print(genre)
print(director) 

However, my code is unable to retrieve these attributes.

Output:

50
[]
[]
[<span class="runtime">90 min</span>]
[<span class="genre">
Horror, Mystery, Thriller            </span>]
[<p class="text-muted ">
<span class="runtime">90 min</span>
<span class="ghost">|</span>
<span class="genre">
Horror, Mystery, Thriller            </span>
</p>, <p class="text-muted">
    A decades-old folk tale surrounding a deranged murderer killing those who celebrate Valentine's Day turns out to be true to legend when a group defies the killer's order and people start turning up dead.</p>]

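It would be very helpful if someone could help me figure out what I am missing.

You are not handling the lists correctly. You need to be more specific about the tags and about how you search for the data. Also change findAll to find: findAll returns a list of every match (an empty list when nothing matches), while find returns only the first matching tag.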

import re

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'
page = requests.get(my_url, headers=headers)
page_soup = BeautifulSoup(page.text,"html.parser")
# Same container lookup as in the question: one div per movie
containers = page_soup.find_all("div",{"class":"lister-item mode-advanced"})
for container in containers:
  # find() returns the first matching tag, so .text can be read directly
  print(container.find("a", href=re.compile('adv_li_tt')).text)    # title
  print(container.find("strong").text)                             # rating
  print(container.find("span",{"class":"runtime"}).text)
  print(container.find("span",{"class":"genre"}).text.strip())
  print(container.find('a', href=re.compile('adv_li_dr_0')).text)  # first director
  print('\n')

Output:

Wrong Turn
5.4
109 min
Horror, Thriller
Mike P. Nelson


Willy's Wonderland
5.7
88 min
Action, Comedy, Horror
Kevin Lewis


Red Dot
5.5
86 min
Drama, Horror, Thriller
Alain Darborg
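
The point about lists and class names can be seen on a small fragment. This is only a sketch: the HTML below is hand-written and loosely modelled on the output in the question (the titles and `tt0000001`-style ids are made up), so the real page may differ. It shows why a class value containing a dot matches nothing, and why the values should be collected inside the loop instead of printed after it.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; an assumption for illustration, not real IMDB markup.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">1.</span>
    <a href="/title/tt0000001/?ref_=adv_li_tt">First Movie</a>
  </h3>
  <div class="inline-block ratings-imdb-rating"><strong>5.4</strong></div>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">2.</span>
    <a href="/title/tt0000002/?ref_=adv_li_tt">Second Movie</a>
  </h3>
  <div class="inline-block ratings-imdb-rating"><strong>5.7</strong></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = soup.find_all("div", class_="lister-item")

# A dot between class names is CSS selector syntax; used as a literal
# class value it matches no element, so find_all returns an empty list.
no_match = containers[0].find_all("div", {"class": "inline-block.ratings-imdb-rating"})

# Matching a single class name works, and find() returns one Tag (or None).
rating_tag = containers[0].find("div", class_="ratings-imdb-rating")

# Build one row per container inside the loop, so each iteration's values
# are kept instead of being overwritten by the next one.
rows = []
for container in containers:
    title = container.find("h3").find("a").get_text(strip=True)
    rating = container.find("div", class_="ratings-imdb-rating").get_text(strip=True)
    rows.append((title, rating))

print(no_match)
print(rating_tag.get_text(strip=True))
print(rows)
```

In the question, the print calls sit after the loop, which is why only the values from the last container appeared in the output.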

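HTML is like a tree structure. You want to find the parent nodes and then traverse them to get the content inside. This site is great for practice. The director is the only tricky part, because it sits in a <p> tag that has no attribute to distinguish it, so you need a little logic to get it. (Note that you could use a regular expression to find it, but since you are learning, I wanted to show you a loop.) I have also attached images so you can see where I got these tags and attributes: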

import requests
from bs4 import BeautifulSoup


headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
my_url = 'https://www.imdb.com/search/title/?genres=horror&title_type=feature&explore=genres'

response = requests.get(my_url, headers=headers)
page_html = response.text
page_soup = BeautifulSoup(page_html,"html.parser")


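movies = page_soup.find_all('div',{'class':'lister-item-content'})
for movie in movies:
    title = movie.find('h3').find('a').text
    # Note: the 'certificate' span holds the age certificate (e.g. 15, R), not the IMDb score
    try:
        rating = movie.find('p').find('span', {'class':'certificate'}).text
    except AttributeError:
        # find() returned None because the movie has no certificate yet
        rating = ''
    genre = movie.find('p').find('span', {'class':'genre'}).text.strip()
    try:
        runtime = movie.find('p').find('span', {'class':'runtime'}).text
    except AttributeError:
        # Unreleased movies may have no runtime
        runtime = ''
    # Reset for every movie so a missing director does not reuse the previous one
    director = ''
    ps = movie.find_all('p')
    for p in ps:
        if 'Director' in p.text:
            director = p.find('a').text

    print(title, rating, genre, runtime, director)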

Output:

Wrong Turn 18 Horror, Thriller 109 min Mike P. Nelson
Willy's Wonderland 15 Action, Comedy, Horror 88 min Kevin Lewis
Red Dot 15 Drama, Horror, Thriller 86 min Alain Darborg
Saint Maud 15 Drama, Horror, Mystery 84 min Rose Glass
Freaky 15 Comedy, Horror, Thriller 102 min Christopher Landon
Doctor Strange in the Multiverse of Madness  Action, Adventure, Fantasy  Sam Raimi
Midsommar 18 Drama, Horror, Mystery 148 min Ari Aster
Fear of Rain PG-13 Drama, Horror, Thriller 109 min Castille Landon
The Little Stranger 12A Drama, Horror, Mystery 111 min Lenny Abrahamson
Army of the Dead R Action, Crime, Horror  Zack Snyder
Get Out 15 Horror, Mystery, Thriller 104 min Jordan Peele
Synchronic 15 Drama, Horror, Sci-Fi 102 min Justin Benson
The Rental 15 Drama, Horror, Mystery 88 min Dave Franco
Shadow in the Cloud R Action, Horror, War 83 min Roseanne Liang
Don't Worry Darling  Horror, Thriller  Olivia Wilde
Venom: Let There Be Carnage  Action, Horror, Sci-Fi  Andy Serkis
The Shining 15 Drama, Horror 146 min Stanley Kubrick
The Witch 15 Drama, Horror, Mystery 92 min Robert Eggers
Split 15 Horror, Thriller 117 min M. Night Shyamalan
Hereditary 15 Drama, Horror, Mystery 127 min Ari Aster
Wrong Turn 18 Horror, Thriller 84 min Rob Schmidt
Antebellum 15 Drama, Horror, Mystery 105 min Gerard Bush
Possessor 18 Horror, Sci-Fi, Thriller 103 min Brandon Cronenberg
The New Mutants 15 Action, Horror, Sci-Fi 94 min Josh Boone
Doctor Sleep 15 Drama, Fantasy, Horror 152 min Mike Flanagan
The Invisible Man R Drama, Horror, Mystery 124 min Leigh Whannell
The Meg 12A Action, Horror, Sci-Fi 113 min Jon Turteltaub
Alien X Horror, Sci-Fi 117 min Ridley Scott
The Lighthouse 15 Drama, Fantasy, Horror 109 min Robert Eggers
Scream  Horror, Mystery, Thriller  Matt Bettinelli-Olpin
Run PG-13 Horror, Mystery, Thriller 90 min Aneesh Chaganty
Porno 18 Comedy, Horror 98 min Keola Racela
The Hunt 15 Action, Horror, Thriller 90 min Craig Zobel
Becky 18 Action, Crime, Drama 93 min Jonathan Milott
It 15 Horror 135 min Andy Muschietti
Dark Water 15 Drama, Horror, Mystery 105 min Walter Salles
A Quiet Place Part II 15 Drama, Horror, Sci-Fi 97 min John Krasinski
A Quiet Place 15 Drama, Horror, Sci-Fi 90 min John Krasinski
The Witches PG Adventure, Comedy, Family 106 min Robert Zemeckis
Resident Evil  Action, Horror, Mystery  Johannes Roberts
Us 15 Horror, Mystery, Thriller 116 min Jordan Peele
Psycho Goreman  Comedy, Horror, Sci-Fi 95 min Steven Kostanski
The Empty Man 18 Crime, Drama, Horror 137 min David Prior
From Dusk Till Dawn 18 Action, Crime, Horror 108 min Robert Rodriguez
The Platform 18 Horror, Sci-Fi, Thriller 94 min Galder Gaztelu-Urrutia
The Conjuring 3  Horror, Mystery, Thriller  Michael Chaves
Underwater 15 Action, Horror, Sci-Fi 95 min William Eubank
My Bloody Valentine 18 Horror, Mystery, Thriller 101 min Patrick Lussier
Sputnik 15 Drama, Horror, Sci-Fi 113 min Egor Abramenko
My Bloody Valentine X Horror, Mystery, Thriller 90 min George Mihalka
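
As for the regular-expression route mentioned above, here is a rough sketch of how it might look, together with a small helper that replaces the try/except blocks. The HTML fragment, the names in it, and the helpers `text_or_default` and `find_director` are all made up for illustration; they are not part of BeautifulSoup or of the IMDB page.

```python
import re

from bs4 import BeautifulSoup

# Hand-written fragment (an assumption, not real IMDB markup).
# The second movie has no runtime, like the unreleased titles in the output above.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/">First Movie</a></h3>
  <p class="text-muted"><span class="runtime">90 min</span><span class="genre">Horror</span></p>
  <p class="">Director: <a href="/name/nm0000001/">Jane Doe</a> | Stars: <a href="/name/nm0000002/">John Roe</a></p>
</div>
<div class="lister-item-content">
  <h3><a href="/title/tt0000002/">Second Movie</a></h3>
  <p class="text-muted"><span class="genre">Horror, Thriller</span></p>
  <p class="">Director: <a href="/name/nm0000003/">Alex Poe</a></p>
</div>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a Tag, or default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


def find_director(movie):
    """Return the first linked name in the <p> whose text mentions 'Director'."""
    for p in movie.find_all("p"):
        if re.search(r"\bDirectors?\b", p.get_text()):
            return text_or_default(p.find("a"))
    return ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for movie in soup.find_all("div", class_="lister-item-content"):
    records.append({
        "title": text_or_default(movie.find("h3").find("a")),
        "runtime": text_or_default(movie.find("span", class_="runtime")),
        "genre": text_or_default(movie.find("span", class_="genre")),
        "director": find_director(movie),
    })

for record in records:
    print(record)
```

Keeping each movie as a dict makes it straightforward to turn the results into columns later, and a missing field becomes an empty string instead of raising an AttributeError.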