I want to extract IMDb movie IDs using Python

Question

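Here is my code. I want to extract all Bollywood movies, and the project needs the movie titles, cast, crew, IMDb IDs, etc. I can't get all the IMDb IDs: it fails with a NoneType error. It works fine when I use it on a single page, but it raises an error when I use it on multiple pages. Please help.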

#importing the libraries needed 
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
from time import sleep
from random import randint

#declaring empty lists so that the scraped data can be appended to them

movie_name = []
year = []
time=[]
rating=[]
votes = []
description = []
director_s = []
starList= []
imdb_id = []

#the whole core of the script
url = "https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=num_votes,desc&start=1&ref_=adv_nxt"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
movie_data = soup.findAll('div', attrs = {'class': 'lister-item mode-advanced'})

for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)
    
    year_of_release = store.h3.find('span', class_ = "lister-item-year text-muted unbold").text
    year.append(year_of_release)
        
    runtime = store.p.find("span", class_ = 'runtime').text if store.p.find("span", class_ = 'runtime') else " "
    time.append(runtime)
        
    rate = store.find('div', class_ = "inline-block ratings-imdb-rating").text.replace('\n', '') if store.find('div', class_ = "inline-block ratings-imdb-rating") else " "
    rating.append(rate)
        
    value = store.find_all('span', attrs = {'name': "nv"})
        
    vote = value[0].text if store.find_all('span', attrs = {'name': "nv"}) else " "
    votes.append(vote)
        
    # Description of the Movies 
    describe = store.find_all('p', class_ = 'text-muted')
    description_ = describe[1].text.replace('\n', '') if len(describe) > 1 else ' '
    description.append(description_)
        
    ## Director
    director = ' '  # reset per movie so a missing director does not reuse the previous value
    ps = store.find_all('p')
    for p in ps:
        if 'Director' in p.text:
            director = p.find('a').text
    
    director_s.append(director)
    
    ## ID
    imdbID = store.find('span','rating-cancel').a['href'].split('/')[2]
    imdb_id.append(imdbID)

    ## actors
    star = store.find("p", attrs={"class":""}).text.replace("Stars:", "").replace("\n", "").replace("Director:", "").strip()
    starList.append(star)


Error:
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17576/2711511120.py in <module>
     63 
     64         ## IDs
---> 65         imdbID = store.find('span','rating-cancel').a['href'].split('/')[2] if store.find('span','rating-cancel').a['href'].split('/')[2] else ' '
     66         imdb_id.append(imdbID)
     67 

AttributeError: 'NoneType' object has no attribute 'a'

Answer 1

Change your condition to the following, because you first have to check whether the <span> exists:

imdbID = store.find('span','rating-cancel').a.get('href').split('/')[2] if store.find('span','rating-cancel') else ' '
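
The same check can also be written as a small helper so that find() is only called once per movie. This is only a sketch: the function name extract_imdb_id and the HTML snippet are made up for illustration, and the snippet merely imitates the structure the question's selectors expect.

```python
from bs4 import BeautifulSoup

def extract_imdb_id(store):
    """Return the title ID from the rating-cancel link, or ' ' if it is missing."""
    span = store.find('span', 'rating-cancel')
    # find() returns None when the <span> is absent, so check before using .a
    if span is None or span.a is None:
        return ' '
    parts = span.a.get('href', '').split('/')
    return parts[2] if len(parts) > 2 else ' '

# Hypothetical markup: the first item has the <span>, the second does not
html = """
<div class="lister-item mode-advanced">
  <span class="rating-cancel"><a href="/title/tt0986264/vote"></a></span>
</div>
<div class="lister-item mode-advanced"></div>
"""
soup = BeautifulSoup(html, 'html.parser')
ids = [extract_imdb_id(store) for store in soup.find_all('div', class_='lister-item')]
print(ids)
```

Inside the loop from the question, this would be used as imdb_id.append(extract_imdb_id(store)).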

Example

Check this URL; some of the <span> elements are missing here:

import requests
from bs4 import BeautifulSoup

#the whole core of the script
url = "https://www.imdb.com/search/title/?title_type=feature&primary_language=hi&sort=my_ratings,desc"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
movie_data = soup.find_all('div', attrs = {'class': 'lister-item mode-advanced'})

for store in movie_data:
    imdbID = store.find('span','rating-cancel').a.get('href').split('/')[2] if store.find('span','rating-cancel') else ' '
    print(imdbID)

It would be better to scrape the IDs via the image tag, because it is always present, even if it is only a placeholder:

imdbID = store.img.get('data-tconst')
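
A minimal sketch of this approach on a hand-written snippet, so it runs without a network request. The markup and the function name ids_from_page are assumptions for illustration; only the data-tconst attribute on the <img> tag comes from the line above.

```python
from bs4 import BeautifulSoup

def ids_from_page(html_text):
    """Collect the data-tconst value of the first <img> in every list item."""
    soup = BeautifulSoup(html_text, 'html.parser')
    found = []
    for store in soup.find_all('div', attrs={'class': 'lister-item mode-advanced'}):
        img = store.img  # first <img> inside the item, or None if there is none
        found.append(img.get('data-tconst') if img is not None else ' ')
    return found

# Hypothetical markup: one item with a poster, one with only a placeholder image
html = """
<div class="lister-item mode-advanced">
  <a href="/title/tt0000001/"><img data-tconst="tt0000001" src="poster.jpg"></a>
</div>
<div class="lister-item mode-advanced">
  <img data-tconst="tt0000002" src="placeholder.png">
</div>
"""
ids = ids_from_page(html)
print(ids)
```

When scraping several result pages, the same function could be called on requests.get(url).text for each page URL and the returned lists concatenated.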
