Extract only desired columns from a website

Question

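I have this code that scrapes the following data from IMDb: the top 250 movies, with the fields title, year, and rating. I want to figure out how to extract only the movies that Brad Pitt is in. I have searched through many similar questions, but none of them really helped. Thanks for your contribution!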

import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
imdb = []
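for index in range(len(movies)):
    movie_string = movies[index].get_text()
    # Collapse whitespace and drop dots, giving e.g. "1 The Shawshank Redemption (1994)"
    movie = (' '.join(movie_string.split()).replace('.', ''))
    # The rank shown on the page is index + 1, so measure its width rather than the index's
    rank_width = len(str(index + 1))
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_width]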
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

for item in imdb:
    print(item['place'], '-', item['movie_title'], '(' + item['year'] + ') -', 'Starring:', item['star_cast'])

Answer 1

By post-processing your imdb list, you can do the following to get all results that have Brad Pitt in star_cast:

for item in imdb:
    if 'Brad Pitt' in item['star_cast']:
        print(item['place'], '-', item['movie_title'], '(' + item['year'] + ') -', 'Starring:', item['star_cast'])

Output

11 - Fight Club (1999) - Starring: David Fincher (dir.), Brad Pitt, Edward Norton
20 - Sieben (1995) - Starring: David Fincher (dir.), Morgan Freeman, Brad Pitt
85 - Inglourious Basterds (2009) - Starring: Quentin Tarantino (dir.), Brad Pitt, Diane Kruger
105 - Snatch - Schweine und Diamanten (2000) - Starring: Guy Ritchie (dir.), Jason Statham, Brad Pitt
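
If the same filter is needed for more than one actor, the check can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the name filter_by_star, the case-insensitive matching, and the guard against a missing star_cast value are all assumptions added for illustration. The sample list is hand-written in the same shape as the imdb list from the question, so the snippet runs without a network request.

```python
def filter_by_star(movies, name):
    """Return the entries whose 'star_cast' text mentions `name` (case-insensitive)."""
    needle = name.lower()
    # `or ''` guards against entries where the title attribute was missing (None).
    return [m for m in movies if needle in (m.get('star_cast') or '').lower()]


# Hand-written sample in the same shape as the `imdb` list built in the question.
sample = [
    {'place': '1', 'movie_title': 'The Shawshank Redemption', 'year': '1994',
     'star_cast': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'},
    {'place': '11', 'movie_title': 'Fight Club', 'year': '1999',
     'star_cast': 'David Fincher (dir.), Brad Pitt, Edward Norton'},
    {'place': '20', 'movie_title': 'Sieben', 'year': '1995',
     'star_cast': 'David Fincher (dir.), Morgan Freeman, Brad Pitt'},
]

for item in filter_by_star(sample, 'brad pitt'):
    print(item['place'], '-', item['movie_title'], '(' + item['year'] + ')')
```

With the scraped data, the call would be filter_by_star(imdb, 'Brad Pitt'). Note that this is still a plain substring match on the crew text, so it only finds people listed in that attribute.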
