Extract only desired columns from a website
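I have this code that scrapes the following data from IMDb: the top 250 movies, with the fields name, year, and rating. I'm trying to figure out how to extract only the movies that Brad Pitt appears in. I've searched through a lot of similar questions, but none of them really helped. Thanks for your contribution!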
import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Parallel lists, one entry per row of the chart table
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []
for index in range(0, len(movies)):
    # The rank shown on the page is 1-based, so its width is that of index + 1
    rank_width = len(str(index + 1))
    movie_string = movies[index].get_text()
    # Collapse whitespace and drop the dot after the rank
    movie = (' '.join(movie_string.split()).replace('.', ''))
    # Strip the leading rank and the trailing " (YYYY)"
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_width]
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

for item in imdb:
    print(item['place'], '-', item['movie_title'], '(' + item['year'] + ') -', 'Starring:', item['star_cast'])
By post-processing your imdb list, you can do the following to get all the entries that have Brad Pitt in star_cast:
for item in imdb:
    if 'Brad Pitt' in item['star_cast']:
        print(item['place'], '-', item['movie_title'], '(' + item['year'] + ') -', 'Starring:', item['star_cast'])
Output:
11 - Fight Club (1999) - Starring: David Fincher (dir.), Brad Pitt, Edward Norton
20 - Sieben (1995) - Starring: David Fincher (dir.), Morgan Freeman, Brad Pitt
85 - Inglourious Basterds (2009) - Starring: Quentin Tarantino (dir.), Brad Pitt, Diane Kruger
105 - Snatch - Schweine und Diamanten (2000) - Starring: Guy Ritchie (dir.), Jason Statham, Brad Pitt
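
If the same filter is needed for other names, it could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `movies_with` and the two sample records are hypothetical, made up so the snippet runs without a network request, and the records are assumed to have the same shape as the entries in the imdb list. The match is case-insensitive, and a missing `star_cast` value is treated as an empty string.

```python
def movies_with(records, name):
    """Return the records whose 'star_cast' string mentions `name`.

    Hypothetical helper: matching is a case-insensitive substring test,
    and a missing or None 'star_cast' is treated as an empty string.
    """
    needle = name.casefold()
    return [r for r in records
            if needle in (r.get("star_cast") or "").casefold()]


# Hypothetical sample data, shaped like the entries of the imdb list.
sample = [
    {"place": "11", "movie_title": "Fight Club", "year": "1999",
     "star_cast": "David Fincher (dir.), Brad Pitt, Edward Norton"},
    {"place": "1", "movie_title": "The Shawshank Redemption", "year": "1994",
     "star_cast": "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"},
    {"place": "2", "movie_title": "Untitled", "year": "2000",
     "star_cast": None},
]

for item in movies_with(sample, "brad pitt"):
    print(item["place"], "-", item["movie_title"], "(" + item["year"] + ")")
```

With the real data, `movies_with(imdb, 'Brad Pitt')` would take the place of the explicit loop and `if` test. Note that a plain substring test also matches longer names that contain the search string.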