Issue scraping dynamic content with mechanize
I'm trying to scrape the search results of a streaming video site. The search results are loaded dynamically, and I think that's why I'm not getting the right results. After I submit the form and get my results.html back, it is always the main page, with the search never actually performed... Any help would be great, and if Mechanize simply doesn't support this, maybe someone can point me in the right direction? Thanks in advance.
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla')]
br.open('http://movietv.to')
br.select_form(nr=0)
br.form.set_all_readonly(False)
br.form.set_value("Godfather", nr=0)
resp = br.submit()
with open('results.html', 'w') as f:
    f.write(resp.read())
Mechanize is not the best fit for this particular "dynamic" site. The simplest high-level approach would be browser automation via selenium.

That said, I got it working with requests and the BeautifulSoup HTML parser. The key things to realize are: 1) the movies are loaded via an XHR request to http://movietv.to/index/loadmovies, and 2) this request requires a token to be sent, which can be extracted from a script element on the main page.
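The token-extraction step can be tried in isolation. The sketch below runs the same regular expression against a made-up sample of the page's inline script; the sample script text and the token value in it are assumptions for illustration only:

```python
import re

# Same pattern used against the main page's inline script
token_pattern = re.compile(r'var token_key="(.*?)";')

# Hypothetical sample of what the page's <script> element might contain
sample_script = 'var player = init();var token_key="abc123def";loadMovies();'

match = token_pattern.search(sample_script)
token = match.group(1) if match else None
print(token)  # abc123def
```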
Complete working implementation:
import re

import requests
from bs4 import BeautifulSoup

search = "Godfather"
token_pattern = re.compile(r'var token_key="(.*?)";')

with requests.Session() as session:  # maintaining web-scraping session
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

    # extract the token from a script element on the main page
    response = session.get("http://movietv.to/")
    soup = BeautifulSoup(response.content, "html.parser")
    script = soup.find("script", text=token_pattern).get_text()

    match = token_pattern.search(script)
    if not match:
        raise ValueError("Cannot find token!")
    token = match.group(1)

    # search for the movies
    response = session.post("http://movietv.to/index/loadmovies", data={
        "loadmovies": "showData",
        "page": "1",
        "abc": "All",
        "genres": "",
        "sortby": "Popularity",
        "quality": "All",
        "type": "movie",
        "q": search,
        "token": token
    })

    soup = BeautifulSoup(response.content, "html.parser")
    for movie in soup.select("div.item"):
        title = movie.find("h2", class_="movie-title")
        print(title.get_text())
Prints the movies found:
The Godfather
The Godfather: Part II
The Godfather: Part III
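Since the loadmovies endpoint is paged ("page": "1" in the form data above), you could reuse the same payload to walk additional result pages. A small hypothetical helper that varies only the page and query fields:

```python
def build_loadmovies_payload(query, page, token):
    """Build the form data for one page of the loadmovies XHR request."""
    return {
        "loadmovies": "showData",
        "page": str(page),
        "abc": "All",
        "genres": "",
        "sortby": "Popularity",
        "quality": "All",
        "type": "movie",
        "q": query,
        "token": token,
    }

# e.g. session.post(url, data=build_loadmovies_payload("Godfather", 2, token))
print(build_loadmovies_payload("Godfather", 2, "TOKEN")["page"])  # 2
```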