How to render an asynchronous page with requests-html in a multithreaded environment?
To build a scraper for pages that load their content dynamically, requests-html provides a way to obtain the rendered page after its JavaScript has executed. However, when using AsyncHTMLSession and calling the arender() method inside a multithreaded implementation, the resulting HTML does not change.
For example, at the URL used in the code below, the table values in the HTML are empty by default; after the page's scripts run, which arender() is supposed to simulate, the values are expected to be inserted into the markup. Yet no visible change appears in the output.
from pprint import pprint
#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content

def parseWebpage(page):
    print(page)

async def get_data_asynchronous():
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]
    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch`
            # Initialize the event loop
            loop = asyncio.get_event_loop()
            # Use list comprehension to create a list of
            # tasks to complete. The executor will run the `fetch`
            # function for each url in the urls list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url)  # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]
            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)

main()
Inspecting the library's source shows that the result of the render methods is not exposed through the response's content attribute, but through the raw_html attribute of the HTML object. In this case, the value to return is r.html.raw_html.