PYTHON scrapy selenium WebDriverWait
Experts here, I'm looking for your help if you don't mind. Recently I have been working on a web crawler with scrapy and selenium in Python, and it is breaking my heart.
I just want to ask: is it possible that, even though you use the statement
WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH, xxxxx)))
to get those elements, the result still comes back empty? What's more, it doesn't even take the full 100 seconds before it comes back empty. Why?
By the way, it is random, meaning the phenomenon can happen anywhere, anytime.
Is the emptiness related to my network connection?
Could you help me with the problem above, or give me some opinions and suggestions?
Thanks a lot!
-------------------- Supplementary note --------------------
Thanks for the reminder.
To summarize: I use scrapy and selenium to crawl a review site and write the username, post time, review content, and so on into an .xlsx file. I want it to run as fast as possible while still collecting the complete information.
On a page with many reviews, long review texts are collapsed, which means that nearly 20 reviews per page each have their own expand button. So I need selenium to click the expand button and then use the driver to grab the full review. Common sense suggests that it takes a little time to load after the expand button is clicked, and I suppose the time needed depends on the network speed, so using WebDriverWait here seemed like a sensible choice. In my practice, the default parameters timeout=10 and poll_frequency=0.5 seemed too slow and error-prone, so I considered the setting timeout=100 and poll_frequency=0.1.
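For reference, a minimal sketch of how those parameters map onto the WebDriverWait constructor, using the unfold-link locator from my code below. As far as I understand, presence_of_all_elements_located succeeds as soon as matching nodes exist in the DOM, so even a long timeout cannot guarantee that their text has finished loading:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
# Poll the DOM every 0.1 s; give up after 100 s.
wait = WebDriverWait(driver, timeout=100, poll_frequency=0.1)
# NOTE: "presence" succeeds as soon as matching nodes exist in the DOM;
# it does not guarantee their text has been rendered yet, so the
# returned elements can still have empty .text at this point.
unfolds = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//a[@class='unfold']")))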
But the problem is that every time I run the project from cmd with scrapy crawl spider, a few of the crawled reviews always come back empty, and which ones are empty differs from run to run. I have considered using time.sleep() to force a pause, but doing that for every page would cost a lot of time, even though it is admittedly a more reliable way to get the complete information. Besides, it looks inelegant and a bit clumsy to me.
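One alternative that avoids fixed sleeps: WebDriverWait.until() accepts any callable that takes the driver, so one can wait until the matched elements actually carry non-empty text. A minimal sketch; the helper name and the locator here are only for illustration:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

def all_texts_nonempty(locator):
    # Custom wait condition: succeed only once every matched element
    # actually carries non-empty text, not merely exists in the DOM.
    def check(driver):
        elements = driver.find_elements(*locator)
        if elements and all(el.text.strip() for el in elements):
            return elements
        return False
    return check

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
contents = WebDriverWait(driver, 10, 0.1).until(
    all_texts_nonempty((By.XPATH, "//div[@class='review-content clearfix']")))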
Have I expressed my question clearly?
-------------------- Adding something --------------------
What I mean by "I got somewhere empty" is shown in the picture below.
-------------------- Adding my code -------------------- 2022/5/18
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
full_content, words = [], []
unfolds = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//a[@class='unfold']")))
# Here's how I think about and design my loop body:
# click the expand button, grab the text, fold it back, then move on to the next one.
for i in range(len(unfolds)):
    unfolds[i].click()
    time.sleep(1)
    # After the javascript runs, div[@class='review-content clearfix'] appears,
    # and some of the full review content is placed in a <p></p> tag
    find_full_content_p = WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='review-content clearfix']/p")))
    full_content_p = [j.text for j in find_full_content_p]
    # and some of it is placed in div[@class='review-content clearfix'] itself
    find_full_content_div = WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='review-content clearfix']")))
    full_content_div = [j.text for j in find_full_content_div]
    # merge the two lists
    full_content_p.extend(full_content_div)
    full_content.append("".join(full_content_p))
    words.append(len("".join(full_content_p)))
    time.sleep(1)
    # then fold it back
    WebDriverWait(driver, 100, 0.1).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='fold']"))).click()
driver.close()
df = pd.DataFrame({"users": users, "dates": dates, "full_content": full_content, "words": words})
print(df)
AND here is the code from an expert I truly respect, named sound wave (slightly modified by me; the core code is unchanged):
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
# from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews, words = [], []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        # the hidden sibling div holds the full review; wait until it is revealed
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    words.append(len(review.text))
    print('done', len(reviews), end='\r')
df = pd.DataFrame({"users": users, "dates": dates, "reviews": reviews, "words": words})
print(df)
New
Adding code for the site douban. To export the scraped data to csv, see the pandas code in the Old section below.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('...'))
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews = []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    print('done', len(reviews), end='\r')
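For completeness, a minimal sketch of that export applied to the lists collected above; the output filename is only an example:

import pandas as pd

df = pd.DataFrame(list(zip(users, dates, reviews)), columns=['user', 'date', 'review'])
df.to_csv('douban_reviews.csv', index=False)  # example filename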
Old
For the site you mentioned (imdb.com), there is no need to click the "show more" buttons in order to scrape the hidden content, because the text is already loaded in the HTML code; it is simply not displayed on the page. Hence you can scrape all the reviews at once. The code below stores users, dates and reviews in separate lists, and finally saves the data to a .csv file.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.get('https://www.imdb.com/title/tt1683526/reviews')
# sets a maximum waiting time for .find_element() and similar commands
driver.implicitly_wait(10)
reviews = [el.get_attribute('innerText') for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.display-name-link')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.review-date')]
# store data in a csv file
import pandas as pd
df = pd.DataFrame(list(zip(users,dates,reviews)), columns=['user','date','review'])
df.to_csv(r'C:\Users\your_name\Desktop\data.csv', index=False)
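A note on why the reviews are read with get_attribute('innerText') rather than .text: Selenium's .text returns only rendered text, so for hidden nodes it can come back empty, while reading the innerText attribute still yields the text content. A minimal illustration, reusing the div.text locator from the code above:

# .text only returns rendered text, so it can be '' for a hidden node;
# the innerText attribute still yields the full text content.
el = driver.find_element(By.CSS_SELECTOR, 'div.text')
print(repr(el.text))
print(repr(el.get_attribute('innerText')))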
To print a single review, you can do
i = 0
print(f'User: {users[i]}\nDate: {dates[i]}\n{reviews[i]}')
The output (truncated) is
User: dschmeding
Date: 26 February 2012
Wow! I was not expecting this movie to be this engaging. Its one of those films...