Instagram web scraping with selenium Python problem
I have a problem scraping all the images from an Instagram profile. I scroll the page to the bottom and then find all the `a` tags, but in the end I always get only the last 30 image links. I think the driver isn't seeing the whole content of the page.
#scroll
scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
match = False
while match == False:
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    if last_count == scrolldown:
        match = True
#posts
posts = []
time.sleep(2)
links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)
It looks like you scroll to the bottom of the page first and only collect the links afterwards, instead of collecting them inside the scrolling loop. By the time you collect them, only the last batch of posts is still in the DOM. So, if you want to get all the links, you should run
links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)
inside the scroll loop, and also once before the first scroll. Like this:
def get_links():
    time.sleep(2)
    links = driver.find_elements_by_tag_name('a')
    time.sleep(2)
    for link in links:
        post = link.get_attribute('href')
        if post and '/p/' in post:  # guard: get_attribute('href') can return None
            posts.add(post)

posts = set()  # a set deduplicates links collected across scroll iterations
get_links()
scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
match = False
while match == False:
    get_links()
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    if last_count == scrolldown:
        match = True
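Two side notes. First, `link.get_attribute('href')` returns `None` for anchors with no `href` attribute, and `'/p/' in None` raises a `TypeError`, so it is worth guarding against that. Second, the filtering step itself needs no browser, so it can be pulled into a plain function and checked in isolation. A minimal sketch (the function name and sample URLs are illustrative, not from your code):

```python
def extract_post_links(hrefs):
    """Keep only Instagram post URLs (those containing '/p/').

    Skips None values, since get_attribute('href') returns None
    for anchors that have no href attribute.
    """
    return {href for href in hrefs if href and '/p/' in href}

# Illustrative hrefs as a scraped page might yield them:
sample = [
    'https://www.instagram.com/p/AbC123/',   # a post link -> kept
    'https://www.instagram.com/someuser/',   # a profile link -> dropped
    None,                                    # anchor without href -> dropped
]
print(extract_post_links(sample))
```

Inside `get_links()` you would then call `posts |= extract_post_links(link.get_attribute('href') for link in links)`.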