Selenium Instagram scraper duplication

Question

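I am trying to scrape Instagram by hashtag (in this case, dog) using Selenium. The script does two things:

Scroll to load images
Get the post links of the loaded images

But I realized that most of the links are duplicates (see the last 3 lines of the code). I don't know what the problem is. I have also tried many Instagram scraping libraries, but all of them either raise errors or cannot search by hashtag.
I am scraping Instagram to collect image data for my deep learning classifier model. I would also like to know whether there is a better way to scrape Instagram.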

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains as AC

driver = webdriver.Edge("msedgedriver.exe")
driver.get("https://www.instagram.com")

tag = "dog"
numberOfScrolls = 70

### Login Section ###

time.sleep(3)
username_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[1]/div/label/input')
username_field.send_keys("myusername")

password_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[2]/div/label/input')
password_field.send_keys("mypassword")
time.sleep(1)

driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]').click()
time.sleep(5)

### Scraping Section ###

link = "https://www.instagram.com/explore/tags/" + tag
driver.get(link)
time.sleep(5)
Links = []
for i in range(numberOfScrolls):
    AC(driver).send_keys(Keys.END).perform()  # scrolls to the bottom of the page
    time.sleep(1)
    for x in range(1, 8):
        try:
            row = driver.find_element_by_xpath(
                '//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
            row = row.find_elements_by_tag_name("a")
            for element in row:
                if element.get_attribute("href") is not None:
                    print(element.get_attribute("href"))
                    Links.append(element.get_attribute("href"))
        except:
            continue

print(len(Links))
Links = list(set(Links))
print(len(Links))

Answer 1

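I found my mistake.

row = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
Specifically, in this part, str(i) should be str(x): the row index has to come from the inner loop variable x, not the scroll counter i. Because the inner loop never used x, each of its seven passes re-read the same row i, which is why most of the links were duplicates.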
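
The fix can be sketched in isolation as follows. This is only an illustration, not the asker's exact script: `collect_links` and `fetch_row_links` are hypothetical names introduced here, and `fetch_row_links` stands in for the Selenium XPath lookup so that the indexing logic can be run without a browser. The sketch also removes duplicates as it goes, keeping the order in which links were first seen, instead of calling `set()` at the end.

```python
from typing import Callable, Iterable, List, Optional


def collect_links(fetch_row_links: Callable[[int], Iterable[Optional[str]]],
                  number_of_scrolls: int,
                  rows_per_scroll: int = 7) -> List[str]:
    """Collect unique post links, preserving first-seen order.

    fetch_row_links is a hypothetical stand-in for the Selenium lookup:
    it takes a 1-based row index and returns the hrefs found in that row,
    raising an exception if the row does not exist.
    """
    seen = set()
    links = []
    for _ in range(number_of_scrolls):
        # In the real script, scroll here (send Keys.END) and wait.
        for x in range(1, rows_per_scroll + 1):
            try:
                hrefs = fetch_row_links(x)  # index by x, not the scroll counter
            except Exception:
                continue  # row not present; skip it
            for href in hrefs:
                if href is not None and href not in seen:
                    seen.add(href)
                    links.append(href)
    return links
```

In the original script, `fetch_row_links(x)` would wrap the `find_element_by_xpath(... + str(x) + ']')` call and return `get_attribute("href")` for each `a` element in that row.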

python

selenium

screen-scraping

web-scraping

instagram