Stuck in loop <> Code doesn't want to pull anything except row 1
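I'm stuck in a loop and I don't know what to change to make my code work.
The problem involves the CSV file. My file contains a list of domains (freedommortgage.com, google.com, amd.com, etc.). When I run the code, everything is fine at first, but then it keeps printing the same result:
the monthly total visits to freedommortgage.com is 1.10M
So here is my code: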
import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib
from captcha2upload import CaptchaUpload
import time
# setting the firefox driver
def init_driver():
    driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
    driver.implicitly_wait(10)
    return driver
# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
    captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
    urllib.urlretrieve(captcha_src, "captcha.jpg")
    captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84")  # 2captcha API KEY
    captcha_answer = captcha.solve("captcha.jpg")
    wait = WebDriverWait(driver, 10)
    captcha_input_box = wait.until(
        EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
    captcha_input_box.send_keys(captcha_answer)
    driver.implicitly_wait(10)
    captcha_input_box.submit()
# inputting the domain in similar web search box and finding necessary values
def lookup(driver, domain, short_method):
    # short method - inputting the domain in the url
    if short_method:
        driver.get("https://www.similarweb.com/website/" + domain)
    else:
        driver.get("https://www.similarweb.com")
    attempt = 0
    # trying 3 times before quitting (due to second refresh by the website that clears the search box)
    while attempt < 1:
        try:
            captcha_body_page = driver.find_elements_by_class_name("block-page")
            driver.implicitly_wait(10)
            if captcha_body_page:
                print("Captcha ahead, solving the captcha, it may take a few seconds")
                captcha_solver(driver)
                print("Captcha solved! the program will continue shortly")
                time.sleep(20)  # to prevent second refresh affecting the upcoming elements finding after captcha solved
            # for normal method, inputting the domain in the searchbox instead of url
            if not short_method:
                input_element = driver.find_element_by_id("js-swSearch-input")
                input_element.click()
                input_element.send_keys(domain)
                input_element.submit()
            wait = WebDriverWait(driver, 10)
            time.sleep(10)
            total_visits = wait.until(
                EC.presence_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']")))
            total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
            time.sleep(10)
            print('\n' + total_visits_line)
        except TimeoutException:
            print("Box or Button or Element not found in similarweb while checking %s" % domain)
            attempt += 1
            print("attempt number %d... trying again" % attempt)
# main
if __name__ == "__main__":
    with open('bigdomains.csv', 'rt') as f:
        reader = csv.reader(f)
        driver = init_driver()
        for row in reader:
            domain = row[0]
            lookup(driver, domain, True)  # user need to give as a parameter True or False, True will activate the
            # short method, False will take the normal method
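(Sorry for the long code, but I had to show all of it, even though the focus is on the last part.)
My question is simple:
Why does it keep taking the domain from row 1 and ignore rows 2, 3, 4, and so on?
The time.sleep delays must be 10 seconds or more to avoid captcha problems with this website.
If anyone wants to run this, you have to edit the name of the CSV file, and of course the domains in it must be in the bare form google.com (not www.google.com).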
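The bare-domain format can also be enforced in code rather than by editing the file by hand. A minimal sketch, assuming the only things to strip are surrounding whitespace, an optional http:// or https:// scheme, a leading www., and a trailing slash; bare_domain is a hypothetical helper, not part of the script above:

```python
def bare_domain(value):
    """Reduce a CSV cell such as ' www.google.com ' to 'google.com'."""
    domain = value.strip().lower()
    # Assumption: only these prefixes need removing, in this order
    # (scheme first, then a leading "www.").
    for prefix in ("https://", "http://", "www."):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
    return domain.rstrip("/")


print(bare_domain("www.google.com"))
print(bare_domain(" https://www.amd.com/ "))
```

In the main loop this would be applied as domain = bare_domain(row[0]).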
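It looks like you read the same index on every iteration:
domain = row[0]
Index 0 is the first item of the current row, so if all the domains sit on a single line of the CSV file, you keep getting the same value.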
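To check what row[0] returns for a given file layout, here is a minimal sketch that feeds csv.reader two in-memory samples. The sample data and the helper name first_columns are made up for illustration. With one domain per line, row[0] changes on every iteration; with all domains on one line, there is only a single row, and row[0] is always the first domain.

```python
import csv
import io


def first_columns(text):
    """Return row[0] for every non-empty row that csv.reader yields from text."""
    reader = csv.reader(io.StringIO(text))
    return [row[0] for row in reader if row]


# Hypothetical sample data: one domain per line.
one_per_line = "freedommortgage.com\ngoogle.com\namd.com\n"
# Hypothetical sample data: all domains on a single line.
single_line = "freedommortgage.com,google.com,amd.com\n"

print(first_columns(one_per_line))
print(first_columns(single_line))
```

If the real file looks like the first sample, row[0] is not what keeps the output stuck on one domain.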
This post explains another way to use for loops in Python:
Accessing the index in 'for' loops?
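Separately from the indexing, it may be worth checking how the while loop in lookup ends. Assuming attempt += 1 sits inside the except branch (as the posted code suggests), a lookup that succeeds never changes attempt, so the loop would repeat the same domain forever. Below is a minimal sketch of a retry loop that stops after the first success or after a fixed number of failures. The names retry, fetch, and flaky_fetch are hypothetical, and the built-in TimeoutError stands in for Selenium's TimeoutException so the sketch runs without a browser.

```python
def retry(fetch, max_attempts=3):
    """Call fetch() until it succeeds or max_attempts calls have timed out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except TimeoutError:
            print("attempt number %d timed out" % attempt)
        else:
            return result  # success: leave the loop immediately
    return None  # every attempt timed out


# Hypothetical stand-in for a page lookup: fails once, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("element not found")
    return "1.10M"


print(retry(flaky_fetch))
```

In lookup, the equivalent change would be a break (or return) right after the result is printed, so the for loop in the main block can move on to the next row.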