How to extract data from a secure website like Bloomberg
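I am trying to scrape items from this URL:
"https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker"
I only want the title and the publication date.
Any sample code would help, even something based on Splash.
This is what I have tried so far: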
def parse(self, response):
    # Fetch the article through Crawlera, reusing the current session.
    yield scrapy.Request(
        'https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker',
        headers={
            'X-Crawlera-Use-HTTPS': '1',
            'X-Crawlera-Timeout': 40000,
            'X-Crawlera-Max-Retries': 5,
            'X-Crawlera-Cookies': 'disable',
            'X-Crawlera-Session': self.session_id,
        },
        callback=self.parse_sub,
    )

def parse_sub(self, response):
    # Candidate selectors for the headline, the Open Graph title and the timestamp.
    response.xpath("//h1[@class = 'lede-text-v2__hed']").extract_first()
    response.xpath("//meta[@property = 'og:title']/@content").extract_first()
    response.xpath("//time[@class = 'article-timestamp']/@datetime").extract_first()
    # Dump the raw body to see what the site actually returned.
    print(response.text)
I am also using Crawlera, but the site keeps detecting me as a bot.
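To extract the title, i.e. Tesla Dealt Another Blow When Barclays Calls It a ‘Niche Carmaker’, and the publication date, i.e. May 30, 2019, 5:26 PM GMT+5:30, using only Selenium, you have to use WebDriverWait with visibility_of_element_located() so that each element is visible before you read it.
You can use the following solution:
Code block: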
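For intuition, an explicit wait of this kind is essentially a polling loop: the condition is checked repeatedly until it returns something truthy or the timeout expires (WebDriverWait polls every 0.5 seconds by default). The sketch below is only an illustration of that pattern in plain Python, not Selenium's implementation; `wait_until`, `WaitTimeout` and `make_fake_lookup` are hypothetical names, and the fake lookup stands in for a real page query.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


def make_fake_lookup(ready_after):
    """Stand-in for a page lookup: returns None until called `ready_after` times."""
    calls = {"count": 0}

    def lookup():
        calls["count"] += 1
        return "headline text" if calls["count"] >= ready_after else None

    return lookup


if __name__ == "__main__":
    # The "element" only shows up on the third poll.
    print(wait_until(make_fake_lookup(3), timeout=2.0, poll_interval=0.01))
```

With Selenium, the condition would be something like `EC.visibility_of_element_located(...)` applied to the driver, and the return value would be the located element.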
from selenium import webdriver

# Selenium 3 style; newer Selenium releases take a Service object instead of executable_path.
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://www.bloomberg.com/news/articles/2019-05-30/tesla-dealt-another-blow-as-barclays-sees-it-as-niche-carmaker')
# Wait up to 10 seconds for the headline to become visible, then print it.
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='markets']//following:: h1[1]"))).get_attribute("innerHTML"))
# Wait for the publication timestamp below the headline and print it.
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='markets']//following:: h1[1]//following::div[@class='lede-text-v2__times']/time[@itemprop='datePublished']"))).get_attribute("innerHTML"))
driver.quit()
Console output:
Tesla Dealt Another Blow When Barclays Calls It a ‘Niche Carmaker’
May 30, 2019, 5:26 PM GMT+5:30
Note: the snippet needs the following imports:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
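The date printed above is display text rather than a machine-readable value. If a proper timestamp is needed, the string can be parsed with the standard library. This is a minimal sketch that assumes the format stays exactly as in the console output (English month name, 12-hour clock, a `GMT+H:MM` suffix); `parse_timestamp` is a hypothetical helper, not part of Selenium.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_timestamp(text):
    """Parse a string like 'May 30, 2019, 5:26 PM GMT+5:30' into an aware datetime."""
    match = re.fullmatch(r"(.+?)\s+GMT([+-])(\d{1,2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unexpected timestamp format: {text!r}")
    local_part, sign, hours, minutes = match.groups()

    # Build the UTC offset from the 'GMT+5:30' suffix.
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset

    # %B and %p assume English month names and AM/PM markers.
    naive = datetime.strptime(local_part, "%B %d, %Y, %I:%M %p")
    return naive.replace(tzinfo=timezone(offset))


if __name__ == "__main__":
    published = parse_timestamp("May 30, 2019, 5:26 PM GMT+5:30")
    print(published.isoformat())
    print(published.astimezone(timezone.utc).isoformat())
```

Converting to UTC, as in the last line, makes timestamps comparable when the page is rendered for a different time zone.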