Using scrapy to find specific text from multiple websites

Question

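I want to crawl/check multiple websites (on the same domain) for a specific keyword. I found this script, but I can't figure out how to add the keyword to search for. What the script needs to do is find the keyword and report the link where it was found. Could anyone point me to where I can read more about this? I have been reading scrapy's documentation, but I can't seem to find it there.

Thanks.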

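import scrapy
from scrapy import Request

# URL, starting_number and number_of_pages are defined elsewhere in the script (not shown).

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own setup before adding attributes.
        super().__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # parsing data from the webpage
        pass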

Answer 1

You need to use a parser or a regular expression to find the text you are looking for in the response body.

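Every scrapy callback method receives the response body inside the response object, which you can inspect with response.body (for example, inside the parse method). From there you will have to use a regex or, better, xpath or css selectors to reach the path of your text, once you know the xml structure of the page you crawled.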
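The regex route can be sketched roughly as below. This is only an illustration under assumptions: find_keywords and the KEYWORDS list are hypothetical names that do not appear in the original script, and the text passed in would typically be response.text inside a callback.

```python
import re

# Hypothetical keyword list; replace with the terms you are looking for.
KEYWORDS = ["scrapy", "crawler"]


def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur anywhere in text, ignoring case."""
    found = []
    for keyword in keywords:
        # re.escape makes characters such as '.' match literally.
        if re.search(re.escape(keyword), text, flags=re.IGNORECASE):
            found.append(keyword)
    return found
```

Inside parse, one option is to call find_keywords(response.text) and, when the returned list is not empty, yield a dict such as {"url": response.url, "keywords": found}, so the output records the link where each keyword was found.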

Scrapy lets you use the response object as a selector, so you can get to the page title with response.xpath('//head/title/text()').

Hope it helps.

web-crawler

keyword

extraction

scrapy

web