Web Scraping Language: How to do a paginated crawl?
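I am trying to run the script below: go to flipkart, crawl all the product links, and extract the product, price and description. However, it only crawls one page. I want to repeat the crawl across all pages, e.g. pages 1, 2, 3, and so on.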
GOTO flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
CRAWL //div[2]/div[2]/div[1]/div//div[1]/a[@class="_2cLu-l"][1]
EXTRACT {
"product": "//span[@class=\"_35KyD6\"][1]",
"price": "//div[@class=\"_1vC4OE _3qQ9m1\"][1]",
"description": "//div[@class=\"_3u-uqB\"][1]"
}
You need to add a paginator, [[xpath_for_nextpage_element]], in front of the crawl XPath. In this case the XPath for the "next page" link is //nav/a[11]/span. Wrap it in [[ and ]] and put it right after the CRAWL keyword, before the product-link XPath. So we get: [[//nav/a[11]/span]]
GOTO flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
CRAWL [[//nav/a[11]/span]] //div[2]/div[2]/div[1]/div//div[1]/a[@class="_2cLu-l"][1]
EXTRACT {
"product": "//span[@class=\"_35KyD6\"][1]",
"price": "//div[@class=\"_1vC4OE _3qQ9m1\"][1]",
"description": "//div[@class=\"_3u-uqB\"][1]"
}
This is effectively a crawler that scrapes the information for all products, across every page of results.
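To make the idea concrete, here is a rough Python sketch of what a paginated crawl does conceptually: collect the product links on the current page, follow the "next page" element, and repeat until there is no next page. This is not how Web Scraping Language is implemented; the paginated_crawl function, the fetch callback and the in-memory SITE dictionary are all hypothetical stand-ins for illustration.

```python
from typing import Callable, Dict, Iterator, Optional, Set

# Hypothetical page shape: the product links found on the page, plus the URL
# behind the "next page" element (None on the last page).
Page = Dict[str, object]


def paginated_crawl(start: str, fetch: Callable[[str], Page]) -> Iterator[str]:
    """Yield product links from every page, following "next" until it runs out."""
    seen: Set[str] = set()
    url: Optional[str] = start
    while url is not None and url not in seen:  # the seen-set guards against loops
        seen.add(url)
        page = fetch(url)
        yield from page["links"]   # plays the role of the product-link XPath
        url = page.get("next")     # plays the role of the [[...]] paginator


# Made-up three-page result set standing in for the real site.
SITE: Dict[str, Page] = {
    "/search?page=1": {"links": ["/p/a", "/p/b"], "next": "/search?page=2"},
    "/search?page=2": {"links": ["/p/c"], "next": "/search?page=3"},
    "/search?page=3": {"links": ["/p/d"], "next": None},
}

links = list(paginated_crawl("/search?page=1", SITE.__getitem__))
print(links)
```

Without the paginator the loop body would run only once, which matches the one-page behaviour described in the question.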