Using Scrapy to scrape content after a particular keyword/string

Question

I am trying to scrape the content that appears after a particular keyword/string.

Suppose the HTML looks like this:

   <meta property="og:url" content="https://www.example.com/tshirt/pcid111-31">
   <meta property="og:url" content="https://www.example.com/tshirt/pcid3131-33">
   <meta property="og:url" content="https://www.example.com/tshirt/pcid545424524-84">

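1) How do I extract all the data in the content attribute of the elements whose property="og:url"?

2) I also want to extract whatever comes after pcid. Can anyone suggest a way to do that?

I am not sure whether this would work: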

item ["example"] =sel.xpath("//meta[@property='og:url']/text()").extract()[0].replace( "*pcid","")

Does replace accept wildcards?

Answer 1

Try this:

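# Run the XPath once and restrict it to the og:url tags, so that every value contains 'pcid'
contents = response.xpath("//meta[@property='og:url']/@content").extract()

for content in contents:
    # Everything after the first 'pcid'
    print(content.split('pcid')[1])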

Output:

111-31

3131-33

545424524-84
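
Note that `split('pcid')[1]` raises an IndexError for any value that does not contain 'pcid'. Below is a minimal sketch of a guarded variant, assuming the content values have already been extracted into a plain list of strings. The `after_keyword` helper and the sample `contents` list are hypothetical names used only for illustration.

```python
def after_keyword(text, keyword="pcid"):
    """Return the part of text after the first occurrence of keyword, or None."""
    head, sep, tail = text.partition(keyword)
    # sep is empty when keyword does not occur in text
    return tail if sep else None


# Sample values standing in for the extracted content attributes
contents = [
    "https://www.example.com/tshirt/pcid111-31",
    "https://www.example.com/tshirt/pcid3131-33",
    "https://www.example.com/about",  # hypothetical URL without 'pcid'
]

ids = [after_keyword(c) for c in contents]
print(ids)  # ['111-31', '3131-33', None]
```

Returning None rather than an empty string makes it possible to tell "no keyword found" apart from "keyword found at the very end of the value".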

Answer 2

This will extract the content attribute of the elements with property="og:url":

og_urls = response.xpath("//meta[@property='og:url']/@content").extract()

To extract something from a URL it is usually best to use a regular expression, which in your case would be:

import re

for url in og_urls:
    # "pcid(.+)" captures every character after 'pcid' (greedy)
    matches = re.findall(r"pcid(.+)", url)
    # re.findall() returns a list; you probably want only the first match,
    # and there will most likely be only one anyway
    pcid = matches[0] if matches else ''
    print(pcid)
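
The greedy `(.+)` captures everything up to the end of the string, including any query string or trailing path segment. If the ID consists only of digits and hyphens (an assumption based on the three sample URLs, not something the question states), a tighter pattern could be sketched as follows. `PCID_RE`, `extract_pcid`, and the query string in the second URL are illustrative and not part of the original question.

```python
import re

# Assumption: the ID is made of digits and hyphens only, as in the sample URLs
PCID_RE = re.compile(r"pcid([\d-]+)")


def extract_pcid(url):
    """Return the ID following 'pcid' in url, or '' if there is none."""
    match = PCID_RE.search(url)
    return match.group(1) if match else ""


og_urls = [
    "https://www.example.com/tshirt/pcid111-31",
    "https://www.example.com/tshirt/pcid545424524-84?ref=home",  # hypothetical query string
]

for url in og_urls:
    print(extract_pcid(url))
```

Using `re.search` instead of `re.findall` avoids building a list when only the first match is wanted.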

Alternatively, you can split the URL at 'pcid' and take the last part, e.g.:

for url in og_urls:
    # The last piece after splitting is whatever follows the final 'pcid'
    pcid = url.split('pcid')[-1]
    print(pcid)
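
As for the wildcard question: `str.replace()` matches its first argument literally, so `"*pcid"` is only replaced when the text contains an actual asterisk followed by `pcid`. Separately, `text()` in the XPath from the question selects text nodes, and a `<meta>` element has none, which is why the value has to be read from `@content`. Here is a short sketch of the difference between a literal and a pattern-based replacement, using `re.sub` from the standard library on one of the sample URLs:

```python
import re

url = "https://www.example.com/tshirt/pcid111-31"

# str.replace() is literal: there is no "*pcid" substring, so nothing changes
unchanged = url.replace("*pcid", "")

# re.sub() takes a regular expression: ".*pcid" matches everything
# up to and including 'pcid'
stripped = re.sub(r".*pcid", "", url)

print(unchanged)  # https://www.example.com/tshirt/pcid111-31
print(stripped)   # 111-31
```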

xpath

scrapy