Scrapy - Grab all product details
I need to get all the product details (the ones with a green check mark) from this page: https://sourceforge.net/software/product/Budget-Maestro/
# Accumulates "Heading:value,value|Heading:value,..." across all sections
product_details = ""
divs = response.xpath("//section[@class='row psp-section m-section-comm-details m-section-emphasized grey']/div[@class='list-outer column']/div")
for div in divs:
    detail = div.xpath("./h3/text()").extract_first().strip() + ":"
    if detail != "Company Information:":
        divs2 = div.xpath(".//div[@class='list']/div")
        for div2 in divs2:
            # Keeps text nodes that are non-empty once stripped, but stores them unstripped
            dd = [val for val in div2.xpath("./text()").extract() if val.strip('\n').strip().strip('\n')]
            for d in dd:
                detail = detail + d + ","
        detail = detail.strip(",")
        product_details = product_details + detail + "|"
product_details = product_details.strip("|")
But it also gives me \n along with some of the features. And I'm sure there must be a smarter, shorter way to do this.
If you only need the data from "Product Details", check this:
In [6]: response.css("section.m-section-comm-details div.list svg").xpath('.//following-sibling::text()').extract()
Out[6]:
[u' SaaS\n ',
u' Windows\n ',
u' Live Online ',
u' In Person ',
u' Online ',
u' Business Hours ']
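The strings in the output above still carry the surrounding spaces and \n that the question complains about. In the question's code, the list comprehension tests the stripped value but keeps the unstripped one, which is why the newlines survive. A minimal sketch of cleaning the extracted list, assuming a hypothetical helper named clean_texts (not part of Scrapy) applied to the result of .extract():

```python
def clean_texts(raw_texts):
    """Collapse runs of whitespace and drop entries that end up empty."""
    cleaned = []
    for text in raw_texts:
        # str.split() with no arguments splits on any whitespace, including \n
        normalized = " ".join(text.split())
        if normalized:
            cleaned.append(normalized)
    return cleaned


# Sample values copied from the shell output above, plus one whitespace-only node
raw = [u' SaaS\n ', u' Windows\n ', u' Live Online ', u'\n ']
print(clean_texts(raw))
```

The same helper could be applied to any list returned by .extract(), so the cleaning stays in one place instead of being repeated inside each loop.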
Use this:
divs = [div.strip() for div in response.xpath('//*[contains(@class, "has-feature")]/text()').extract() if div.strip()]
Now divs is:
[u'Accounts Payable', u'Accounts Receivable', u'Cash Management', u'General Ledger', u'Payroll', u'Project Accounting', u'"What If" Scenarios', u'Balance Sheet', u'Capital Asset Planning', u'Cash Management', u'Consolidation / Roll-Up', u'Forecasting', u'General Ledger', u'Income Statements', u'Multi-Company', u'Multi-Department / Project', u'Profit / Loss Statement', u'Project Budgeting', u'Run Rate Tracking', u'Version Control', u'"What If" Scenarios', u'Balance Sheet', u'Cash Management', u'Consolidation / Roll-Up', u'Forecasting', u'General Ledger', u'Income Statements', u'Profit / Loss Statement']
I hope this is what you were looking for. Now iterate over this list and apply your own logic :)
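For the "Heading:value,value|Heading:value" string that the question builds by hand, here is one possible sketch using str.join instead of repeated concatenation and strip calls. The function name format_details, its skip parameter, and the sample headings "Deployment" and "Training" are assumptions made for illustration; only "Company Information" and the sample values come from the thread.

```python
def format_details(sections, skip=("Company Information",)):
    """Build 'Heading:a,b|Heading:c' from (heading, raw_values) pairs."""
    parts = []
    for heading, values in sections:
        heading = heading.strip()
        if heading in skip:
            continue
        # Normalize whitespace so no stray \n ends up in the result
        items = [" ".join(value.split()) for value in values]
        items = [item for item in items if item]
        parts.append(heading + ":" + ",".join(items))
    return "|".join(parts)


# Hypothetical section headings paired with values from the thread
sections = [
    ("Deployment", [" SaaS\n ", " Windows\n "]),
    ("Company Information", ["ignored"]),
    ("Training", [" Live Online ", " In Person "]),
]
print(format_details(sections))
```

In a spider, each pair would presumably come from the question's own selectors, for example the ./h3/text() heading of a section together with the extracted text nodes under its div.list element.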