Exclude a certain tag when crawling with Scrapy

I am scraping a web page. Part of the page source looks like this:

                  <div class="accordion-row">
              <h4 class="accordion-title down-arrow">The Problem</h4>
              <div class="accordion-content">
                <p>
  Solving the climate crisis needs us to collaborate at all levels. Climate impacts are generally associated with heavy industries, such as cement, aviation and, of course, energy. But transformation needs all of us, and the creative sector – worth £111.7 billion to the UK in 2018, equivalent to £306 million every day, and £166 billion in the USA – has impacts just as every other sector does, and of course, its influence is immeasurable.<br />
<br />
When JB was founded it was clear that, in order to reduce greenhouse gas emissions, collective ambition needed a common starting whistle, a baseline and a sound roadmap. With no environmental impact data available for the arts (a problem which persists today in most countries)  it was not possible to start this journey. Therefore, advocating the relevance of the climate crisis to culture, and building the tools and resources to take action was critical. We started by:<br />
<br />
1. gathering data on greenhouse gas emissions with tolls co created with culture, collected at scale and over time;<br />
2. championing case studies, solutions and stories of change;<br />
3. building networks, knowledge-chains for the creative community, with events, training, projects and research programs;<br />
4. producing culturally specific resources. <br />
<br />
These foundations have evolved into a rich program that is now exploring creative climate leadership in order to scale the unmet potential of culture in all its manifestations, championing artists particularly through the lens of climate justice and focusing on cultural policy-making as a rapid agent of change. All the data, both quantitative and qualitative, is proving invaluable.<br />
<br />
The climate crisis is a cultural crisis, in that it reflects our deepest values and identities. Environmental injustice is a growing problem and reflects cultural values that have, too often, championed human – and white - supremacy over all else; and therein lies the opportunity. Culture – artists, theatre makers, festival organizers, galleries and museums, poets and story-tellers – should be at the center of climate action, reframing the stories we tell ourselves and offering visions of a regenerative world in tune with the needs of nature and community. One of JB’s most enduring partnerships with a consortium of cultural organizations in Manchester has resulted in culture recognized as vital for the city’s ambitious climate targets. This is a rare recognition of culture’s potential.
</p>
              </div>
            </div>
                                <div class="accordion-row">
              <h4 class="accordion-title down-arrow">The Strategy</h4>
              <div class="accordion-content">
                <p>
  Julie’s Bicycle has at its core a free resource base for anyone in the world to use. The first is a set of carbon calculators co-produced by the UK cultural sector which provide a snapshot of, for example, the carbon footprint of tours, productions, buildings etc. These are used throughout the program for benchmarking and planning. These tools have been used by 5000 organizations across 50 countries. The Creative Green program offers consultancy, environmental reporting, and the world’s first (though now not only) Green Certification for artistic organizations. The Green Tools are a starting point and Alison is developing a much broader, more holistic set of measurement tools, including science-based target tools, to provide to organizations. The Creative Green consultancy is an engine that helps to power the unfunded leadership work – advocacy and new ideas on the front lines of change, the ‘profit maker’ raising money through consultancy fees and royalties to fund policy shift and systems change. <br />
<br />
These Creative Green tools provide a rich source of data and ‘intelligence’ on thousands of organizations and their procedures, progress, and approaches that inform all of Julie’s Bicycle’s policy work and future initiatives – a core part of their effectiveness. Any policy work is tracked, costed, and disseminated. Her research has been peer-reviewed by Oxford University.<br />
<br />
Julie’s Bicycle has worked with many cities to make the links between culture and climate policy clear, and build culture into sustainability strategies. The Manchester Arts Sustainability Team (MAST) found in a study that working with arts and young people was the most effective way to change behavior towards climate change. Together with Julie’s Bicycle, their strategy is now being replicated in six other cities and has received an EU grant to replicate in six European countries.<br />
<br />
She advocates for cultural policy change on a local, national, and international policy level to include culture in climate change plans.  Julie’s Bicycle has a pioneering partnership with Arts Council England, which has required its core grant beneficiaries (currently around 800) to report annually on environmental impacts and have a policy. This requirement, in pace since 2012, has evolved and now includes the Accelerator Programme – 10 innovative projects led by organizations; and the Spotlight Programme, developing science-based targets with 30 of the 60 organizations that produce the majority of carbon emissions. Arts Council England’s policy has catalyzed an overall reduction of 41% of CO2 across the portfolio of  850 organizations, and a cost savings of around £16-20 million. Other national Arts Councils are looking at this model, and JB has presented, especially recently, to many. And beyond impacts alone, climate and the environment more generally are issues of huge concern and interest – indeed, over half of Arts Council England’s portfolio is now creating or commissioning work related to climate and climate justice.<br />
<br />
Finally, she engages with artists and cultural leaders in centering climate work in their creative practice and entrepreneurship. An intensive Creative Climate Leadership program for artists and producers, and commissioning and events on climate justice, aims to build a global network of cultural changemakers. She is keen to build capacity within the creative community and lift up artistic voices as agents of change. <br />
<br />
Alison’s work has catalyzed a powerful industry, and is using the talents and power of that industry – the influential voice of cultural producers and artists – to change the narrative on climate, as well as change national and international climate strategy. She is working with the biggest institutions in government and culture to change their own processes. She has partnered, or supported, sister initiatives in fashion, media and film industries.<br />
<br />
To scale, she is planning to invest £700,000 to update the digital tools and resources and make them relevant for different country contexts.<br />
<br />
Alison also has a keen focus on moving further internationally. She is in talks with potential partners in Ireland, Canada, Spain, Denmark, and Germany, and plans to license the tools to international cultural organizations, and launch an international Creative Green Kitemark. She wants to build an internationally tested model for a Green Deal that has policy influence at COP26 and beyond, publishing a public policy toolkit using her decade of data and findings. Her goal is also to activate a global network of cultural changemakers through her Creative Leadership programs, commissions, and other supports. Eventually, her vision is a series of regional hubs around the world, collaborating on change.
</p>
              </div>
            </div>

To scrape it, I used this line in my code: introduction = response.css('.accordion-content').extract()

This works and scrapes the data. However, it grabs everything at once.

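Ideally, I would like to scrape the sections inside the accordion class separately. For example, I want to scrape the paragraph that begins with

 <h4 class="accordion-title down-arrow">The Problem</h4>

separately from the one that begins with

<h4 class="accordion-title down-arrow">The Strategy</h4>

This is because we only need the "Strategy" section, not all of them.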

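I rarely use CSS, so I am not sure how to write the selector so that the spider scrapes only the paragraph I need.

Does anyone know how to do this?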

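extract() returns a list, so the "Problem" paragraph is introduction[0] and the "Strategy" paragraph is introduction[1].

If you want to scrape them separately, you can use this: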

problem_paragraph = response.css('div.accordion-row:nth-child(1) > div').get()
strategy_paragraph = response.css('div.accordion-row:nth-child(2) > div').get()

This returns the text together with the <br> tags.

To get only the text of each paragraph (without any tags), you can use XPath with string():

problem_paragraph = response.xpath('string((//div[@class="accordion-content"])[1]/p)').get()
strategy_paragraph = response.xpath('string((//div[@class="accordion-content"])[2]/p)').get()