Scrapy: get all tags following a certain <a> tag within a <td> tag

Question

I have the following page to scrape with Scrapy: http://www.genecards.org/cgi-bin/carddisp.pl?gene=B2M

My task is to get the summaries from the GeneCard. In the HTML, they look like this:

<td>
    <a name="summaries"></a>
    <br >
    <b>Entrez Gene summary for <a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=567" title="See EntrezGene 
    entry for B2M" target="aaa" 
    onClick="doFocus('aaa')">B2M</a> Gene:</b><br >
    <dd> This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I
        <br >
    <dd>heavy chain on the surface of nearly all nucleated cells. The protein has a predominantly beta-pleated sheet
        <br >
    <dd>structure that can form amyloid fibrils in some pathological conditions. A mutation in this gene has been shown<br ><dd>to result in hypercatabolic hypoproteinemia.(provided by RefSeq, Sep 2009) </dd><br ><b>GeneCards Summary for B2M Gene:</b><br ><dd> B2M (beta-2-microglobulin) is a protein-coding gene. Diseases associated with B2M include <i><a href="http://www.malacards.org/card/balkan_nephropathy" title="See balkan nephropathy at MalaCards" target="aaa" 
        onClick="doFocus('aaa')">balkan nephropathy</a></i>, and <i><a href="http://www.malacards.org/card/plasmacytoma" title="See plasmacytoma at MalaCards" target="aaa" 
        onClick="doFocus('aaa')">plasmacytoma</a></i>. GO annotations related to this gene include <i>identical protein binding</i>.</dd><br ><Font size=-1><b>UniProtKB/Swiss-Prot: </b></font><a href="http://www.uniprot.org/uniprot/P61769#section_comments" target="aaa" 
                onClick="doFocus('aaa')">B2MG_HUMAN, P61769</a></font><dd><b>Function</b>:  Component of the class I major histocompatibility complex (MHC). Involved in the presentation of peptide<br >
    <dd>antigens to the immune system</dd>

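Now I would like Scrapy to extract the text from this. However, I cannot figure out how to make Scrapy select a <td> based on the fact that it contains <a name="summaries">. Does Scrapy have an undocumented selector feature that lets you select a tag based on whether it does (or does not) contain a specific child tag?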

Answer 1

Updated:

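You can use XPath, starting from sel.xpath('.//a[@name="summaries"]'). I don't have Scrapy on this Mac, so I am using lxml instead. In lxml you can use getparent(), itersiblings(), and so on. There really are many ways to do this; here is one example: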

from lxml import html

s = '... your very long html page source ...'
tree = html.fromstring(s)

for a in tree.xpath('.//a[@name="summaries"]'):
    td = a.getparent()  # the parent of this <a> is the <td>
    # iterchildren() yields every direct child node of the td
    for node in td.iterchildren():
        print(node.text)

Result:

None


None
Summaries
(According to 
None
None
Entrez Gene summary for 
None
 This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I
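
The None entries in the output above appear because .text holds only the text that precedes an element's first child, so an empty <a> or a <br> has no .text at all, and a <b> that wraps a link is cut off at the link. A minimal sketch of the difference between .text and text_content(), using a shortened, made-up stand-in for the real page source (the snippet below is an assumption, not the actual GeneCards markup):

```python
from lxml import html

# Hypothetical, shortened stand-in for the real page source.
snippet = '''
<table><tr><td>
  <a name="summaries"></a>
  <br>
  <b>Entrez Gene summary for <a href="#">B2M</a> Gene:</b>
  <dd>This gene encodes a serum protein.</dd>
</td></tr></table>
'''

tree = html.fromstring(snippet)
a = tree.xpath('.//a[@name="summaries"]')[0]
td = a.getparent()

# For each direct child of the td, compare .text (text before the
# first child only) with text_content() (text of all descendants).
rows = [(node.tag, node.text, node.text_content())
        for node in td.iterchildren()]

for tag, text, full in rows:
    print(tag, repr(text), repr(full))
```

If you only need the readable text of each child, text_content() is usually the more convenient of the two.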

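Alternatively, use itersiblings() to iterate over the sibling nodes that follow the <a> (pass preceding=True to get the ones before it):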

for a in tree.xpath('.//a[@name="summaries"]'):
    # iterate over the siblings of the matched <a> itself
    for node in a.itersiblings():
        print(node.text)

...
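
The same "everything after the <a>" selection can also be written in plain XPath with the following-sibling axis, which has the advantage that the expression should carry over unchanged to a Scrapy selector. A small sketch, again on a made-up snippet rather than the real page (the markup and variable names are assumptions for illustration):

```python
from lxml import html

# Hypothetical, shortened stand-in for the real page source.
snippet = '''
<table><tr><td>
  <b>Jump to Section...</b>
  <a name="summaries"></a>
  <br>
  <b>Entrez Gene summary for B2M Gene:</b>
  <dd>This gene encodes a serum protein.</dd>
</td></tr></table>
'''

tree = html.fromstring(snippet)

# All element siblings that come after the named anchor, in document order.
following = tree.xpath('//a[@name="summaries"]/following-sibling::*')
tags = [el.tag for el in following]

# itersiblings() walks the same following siblings by default.
anchor = tree.xpath('//a[@name="summaries"]')[0]
tags_via_lxml = [el.tag for el in anchor.itersiblings()]

print(tags)
print(tags_via_lxml)
```

Note that the <b> before the anchor is skipped in both cases, since only the siblings after the <a> are selected.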

Or, if what you actually want is all the text contained in the parent td, you can simply use the XPath //text() to get all of it:

for a in tree.xpath('.//a[@name="summaries"]'):
    # './..' steps up to the parent td, '//text()' collects every text node in it
    print(a.xpath('./..//text()'))

The (very long) result:

['\n\t', '\n', '\n', 'Jump to Section...', '\n', 'Aliases', '\n', 'Databases', '\n', 'Disorders / Diseases', '\n', 'Domains / Families', '\n', 'Drugs / Compounds', '\n', 'Expression', '\n', 'Function', '\n', 'Genomic Views', '\n', 'Intellectual Property', '\n', 'Localization', '\n', 'Orthologs', '\n', 'Paralogs', '\n', 'Pathways / Interactions', '\n', 'Products', '\n', 'Proteins', '\n', 'Publications', '\n', 'Search Box', '\n', 'Summaries', '\n', 'Transcripts', '\n', 'Variants', '\n', 'TOP', '\n', 'BOTTOM', '\n', '\n', '\n', 'Summaries', 'for B2M gene', '(According to ', 'Entrez Gene', ',\n\t\t', 'GeneCards', ',\n\t\t', 'Tocris Bioscience', ',\n\t\t', "Wikipedia's", ' \n\t\t', 'Gene Wiki', ',\n\t\t', 'PharmGKB', ',', '\n\t\t', 'UniProtKB/Swiss-Prot', ',\n\t\tand/or \n\t\t', 'UniProtKB/TrEMBL', ')\n\t\t', 'About This Section', 'Try', 'GeneCards Plus']
['Entrez Gene summary for ', 'B2M', ' Gene:', ' This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I', 'heavy chain on the surface of nearly all nucleated cells. The protein has a predominantly beta-pleated sheet', 'structure that can form amyloid fibrils in some pathological conditions. A mutation in this gene has been shown', 'to result in hypercatabolic hypoproteinemia.(provided by RefSeq, Sep 2009) ', 'GeneCards Summary for B2M Gene:', ' B2M (beta-2-microglobulin) is a protein-coding gene. Diseases associated with B2M include ', 'balkan nephropathy', ', and ', 'plasmacytoma', '. GO annotations related to this gene include ', 'identical protein binding', '.', 'UniProtKB/Swiss-Prot: ', 'B2MG_HUMAN, P61769', 'Function', ':  Component of the class I major histocompatibility complex (MHC). Involved in the presentation of peptide', 'antigens to the immune system', 'Gene Wiki entry for ', 'B2M', ' (Beta-2 microglobulin) Gene']
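
As for the original question of selecting a tag based on whether it contains a specific child: this is not an undocumented feature but a standard XPath predicate. The sketch below selects the td directly, without going through getparent(), and also shows the negated form. It uses lxml on a made-up two-row table (the markup is an assumption); since Scrapy selectors accept XPath 1.0 expressions as well, the same expression should work with response.xpath(...).

```python
from lxml import html

# Hypothetical, shortened stand-in for the real page source.
snippet = '''
<table>
  <tr><td><a name="aliases"></a><b>Aliases</b></td></tr>
  <tr><td><a name="summaries"></a><b>Entrez Gene summary</b><dd>This gene encodes a serum protein.</dd></td></tr>
</table>
'''

tree = html.fromstring(snippet)

# td elements that have a direct <a name="summaries"> child ...
with_anchor = tree.xpath('//td[a[@name="summaries"]]')
# ... and td elements that do not.
without_anchor = tree.xpath('//td[not(a[@name="summaries"])]')

# Join the non-empty text nodes of the matching td into one string.
summary_text = ' '.join(
    t.strip() for t in with_anchor[0].xpath('.//text()') if t.strip()
)
print(summary_text)
```

In a spider, something like response.xpath('//td[a[@name="summaries"]]//text()').getall() would be the rough equivalent, assuming a Scrapy version that provides getall().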
