How to use XPath to extract text from multiple tags in an HTML document
Suppose I have many HTML snippets like this:
<div style="clear:both" id="novelintro" itemprop="description">you are foolish!<font color=red size=4>I am superman!</font></div>
I want to use XPath to extract the text: you are foolish! I am superman!
However, if I use
xpath('//div[@id="novelintro"]/text()').extract()
I only get "you are foolish!"
And when I use:
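xpath('//div[@id="novelintro"]/font/text()').extract()
I only get "I am superman!"
So, could you extract the whole sentence "you are foolish! I am superman!" using just one XPath expression?
To make matters worse, the HTML above uses a <font> tag, but my other snippets contain many other tags, for example:
From the following snippet, extract "hi girl I love you!":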
<div style="clear:both" id="novelintro" itemprop="description">hi girl<legend >I love you!</legend></div>
From the following snippet, extract "If I marry your mother then I am your father!":
<div style="clear:both" id="novelintro" itemprop="description">If I<legend > marry your mother<div>then I am your father!</div></legend></div>
Is it possible to use a single XPath expression that works for all of these HTML snippets?
If your document is:
<outer>This is outer text.<inner>And this is inner text.</inner>More outer text.</outer>
and you use the XPath expression /outer//text() (read: "any text below 'outer'"), the result is a list like this:
This is outer text.
-----------------------
And this is inner text.
-----------------------
More outer text.
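The list above can be reproduced with a short script. This is only a sketch: it assumes the lxml library is installed (lxml is not used in the original posts, although Scrapy's selectors are built on top of it), and the variable names are illustrative.

```python
from lxml import etree

# The sample document from the answer above.
doc = etree.fromstring(
    "<outer>This is outer text."
    "<inner>And this is inner text.</inner>"
    "More outer text.</outer>"
)

# //text() selects every descendant text node, in document order.
text_nodes = doc.xpath("/outer//text()")
for node in text_nodes:
    print(node)

# Joining the nodes yields the complete text as one string.
full_text = "".join(text_nodes)
print(full_text)
```

The same expression passed to a Scrapy selector should yield the same three strings from `.extract()`; joining them is then left to the caller.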
You can use XPath's string() function to recursively convert a single node to a string (the optional . refers to the current node):
# Older Scrapy releases exposed this class as HtmlXPathSelector.
from scrapy.selector import Selector

def node_to_string(node):
    # string(.) concatenates all descendant text nodes of the current node
    return node.xpath("string(.)").extract()[0]

# ------------------------------------------------------

body = """<body>
<div style="clear:both" id="novelintro" itemprop="description">you are foolish!<font color=red size=4>I am superman!</font></div>
<div style="clear:both" id="novelintro2" itemprop="description">hi girl<legend >I love you!</legend></div>
<div style="clear:both" id="novelintro3" itemprop="description">If I<legend > marry your mother<div>then I am your father!</div></legend></div>
</body>"""

sel = Selector(text=body)

# single target use
print(node_to_string(sel.xpath('//div[@id="novelintro"]')))
print()

# multi target use
for div in sel.xpath('//body/div'):
    print(node_to_string(div))
print()

# alternatively
print([node_to_string(n) for n in sel.xpath('//body/div')])
print()

Output:

you are foolish!I am superman!

you are foolish!I am superman!
hi girlI love you!
If I marry your motherthen I am your father!

['you are foolish!I am superman!', 'hi girlI love you!', 'If I marry your motherthen I am your father!']
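Note that the spaces are missing because they are missing in the source. string() concatenates the descendant text nodes exactly as they appear, much like a browser's textContent, and does not insert separators between elements.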
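If the space-separated sentences from the question are what you need, one option is to select the text nodes individually and join them yourself. The sketch below is an assumption-laden illustration rather than part of the original answer: it uses lxml.html directly instead of a Scrapy selector, and the helper name text_with_spaces is hypothetical.

```python
import lxml.html

def text_with_spaces(html_fragment):
    """Join all text nodes of an HTML fragment with single spaces."""
    root = lxml.html.fromstring(html_fragment)
    # .//text() yields each descendant text node separately, so a
    # separator can be inserted between them (string(.) cannot do this).
    pieces = (t.strip() for t in root.xpath(".//text()"))
    return " ".join(p for p in pieces if p)

snippets = [
    '<div id="novelintro">you are foolish!<font color=red size=4>I am superman!</font></div>',
    '<div id="novelintro">hi girl<legend >I love you!</legend></div>',
    '<div id="novelintro">If I<legend > marry your mother<div>then I am your father!</div></legend></div>',
]

for snippet in snippets:
    print(text_with_spaces(snippet))
```

Stripping each piece before joining avoids doubled spaces when a text node already starts or ends with whitespace, as " marry your mother" does in the third snippet.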