Going deeper with xpath node()
I'm trying to find the text surrounding all of the hyperlinks inside the paragraphs of a Wikipedia page, and my way of doing that involves the XPath query tree.xpath("//p/node()"). Most links work fine, and I'm able to pick out most of the <Element a at $mem_location$> entries. However, if a hyperlink is italicized (see the example below), node() only sees it as <Element i at $mem_location$> and doesn't appear to look any deeper.
This causes my code to miss hyperlinks and throws off the indexing for the rest of the page.
For example:
<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>,
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a>
(pets) and also livestock and wild mammals, whenever hair-loss is involved.
<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i>
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i>
species are involved in mange, both of these genera are also involved in human skin diseases (by
convention only, not called mange). <i>Sarcoptes</i> in humans is especially
severe symptomatically, and causes the condition known as
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>
node() correctly grabs "Mange", "Domestic animal", and "Scabies", but it effectively skips "Sarcoptes" and "Demodex" and messes up the indexing, because I filter for nodes that are <Element a at $mem_location$> rather than <Element i at $mem_location$>.
Is there a way to look deeper with node()? I couldn't find anything about it in the docs.
Edit: My XPath is currently "//p/node()", but it only grabs the outermost layer of elements. Most of the time that layer is an <a>, which is fine, but if the link is wrapped in an <i> layer, it only grabs the <i>. I'm asking whether there is a way to inspect more deeply, so that I can find the <a> inside the <i> wrapper.
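For reference, a minimal case that reproduces what I'm seeing (a sketch using lxml and a trimmed-down version of the paragraph above): //p/node() returns only the direct children of the <p>, so a nested link surfaces as its <i> wrapper instead of as an <a>.
from lxml import etree

html = ('<p>See <a href="/wiki/Scabies" title="Scabies">scabies</a> and '
        '<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i>.</p>')
tree = etree.HTML(html)

# node() yields the <p>'s direct children only: text nodes, the bare <a>,
# and the <i> wrapper -- never the <a> nested inside the <i>.
for node in tree.xpath('//p/node()'):
    print(repr(node))
# prints: 'See ', <Element a at 0x...>, ' and ', <Element i at 0x...>, '.'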
The relevant code is below:
import re
from lxml import etree

tree = etree.HTML(read)  # `read` holds the page's HTML source
titles = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/@title'))  # extracts the titles of all hyperlinks in section paragraphs
hyperlinks = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/text()'))
b = list(tree.xpath("//p/b/text()"))  # extracts all bolded words in section paragraphs
t = list(tree.xpath("//p/node()"))
b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue
    if "<Element " not in items:
        pattern = re.compile('(\t(.*?)\n)')
        look = pattern.search(items)
        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])
        period_pattern = re.compile("(\t(.*?)\.)")
        look_period = period_pattern.search(items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])
I can't think of a straight XPath that would do this, but you can always iterate through the contents and filter out the elements, like this -
i = 0
while i < len(t):
    x = t[i]
    # text nodes are plain strings, so only real <i> elements pass this check
    if getattr(x, 'tag', None) == 'i' and x.findall('a'):
        del t[i]
        # x.xpath('node()') takes in text children as well as <a> elements
        for j, y in enumerate(x.xpath('node()')):
            t.insert(i + j, y)
    else:
        i += 1
This also handles multiple <a> elements inside a single <i>, e.g. <i><a>something</a><a>blah</a></i>.
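As a quick usage check (a sketch; it assumes t was built from the sample paragraph in the question and the loop above has already run), the nested links now show up as top-level <a> nodes:
# after unwrapping, italicized links are ordinary <a> entries in t
for node in t:
    if getattr(node, 'tag', None) == 'a':
        print(node.get('title'))
# expected for the sample paragraph: Mange, Domestic animal,
# Sarcoptes, Demodex, Scabies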