Using lxml to parse namespaced HTML?
This is driving me insane, and I've been struggling with it for hours now. Any help would be greatly appreciated.
I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.
This is my full request:
import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print links
However, it returns an empty array. If I use this query instead:
links = doc('#maincontent .linkoutlist')
then I get this HTML back:
<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
<h4>Full Text Sources</h4>
<ul>
<li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125" ref="itool=Abstract&PrId=3159&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Lippincott Williams & Wilkins</a></li>
<li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui" ref="itool=Abstract&PrId=3682&uid=15107654&db=pubmed&log$=linkoutlink&nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
</ul>
<h4>Other Literature Sources</h4>
...
</div>
So the parent selector does return the HTML, with plenty of <a> tags inside. It also appears to be valid HTML.
More experimenting reveals that, for some reason, lxml does not like the xmlns attribute on the opening div.
How can I get lxml to ignore this and parse the document like regular HTML?
UPDATE: tried ns_clean, still failing:
from StringIO import StringIO
from lxml import etree
from lxml.cssselect import CSSSelector

parser = etree.XMLParser(ns_clean=True)
tree = etree.parse(StringIO(response.content), parser)
sel = CSSSelector('#maincontent .rprt_all a')
print sel(tree)
If I remember correctly, I ran into a similar problem myself a while back. You can "ignore" the namespace by mapping it to None, like this:
sel = CSSSelector('#maincontent .rprt_all a', namespaces={None: "http://www.w3.org/1999/xhtml"})
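If your lxml version rejects a None prefix (lxml's XPath engine has historically not accepted empty prefixes), binding the XHTML namespace to an explicit prefix and using cssselect's prefix|element syntax should work as well. A minimal sketch, assuming the page parses as well-formed XML (as it apparently did for PyQuery above); the prefix name x is arbitrary:

import requests
from lxml import etree
from lxml.cssselect import CSSSelector

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
# Parse as XML; the document's default namespace is XHTML
tree = etree.fromstring(response.content)
# Bind the XHTML namespace to an explicit prefix for the CSS selector
sel = CSSSelector('#maincontent .linkoutlist x|a',
                  namespaces={'x': 'http://www.w3.org/1999/xhtml'})
print [a.get('href') for a in sel(tree)]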
Good luck getting standard XML/DOM parsing to work on most HTML. Your best bet is to use BeautifulSoup (pip install beautifulsoup4 or easy_install beautifulsoup4), which has a lot of handling for badly-built structures. Maybe something like this?
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
bs = BeautifulSoup(response.content, 'html.parser')  # name a parser explicitly to avoid bs4's warning
div = bs.find('div', class_='linkoutlist')
links = [ a['href'] for a in div.find_all('a') ]
>>> links
['http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125', 'http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui', 'https://www.researchgate.net/publication/e/pm/15107654?ln_t=p&ln_o=linkout', 'http://www.diseaseinfosearch.org/result/2199', 'http://www.nlm.nih.gov/medlineplus/antidepressants.html', 'http://toxnet.nlm.nih.gov/cgi-bin/sis/search/r?dbs+hsdb:@term+@rn+24219-97-4']
I know it's not the library you were looking to use, but I have hit my head against a wall many a time when it comes to the DOM. The creators of BeautifulSoup have worked around many of the edge cases that tend to crop up in the wild.
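As a side note, since lxml is already installed for PyQuery, you can also tell BeautifulSoup to use it as the underlying parser, which is faster and just as forgiving of broken markup:

bs = BeautifulSoup(response.content, 'lxml')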
You need to handle the namespaces, including the empty one.
Working solution:
from pyquery import PyQuery as pq
import requests
response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print link.attrib.get("title", "No title")
This prints the titles of all the links matching the selector:
Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
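To get the actual list of URLs the question asked for, pull the href attribute out of the same matches instead:

print [link.attrib.get('href') for link in links]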
Alternatively, just set parser to "html" and forget about namespaces:
links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print link.attrib.get("title", "No title")
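This works because lxml's HTML parser performs no namespace processing: xmlns is kept as an ordinary attribute, so a plain a selector matches. The same approach applies when using lxml directly rather than through PyQuery; a minimal sketch with the same URL and selector:

import requests
from lxml import html

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
# The HTML parser treats xmlns as a plain attribute, so namespaces can be ignored
tree = html.fromstring(response.content)
for link in tree.cssselect('#maincontent .linkoutlist a'):
    print link.get('title') or 'No title'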