Reading a particular line from a webpage in Python

Question

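In my code I am trying to put the first line of text from a webpage into a Python variable. At the moment I use urlopen to fetch the entire page for every link I want to read. How do I read only the first line of text on the webpage?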

My code:

import urllib2
import numpy as np  # needed for np.arange below
line_number = 10
id = (np.arange(1,5))
for n in id:
    link =  urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()

I want to extract the words "Old car" from the following HTML code of the webpage:

<html>
    <head>
        <link rel="stylesheet">
        <style>
            .norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
        </style>
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>

Answer 1

If you are going to do this on many different webpages that may be written in different ways, you might find BeautifulSoup helpful.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

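As you can see at the bottom of the Quick Start section, you should be able to extract all the text from the page and then pick whichever line you are interested in.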
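As a rough illustration of that approach, here is a minimal sketch, assuming BeautifulSoup 4 is installed and using Python's built-in "html.parser" backend. The helper name first_text_line and the shortened sample markup are made up for this example; the real pages may need different handling.

```python
from bs4 import BeautifulSoup

def first_text_line(page_source):
    """Return the first non-empty line of visible text, or None if there is none."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Drop <script> and <style> elements so their contents are not treated as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Keep only lines that still contain something after stripping whitespace.
    lines = [line.strip() for line in soup.get_text().splitlines() if line.strip()]
    return lines[0] if lines else None

# Shortened stand-in for the page shown in the question (hypothetical sample).
sample = """<html>
    <head>
        <style>.norm { color: #000000; }</style>
    </head>
    <body>
<b>Old car</b><br>
ID: 02910<br>
</body></html>"""

print(first_text_line(sample))  # Old car
```

Stripping the style block first matters here because otherwise the CSS rules could end up as the "first line" of text.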

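Keep in mind that this only works for text that is in the HTML itself. Some webpages rely heavily on JavaScript, and requests/BeautifulSoup will not be able to read content that is generated by JavaScript.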

Using Requests and BeautifulSoup - Python returns tag with no text

See also a problem I ran into in the past, which was cleared up by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup

Answer 2

Use XPath. It is exactly what we need.

XPath, the XML Path Language, is a query language for selecting nodes from an XML document.

The lxml Python library will help us with this. It is one of many: libxml2, ElementTree, and PyXML are some of the other options. There are many, many libraries for this kind of thing.

Using XPath

Based on your existing code, something like the following will work:

import urllib2
import numpy as np  # np.arange is used below
from lxml import html
line_number = 10
id = (np.arange(1,5))
for n in id:
    link =  urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)
    # xpath() returns a list of matches; take the first one
    print(tree.xpath("//b/text()")[0])

The XPath query //b/text() basically says "get the text from the <b> elements on the page". The tree.xpath function call returns a list, and we select the first item with [0]. Easy.
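One thing to watch: because tree.xpath returns a list, indexing with [0] raises an IndexError when a page has no <b> element at all. A small sketch of a guarded version follows, assuming lxml is installed; the helper name first_bold_text is hypothetical and only meant to illustrate the idea.

```python
from lxml import html

def first_bold_text(page_source):
    """Return the text of the first <b> element, or None if the page has none."""
    tree = html.fromstring(page_source)
    matches = tree.xpath("//b/text()")  # list of text nodes, possibly empty
    return matches[0] if matches else None

print(first_bold_text("<html><body><b>Old car</b><br>ID: 02910<br></body></html>"))
print(first_bold_text("<html><body><p>No bold text here</p></body></html>"))
```

The first call should give the text from the <b> element and the second should give None instead of crashing.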

An aside about Requests

The Requests library is the state of the art for reading webpages in code. It may save you some headaches later.

The complete program might look like this:

from lxml import html
import requests

for nn in range(1, 5):  # ids 1 through 4, matching np.arange(1, 5) in the question
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print(tree.xpath("//b/text()")[0])

Caveats

The URLs don't work for me, so you may have to tinker a bit. The concept is sound, though.
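Since some of the URLs may not resolve, a more defensive fetch could look roughly like the sketch below. It assumes requests and lxml are installed. The function name fetch_first_bold, the injectable get parameter, and the 10 second timeout are all choices made for this illustration, not anything the site requires.

```python
import requests
from lxml import html

def fetch_first_bold(url, get=requests.get, timeout=10):
    """Download url and return the first <b> text, or None if anything fails."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    except requests.RequestException:
        # Covers connection errors, timeouts, and bad status codes.
        return None
    matches = html.fromstring(response.text).xpath("//b/text()")
    return matches[0] if matches else None
```

Calling fetch_first_bold("http://www.cv.edu/id=%d" % nn) inside the loop would then yield None for pages that fail to load instead of stopping the whole run. The get parameter exists only so the function can be tried out with a stand-in for requests.get, without touching the network.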

Reading webpages aside, you can test the XPath using the following:

from lxml import html

tree = html.fromstring("""<html>
    <head>
        <link rel="stylesheet">
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")

print(tree.xpath("//b/text()")[0])  # "Old car"
