How does table parsing work in Python? Is there an easy way other than Beautiful Soup?

Question

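I want to understand how to use Beautiful Soup to extract the href links for the contents under a particular column of a table on a web page. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.

On this page, the table with class wikitable has a column title. I need to extract the href links behind each of the values under that column title and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table parsing documentation.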

Answer 1

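You don't really have to navigate the tree; you can simply look for what identifies those rows.

In this example, the URLs you are looking for are in a table with class="wikitable", and within that table they sit in td tags with align=center. Now that we have a reasonably unique way to identify the links, we can start extracting them.

However, keep in mind that the page may contain several tables with class="wikitable" and td tags with align=center. If you want only the first or the second table, depending on your choice, you will have to add extra filters.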
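As a rough illustration of such an extra filter, here is a minimal sketch that picks one table by position and one column by its header text. The sample markup, the header names ("Title", "Director") and the helper links_in_column are made up for the example and are not taken from the real page. It also assumes a simple table with no rowspan or colspan cells, which real Wikipedia tables often have.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Director</th></tr>
  <tr><td align="center"><a href="/wiki/Film_A">Film A</a></td>
      <td><a href="/wiki/Director_A">Director A</a></td></tr>
  <tr><td align="center"><a href="/wiki/Film_B">Film B</a></td>
      <td>Director B</td></tr>
</table>
<table class="wikitable">
  <tr><th>Other</th></tr>
  <tr><td align="center"><a href="/wiki/Other">Other</a></td></tr>
</table>
"""


def links_in_column(html, header_text, table_index=0):
    """Return the first href of each cell under the column named header_text.

    Hypothetical helper: assumes the first row holds the headers and that
    no cell spans several rows or columns.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Extra filter 1: choose which wikitable to use by its position.
    table = soup.find_all("table", class_="wikitable")[table_index]
    rows = table.find_all("tr")
    # Extra filter 2: find the column index from the header row.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    col = headers.index(header_text)

    links = []
    for row in rows[1:]:
        cells = row.find_all(["td", "th"])
        if col >= len(cells):
            continue  # short row, nothing in this column
        a_tag = cells[col].find("a", href=True)
        if a_tag:
            links.append(a_tag["href"])
    return links
```

Calling links_in_column(SAMPLE_HTML, "Title") collects only the links from the first table's "Title" column; passing table_index=1 switches to the second table.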

The code for extracting all the links from the tables should look like this:

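from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer

# Download the raw HTML of the page.
content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()

# Parse only <table class="wikitable"> elements and skip the rest of the page.
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)

links = []
# Look at every element with align="center" inside those tables.
for sp in soup.find_all(align="center"):
    a_tag = sp('a')  # shorthand for sp.find_all('a')
    if a_tag:
        # Keep the href of the first link in the cell.
        links.append(a_tag[0].get('href'))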
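The question also asks for the links to end up in an Excel sheet. One possible way, sketched here under the assumption that a CSV file is acceptable (Excel opens CSV files directly), is to write the collected list with the standard csv module. The helper save_links_csv, the "url" header and the file name are hypothetical. Since hrefs on such a page are often relative (for example /wiki/...), the sketch resolves them against the page URL with urljoin.

```python
import csv
from urllib.parse import urljoin

BASE_URL = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015"


def save_links_csv(links, path, base_url=BASE_URL):
    """Write one absolute URL per row to a CSV file and return the URLs.

    Hypothetical helper for illustration only.
    """
    # Turn relative hrefs such as /wiki/Foo into full URLs.
    absolute = [urljoin(base_url, href) for href in links]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])  # header row
        writer.writerows([url] for url in absolute)
    return absolute
```

With the links list built above, save_links_csv(links, "links.csv") would produce a file that can be opened in Excel.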

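One more thing to note here is the use of SoupStrainer. It specifies a filter so that only the content you want to process is parsed, which helps speed up processing. Try leaving out the parse_only argument in this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my computer is not that powerful.)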
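To see what parse_only changes without downloading anything, here is a small sketch on made-up markup (the sample HTML is an assumption, not the real page). It parses the same string with and without the strainer and compares which links end up in the soup. It shows the filtering effect only; the speed difference depends on the page size and your machine.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup: one wikitable surrounded by unrelated content.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/wiki/Main_Page">Main page</a></div>
  <table class="wikitable">
    <tr><td align="center"><a href="/wiki/Film_A">Film A</a></td></tr>
  </table>
  <p>Footer <a href="/wiki/Help">help</a></p>
</body></html>
"""

only_tables = SoupStrainer("table", {"class": "wikitable"})

# Without the strainer the whole document is turned into a tree.
full_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# With the strainer only the matching tables (and their contents) are kept.
table_soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_tables)

full_links = [a["href"] for a in full_soup.find_all("a")]
table_links = [a["href"] for a in table_soup.find_all("a")]

print(len(full_links), "links without parse_only")
print(len(table_links), "links with parse_only")
```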


html

python

excel

parsing

wikipedia