How to get the contents under a particular column in a table from Wikipedia using soup & python

Question

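I need to get the href links that the contents under a particular column of a Wikipedia table point to. The page is "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015". On this page there are several tables with class "wikitable". For each row, I need the link that the content under the Title column points to. I would like to copy these links into an Excel sheet.

I do not know the exact code for searching under a particular column, but I got this far and now get "'NoneType' object is not callable". I am using bs4. I wanted to extract at least part of the table so that I could narrow it down to the href links under the Title column, but I ended up with this error. The code is below: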

from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read())
for row in soup('table', {'class': 'wikitable'})[1].tbody('tr'):
    tds = row('td')
    print (tds[0].string, tds[0].string)

I would appreciate a little guidance. Does anyone know?

Answer 1

I found that the NoneType error is probably related to how the table is filtered. The modified code is below:

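# urllib2 exists only in Python 2; the question uses Python 3's urllib.request.
from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer

url = "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015"
content = urlopen(url).read()

# Parse only the <table class="wikitable"> elements to keep the tree small.
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)

# Collect the href of the first link inside every element with align="center".
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp("a")
    if a_tag:
        links.append(a_tag[0].get("href"))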
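
For background on the error itself: in bs4, `table.tbody` is `None` whenever the parsed table has no `<tbody>` element, so `table.tbody('tr')` ends up calling `None`. Searching for `tr` directly on the table avoids that. If you want only the links under the Title column, rather than every centered cell, something along these lines might work. This is a minimal sketch, not a tested solution for the live page: `SAMPLE_HTML` is made-up stand-in markup, `title_links` and `write_csv` are hypothetical helper names, and the column lookup assumes the first row of each table holds the headers and that no cell uses rowspan or colspan (merged cells would shift the positions and need extra handling).

```python
import csv

from bs4 import BeautifulSoup

# Made-up stand-in for one "wikitable"; these rows are NOT real page data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Opening</th><th>Title</th><th>Director</th></tr>
  <tr><td>1 Jan</td><td><i><a href="/wiki/Film_A">Film A</a></i></td><td>Someone</td></tr>
  <tr><td>2 Jan</td><td>Film B</td><td><a href="/wiki/Person">Person</a></td></tr>
  <tr><td>9 Jan</td><td><a href="/wiki/Film_C">Film C</a></td><td>Another</td></tr>
</table>
"""


def title_links(html, column="Title"):
    """Return (link text, href) pairs found under the given column header."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for table in soup.find_all("table", class_="wikitable"):
        # Search for rows on the table itself, so a missing <tbody> is harmless.
        rows = table.find_all("tr")
        if not rows:
            continue
        # Assumption: the first row contains the column headers.
        headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
        if column not in headers:
            continue
        idx = headers.index(column)
        for row in rows[1:]:
            cells = row.find_all(["td", "th"])
            if idx >= len(cells):
                continue  # short row; skipped rather than guessed at
            a = cells[idx].find("a", href=True)
            if a:
                links.append((a.get_text(strip=True), a["href"]))
    return links


def write_csv(links, path):
    """Write the pairs to a CSV file, which Excel can open directly."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "href"])
        writer.writerows(links)


links = title_links(SAMPLE_HTML)
```

To try it on the real page, you could pass the downloaded `content` to `title_links` and then call `write_csv(links, "titles.csv")`, where the file name is only an example. The hrefs on Wikipedia are relative paths such as `/wiki/...`, so you would need to prefix the site address to get full URLs.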

html

python

excel

parsing

beautifulsoup