Web scraping - get tag through text in "brother" tag - beautiful soup

Question

I am trying to get text from a table on Wikipedia, and I will be doing this for many pages (books, in this case). I want to get the book's genre.

(Image: HTML code for the page)

I need to extract the td that contains the genres, that is, the td in the row whose label text is "Genre".

I did this:

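import urllib.request

from bs4 import BeautifulSoup

# url2 is the URL of the Wikipedia article (defined elsewhere)
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')

for table in soup2.find_all('table', class_='infobox vcard'):
    # Row index 5 is hard-coded, which is why this breaks on other pages
    for tr in table.find_all('tr')[5:6]:
        for td in tr.find_all('td'):
            print(td.get_text(separator="\n"))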

This gets me the genre, but only on some pages, because the number of rows in the table differs from page to page.

Example of a page where this does not work:

https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)

Does anyone know how to find the row by searching for the string "Genre"? Thank you.

Answer 1

In this particular case, you don't need to bother with all of that. Try this:

import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])

Output:

                     0                                       1
0   First edition cover                     First edition cover
1                Author                          J. D. Salinger
2          Cover artist               E. Michael Mitchell[1][2]
3               Country                           United States
4              Language                                 English
5                 Genre  Realistic fictionComing-of-age fiction
6             Published                           July 16, 1951
7             Publisher               Little, Brown and Company
8            Media type                                   Print
9                 Pages                          234 (may vary)
10                 OCLC                                  287628
11        Dewey Decimal                                  813.54
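
With the infobox loaded as a two-column table like this one, a field can be looked up by its label instead of by its row position. Here is a minimal sketch, assuming the first column holds the field names and the second the values; the helper name `lookup_infobox_field` is hypothetical and not part of pandas:

```python
import pandas as pd


def lookup_infobox_field(infobox: pd.DataFrame, label: str):
    """Return the value next to `label` in a two-column infobox table.

    Assumption: column 0 holds the field names and column 1 the values.
    Returns None when no row carries that label.
    """
    # Normalize the labels so "Genre", "genre" and " Genre " all match.
    names = infobox.iloc[:, 0].astype(str).str.strip().str.lower()
    mask = names == label.strip().lower()

    # Keep only the values from the rows whose label matched.
    matches = infobox.iloc[:, 1][mask]
    return None if matches.empty else matches.iloc[0]
```

Calling `lookup_infobox_field(tables[0], "Genre")` on the table printed above should give the Genre cell no matter which row it sits in. Note that the cell text comes back exactly as pandas parsed it, so multiple genres may be run together without a separator, as in the output above.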

From here you can use standard pandas methods to extract whatever you need.
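
If you would rather stay with BeautifulSoup, as in the question, the same idea can be expressed by finding the header cell whose text is "Genre" and then stepping to its sibling td. This is only a sketch under some assumptions: that the label sits in a th with the value in a td in the same row, and that the table carries the `infobox` class. The function name `infobox_field` is made up for illustration:

```python
from bs4 import BeautifulSoup


def infobox_field(html: str, label: str = "Genre"):
    """Return the text of the <td> beside the <th> whose text equals `label`.

    Assumption: each infobox row looks like <tr><th>Label</th><td>Value</td></tr>.
    Returns None when the table or the label cannot be found.
    """
    soup = BeautifulSoup(html, "html.parser")

    # class_="infobox" matches any table that has "infobox" among its classes.
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None

    for th in infobox.find_all("th"):
        if th.get_text(strip=True).lower() == label.lower():
            # The value lives in the next <td> sibling of the matching <th>.
            td = th.find_next_sibling("td")
            if td is not None:
                return td.get_text(separator="\n", strip=True)
    return None
```

The HTML can be fetched the same way as in the question, for example by passing `urllib.request.urlopen(url2).read().decode("utf-8")` as the `html` argument. Because the row is located by its label, the lookup does not depend on how many rows come before it.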

wikipedia

beautifulsoup