Web scraping - get a tag through the text in its "brother" (sibling) tag - Beautiful Soup

I am trying to get text from a table on Wikipedia, and I will be doing this for many pages (books, in this case). I want to get the book genres.

Html code for the page

I need to extract the td that contains the genres, that is, the cell in the row whose label text is "Genre".

I did this:

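```python
import urllib.request

from bs4 import BeautifulSoup

# url2 holds the URL of the Wikipedia article being scraped
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')

for table in soup2.find_all('table', class_='infobox vcard'):
    # Only the sixth row is read, so this relies on "Genre" always being at index 5
    for tr in table.find_all('tr')[5:6]:
        for td in tr.find_all('td'):
            print(td.get_text(separator="\n"))
```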

This gets me the genre, but only on some pages, because the number of rows before the genre row differs from page to page.

Example of a page where this does not work:

https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)

Does anyone know how to find the row by searching for the string "Genre"? Thank you.

In this particular case, you don't need to bother with all of that. Just try:

```python
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
```

Output:

```
                     0                                       1
0   First edition cover                     First edition cover
1                Author                          J. D. Salinger
2          Cover artist               E. Michael Mitchell[1][2]
3               Country                           United States
4              Language                                 English
5                 Genre  Realistic fictionComing-of-age fiction
6             Published                           July 16, 1951
7             Publisher               Little, Brown and Company
8            Media type                                   Print
9                 Pages                          234 (may vary)
10                 OCLC                                  287628
11        Dewey Decimal                                  813.54
```

From here you can use standard pandas methods to extract whatever you need.
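
As one possible way to do that, here is a minimal sketch that looks up a row by its label instead of its position, so it does not depend on the row count. It assumes the table has the two integer-labelled columns `0` and `1` shown in the output above. The helper name `infobox_value` and the `sample` frame are made up for illustration; `sample` stands in for `tables[0]` so the snippet runs without a network request.

```python
import pandas as pd


def infobox_value(table: pd.DataFrame, label: str):
    """Return the column-1 value of the first row whose column-0 text equals `label`.

    Hypothetical helper: the match ignores case and surrounding whitespace,
    and None is returned when no row matches.
    """
    labels = table[0].astype(str).str.strip().str.lower()
    matches = table.loc[labels == label.strip().lower(), 1]
    return matches.iloc[0] if not matches.empty else None


# Stand-in for tables[0]: a few rows copied from the output above.
sample = pd.DataFrame(
    [
        ["Author", "J. D. Salinger"],
        ["Language", "English"],
        ["Genre", "Realistic fictionComing-of-age fiction"],
    ]
)

print(infobox_value(sample, "Genre"))
```

With the real data you would call `infobox_value(tables[0], "Genre")`. On other pages the infobox may not be the first table, or the columns may be labelled differently, so it is worth inspecting `tables` before relying on this.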