Web 抓取 - 通过 "brother" 标签中的文本获取标签 - 美丽的汤
Web scraping - get tag through text in "brother" tag - beautiful soup
我试图在维基百科中获取 table 中的文本,但我会在很多情况下这样做(在这种情况下是书籍)。我想获取图书类型。
Html code for the page
当文本在流派中时,我需要提取包含流派的 td。
我这样做了:
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
for tr in table.findAll('tr')[5:6]:
for td in tr.findAll('td'):
print(td.getText(separator="\n"))```
This gets me the genre but only in some pages due to the row count which differs.
Example of page where this does not work
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Anyone knows how to search through string with "genre"? Thank you
在这种特殊情况下,您无需为所有这些烦恼。试试看:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
输出:
0 1
0 First edition cover First edition cover
1 Author J. D. Salinger
2 Cover artist E. Michael Mitchell[1][2]
3 Country United States
4 Language English
5 Genre Realistic fictionComing-of-age fiction
6 Published July 16, 1951
7 Publisher Little, Brown and Company
8 Media type Print
9 Pages 234 (may vary)
10 OCLC 287628
11 Dewey Decimal 813.54
从这里您可以使用标准 pandas 方法提取您需要的任何内容。
我试图在维基百科中获取 table 中的文本,但我会在很多情况下这样做(在这种情况下是书籍)。我想获取图书类型。
Html code for the page
当文本在流派中时,我需要提取包含流派的 td。
我这样做了:
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
for tr in table.findAll('tr')[5:6]:
for td in tr.findAll('td'):
print(td.getText(separator="\n"))```
This gets me the genre but only in some pages due to the row count which differs.
Example of page where this does not work
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Anyone knows how to search through string with "genre"? Thank you
在这种特殊情况下,您无需为所有这些烦恼。试试看:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
输出:
0 1
0 First edition cover First edition cover
1 Author J. D. Salinger
2 Cover artist E. Michael Mitchell[1][2]
3 Country United States
4 Language English
5 Genre Realistic fictionComing-of-age fiction
6 Published July 16, 1951
7 Publisher Little, Brown and Company
8 Media type Print
9 Pages 234 (may vary)
10 OCLC 287628
11 Dewey Decimal 813.54
从这里您可以使用标准 pandas 方法提取您需要的任何内容。