How does .find(text=True) work in BeautifulSoup4?
I am trying to extract a Wikipedia list from https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes using BeautifulSoup.
Here is my code:
import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")
table = soup.find('table', class_="wikitable sortable")  # The class of the list in wikipedia
Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame

for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i in range(9):
            Data[i].append(cells[i].find(text=True))
This works perfectly except for a single value in the Name column, for the "New England" hurricane.
Here is the HTML containing that element:
<td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td>
The entry for that hurricane's name comes out as " ", and I think the space between the <span>
and <a>
tags is causing the problem.
Is there a way to work around this in .find? Is there a smarter way to access lists on Wikipedia?
And how can I avoid this in the future?
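For reference, the behaviour can be reproduced on the offending cell alone: .find(text=True) returns the first text node in document order, and here that is the bare space sitting between the <span> and <a> tags. A small sketch with a trimmed copy of the cell's HTML:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the problematic cell from the question
html = ('<td><span data-sort-value="New England !"> '
        '<a href="/wiki/1938_New_England_hurricane">"New England"</a></span></td>')
cell = BeautifulSoup(html, "html.parser").td

# .find(text=True) returns the first NavigableString descendant,
# which here is the whitespace before the <a> tag, not the name
first_text = cell.find(text=True)
```

So the whitespace node wins over the link text, which is exactly the symptom described above.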
The simplest way to read the table
into a DataFrame is read_html():
import pandas as pd
pd.read_html(wiki)[1]
Output:
Name Dates as aCategory 5 Duration as aCategory 5 Sustainedwind speeds Pressure Areas affected Deaths Damage(USD) Refs
0 "Cuba" October 19, 1924 12 hours 165 mph (270 km/h) 910 hPa (26.87 inHg) Central America, Mexico, CubaFlorida, The Bahamas 90 NaN [12]
1 "San Felipe IIOkeechobee" September 13–14, 1928 12 hours 160 mph (260 km/h) 929 hPa (27.43 inHg) Lesser Antilles, The BahamasUnited States East... 4000 NaN NaN
...
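Instead of indexing the result with [1], read_html can also be pointed at the exact table via its attrs, and the stray quotation marks stripped afterwards. A minimal offline sketch with a toy table in the same markup (the column names here are made up):

```python
from io import StringIO

import pandas as pd

# Toy table using the same markup as the Wikipedia page
html = """<table class="wikitable sortable">
<tr><th>Name</th><th>Deaths</th></tr>
<tr><td>"Cuba"</td><td>90</td></tr>
</table>"""

# attrs narrows the match to the wikitable instead of relying on its position
df = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})[0]
df["Name"] = df["Name"].str.strip('"')  # drop the quotation marks around the names
```

The same attrs call should work against the live page, with wiki in place of the StringIO object.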
To improve your example, you could do the following:
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = requests.get(wiki).content
soup = BeautifulSoup(page, 'lxml')
table = soup.find('table', class_="wikitable sortable")  # The class of the list in wikipedia

data = []
for row in table.select('tr')[1:-1]:
    cells = []
    for cell in row.select('td'):
        cells.append(cell.get_text('', strip=True))
    data.append(cells)
get_text('', strip=True)
gets the text from the td
and strips the whitespace from the front and end.
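Applied to the cell from the question, get_text('', strip=True) sidesteps the problem because strip=True skips text nodes that are entirely whitespace, so the stray space that tripped up .find(text=True) never makes it into the result. A small sketch on a trimmed copy of that HTML:

```python
from bs4 import BeautifulSoup

html = ('<td><span data-sort-value="New England !"> '
        '<a href="/wiki/1938_New_England_hurricane">"New England"</a></span></td>')
cell = BeautifulSoup(html, "html.parser").td

# strip=True drops whitespace-only strings and strips the rest;
# '' joins the remaining pieces with no separator
text = cell.get_text('', strip=True)
```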
This will normalize the text and hopefully give you what you need:
import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')

# The class of the list in wikipedia
table = soup.find('table', class_="wikitable sortable")

Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i, cell in enumerate(cells):
            Data[i].append(cell.text.strip().replace('"', ''))

print(Data)
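Since the stated goal is a DataFrame, the per-column lists in Data can then be zipped with column names. A sketch with made-up column names and a two-row stand-in for the scraped data:

```python
import pandas as pd

# Hypothetical column names and a two-row stand-in for the scraped Data lists
columns = ["Name", "Dates", "Duration", "Winds", "Pressure",
           "Areas affected", "Deaths", "Damage", "Refs"]
Data = [["Cuba", "San Felipe II Okeechobee"],
        ["October 19, 1924", "September 13-14, 1928"]] + [[None, None]] * 7

# Each inner list in Data is one column, so zip them with the names
df = pd.DataFrame(dict(zip(columns, Data)))
```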