BeautifulSoup parsed only one column instead of the entire Wikipedia table in Python
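I am trying to parse the first table located here using BeautifulSoup in Python. It parsed my first column, but for some reason it did not parse the entire table. Any help is appreciated!
Note: I am trying to parse the entire table and convert it into a pandas DataFrame.
My code: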
import requests
from bs4 import BeautifulSoup
WIKI_URL = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(WIKI_URL, features="lxml")
print(soup.prettify())
my_table = soup.find('table',{'class':'wikitable sortable'})
links=my_table.findAll('a')
print(links)
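It parsed only one column because you called findAll only for the links, and those sit in the first column. To parse the entire table, you have to do a findAll for the table rows <tr> and then, within each row, a findAll for the table cells <td>. Right now you are just doing a findAll for the links and then printing them.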
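my_table = soup.find('table',{'class':'wikitable sortable'})
# Walk every row, then join the text of each <td> cell in that row.
# Rows made up only of <th> header cells print as empty lines here.
for row in my_table.findAll('tr'):
    print(','.join([td.get_text(strip=True) for td in row.findAll('td')]))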
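Since the question asks for a pandas DataFrame rather than printed text, the same row-and-cell loop can collect lists instead of printing them. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up three-row stand-in for the Wikipedia page so it runs offline, `table_to_dataframe` is a hypothetical helper name, and it assumes the first row of the table holds the `<th>` headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia markup).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Team</th><th>Won</th><th>Conference</th></tr>
  <tr><td><a href="/wiki/Michigan">Michigan</a></td><td>953</td><td>Big Ten</td></tr>
  <tr><td><a href="/wiki/Texas">Texas</a></td><td>908</td><td>Big 12</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Collect every row of the first 'wikitable sortable' table into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    rows = []
    for tr in table.find_all("tr"):
        # Take header (<th>) and data (<td>) cells alike, in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    # Assumption: the first collected row is the header row.
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)

df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

With the real page you would pass the text returned by requests.get(...) instead of `SAMPLE_HTML`. All values come back as strings, so numeric columns still need converting.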
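Note: Accept B.Adler's solution, as it is good work and sound advice. This solution is offered simply so you can see some alternatives as you're learning.
Whenever I see <table> tags, I usually check pandas first to see if I can get what I need from the table that way. pd.read_html() will return a list of DataFrames, which you can then work with or manipulate to extract what you need.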
import pandas as pd
WIKI_URL = "https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records"
tables = pd.read_html(WIKI_URL)
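You can also look through the DataFrames to see which one has the data you want.
I just used the DataFrame at index position 2, which is the first table you were looking for.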
table = tables[2]
Output:
print (table)
0 1 ... 6 7
0 Team Won ... Total Games Conference
1 Michigan 953 ... 1331 Big Ten
2 Ohio State 1 911 ... 1289 Big Ten
3 Notre Dame 2 897 ... 1263 Independent
4 Boise State 448 ... 618 Mountain West
5 Alabama 3 905 ... 1277 SEC
6 Oklahoma 896 ... 1274 Big 12
7 Texas 908 ... 1311 Big 12
8 USC 4 839 ... 1239 Pac-12
9 Nebraska 897 ... 1325 Big Ten
10 Penn State 887 ... 1319 Big Ten
11 Tennessee 838 ... 1281 SEC
12 Florida State 5 544 ... 818 ACC
13 Georgia 819 ... 1296 SEC
14 LSU 797 ... 1259 SEC
15 Appalachian State 617 ... 981 Sun Belt
16 Georgia Southern 387 ... 616 Sun Belt
17 Miami (FL) 630 ... 1009 ACC
18 Auburn 759 ... 1242 SEC
19 Florida 724 ... 1182 SEC
20 Old Dominion 76 ... 121 C-USA
21 Coastal Carolina 112 ... 180 Sun Belt
22 Washington 735 ... 1234 Pac-12
23 Clemson 744 ... 1248 ACC
24 Virginia Tech 743 ... 1262 ACC
25 Arizona State 614 ... 1032 Pac-12
26 Texas A&M 741 ... 1270 SEC
27 Michigan State 701 ... 1204 Big Ten
28 West Virginia 750 ... 1292 Big 12
29 Miami (OH) 690 ... 1195 MAC
.. ... ... ... ... ...
101 Memphis 482 ... 1026 The American
102 Kansas 582 ... 1271 Big 12
103 Wyoming 526 ... 1122 Mountain West
104 Louisiana 510 ... 1098 Sun Belt
105 Colorado State 520 ... 1124 Mountain West
106 Connecticut 508 ... 1107 The American
107 SMU 489 ... 1083 The American
108 Oregon State 530 ... 1173 Pac-12
109 UTSA 38 ... 82 C-USA
110 Kansas State 526 ... 1207 Big 12
111 New Mexico 483 ... 1103 Mountain West
112 Temple 468 ... 1094 The American
113 Iowa State 524 ... 1214 Big 12
114 Tulane 520 ... 1197 The American
115 Northwestern 535 ... 1240 Big Ten
116 UAB 126 ... 284 C-USA
117 Rice 470 ... 1108 C-USA
118 Eastern Michigan 453 ... 1089 MAC
119 Louisiana-Monroe 304 ... 727 Sun Belt
120 Florida Atlantic 87 ... 205 C-USA
121 Indiana 479 ... 1195 Big Ten
122 Buffalo 370 ... 922 MAC
123 Wake Forest 450 ... 1136 ACC
124 New Mexico State 430 ... 1090 Independent
125 UTEP 390 ... 1005 C-USA
126 UNLV11 228 ... 574 Mountain West
127 Kent State 341 ... 922 MAC
128 FIU 64 ... 191 C-USA
129 Charlotte 20 ... 65 C-USA
130 Georgia State 27 ... 94 Sun Belt
[131 rows x 8 columns]
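In the output above the column labels are 0 through 7 and the real headers ("Team", "Won", ...) sit in row 0. One possible cleanup, sketched here under assumptions, is to promote that first row to the column labels. The `raw` frame below is a made-up three-column excerpt shaped like the output, and `promote_header` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical excerpt shaped like the printed output: integer column labels,
# with the real header names stored in row 0.
raw = pd.DataFrame([
    ["Team", "Won", "Conference"],
    ["Michigan", "953", "Big Ten"],
    ["Texas", "908", "Big 12"],
])

def promote_header(df):
    """Use the first row as column labels and drop it from the data."""
    header = list(df.iloc[0])
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body

table = promote_header(raw)
# The cells in this excerpt are strings, so convert the numeric column explicitly.
table["Won"] = pd.to_numeric(table["Won"])
print(table)
```

pd.read_html() also accepts a header argument (for example header=0), which may make this step unnecessary depending on how the page marks up its header row.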