Scrape data from a <span> within a <div> with BeautifulSoup, Requests, and Pandas
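I'm trying to extract "T", "0-0", and "(2 OT)" from the HTML below. I started writing the code that follows, but I'm too new to this to figure it out. Thanks for your help.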
<div class="sidearm-schedule-game-details flex item-1 columns">
    <div class="sidearm-schedule-game-result text-italic">
        <span></span>
        <span>T,</span>
        <span>0-0</span>
        <span>(2 OT)</span>
    </div>
import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup
url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')
rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
sheet = pd.DataFrame()
for row in rows:
    result = row.find('div',class_="sidearm-schedule-game-result").text.strip()
    df = pd.DataFrame([[result]], columns=['result'])
    sheet = sheet.append(df,sort=True).reset_index(drop=True)
results.append(sheet)
I think you are looking for something like this:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school, 'lxml')
rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

results = []
for row in rows:
    # Join the text of the result <span>s into one string per game.
    result = row.find('div', class_="sidearm-schedule-game-result").text.strip().replace('\n', ', ')
    results.append(result)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected list.
sheet = pd.DataFrame(results, columns=['result'])
This makes the contents of sheet look like:
result
0 L, 1-2
1 L, 1-2 (OT)
2 W, 1-0
3 W, 1-0
4 L, 1-2
5 W, 1-0 (2 OT)
6 T, 0-0 (2 OT)
7 W, 3-0
8 L, 2-3 (OT)
9 W, 2-1 (OT)
10 W, 1-0
11 W, 1-0
12 L, 0-1
13 T, 0-0 (2 OT)
14 L, 0-1
15 W, 1-0
16 L, 0-1
17 W, 3-1
18 L, 1-2
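If the three pieces are wanted separately rather than as one string, each <span> can be read on its own. The following is a minimal sketch, not part of the original answer: it runs against the snippet from the question instead of the live page, uses the built-in "html.parser" backend so it has no lxml dependency, and the helper name span_texts is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, so the sketch runs without network access.
SAMPLE_HTML = """
<div class="sidearm-schedule-game-details flex item-1 columns">
  <div class="sidearm-schedule-game-result text-italic">
    <span></span>
    <span>T,</span>
    <span>0-0</span>
    <span>(2 OT)</span>
  </div>
</div>
"""


def span_texts(html_text):
    """Return the non-empty <span> texts of the first result <div> (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    result_div = soup.find("div", class_="sidearm-schedule-game-result")
    texts = [span.get_text(strip=True) for span in result_div.find_all("span")]
    # Drop the empty leading <span> and the trailing comma on the result letter.
    return [text.rstrip(",") for text in texts if text]


parts = span_texts(SAMPLE_HTML)
print(parts)  # ['T', '0-0', '(2 OT)']
```

On the real schedule page the same function body could be applied to each row; games without overtime would presumably yield only two items, so the caller should not assume a fixed length.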
Using only XPath, I would do something like this:
import requests
from lxml import html

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
tree = html.fromstring(requests.get(url).content)
# Select every <div> whose class attribute contains "sidearm-schedule-game-result".
divs = tree.xpath('//div[contains(@class, "sidearm-schedule-game-result")]')
for div in divs:
    # './/' only looks at descendants of the selected <div>; text() extracts the text of each <span>.
    spans = div.xpath('.//span/text()')
    print(spans)
You can use the re module to parse the text in the <span> elements and store each piece of information in its own column: Result, Score, and OT.
For example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')
rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
data = []
for row in rows:
    # Opponent name from the logo's alt text, minus its last word.
    opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
    name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']
    # Capture the result letter, the score, and an optional overtime note.
    result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0]
    data.append([opponent, *result, name_date])
df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
print(df)
Prints:
    Name                          Result  Score  OT      Info
0   University of Connecticut     L       1-2            UConn on August 24 7 p.m.
1   Drexel University             L       1-2    (OT)    Drexel on August 27 7 p.m.
2   George Washington University  W       1-0            George Washington on September 1 4 p.m.
3   St. John's University         W       1-0            St. John's on September 4 7:30 p.m.
4   Binghamton University         L       1-2            Binghamton on September 7 8 p.m.
5   Rider University              W       1-0    (2 OT)  Rider on September 11 7 p.m.
6   University of Pennsylvania    T       0-0    (2 OT)  Penn on September 15 6 p.m.
7   Army                          W       3-0            Army on September 22 7 p.m.
8   Cornell University            L       2-3    (OT)    Cornell on September 25 7 p.m.
9   Boston University             W       2-1    (OT)    Boston U on September 29 4 p.m.
10  Colgate University            W       1-0            Colgate on October 3 7 p.m.
11  United States Naval Academy   W       1-0            Navy on October 6 6 p.m.
12  Lafayette College             L       0-1            Lafayette on October 13 12 p.m.
13  Dartmouth College             T       0-0    (2 OT)  Dartmouth on October 16 6 p.m.
14  American University           L       0-1            American on October 20 6 p.m.
15  Bucknell University           W       1-0            Bucknell on October 24 7 p.m.
16  Loyola University (Md.)       L       0-1            Loyola (Md.) on October 27 3 p.m.
17  Holy Cross                    W       3-1            Holy Cross on November 3 6 p.m.
18  Colgate University            L       1-2            No. 3 Colgate (Semifinals) on November 9 7 p.m.
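To see what the regular expression in this answer captures, it can be tried on a few result strings taken from the output above, without any scraping. This is only a sketch: the name split_result is invented for illustration, and unlike the findall(...)[0] call in the answer it returns None instead of raising IndexError when a string does not match.

```python
import re

# Same pattern as in the answer: result letter, score, optional trailing note.
RESULT_PATTERN = re.compile(r'([A-Z]),\s+([\d-]+)\s*(.*)')


def split_result(text):
    """Split 'T, 0-0 (2 OT)' into (result, score, overtime); hypothetical helper."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        # No "X, n-n" shape found, e.g. a game without a recorded result.
        return None
    return match.groups()


for sample in ['T, 0-0 (2 OT)', 'L, 1-2', 'W, 2-1 (OT)']:
    print(split_result(sample))
# ('T', '0-0', '(2 OT)')
# ('L', '1-2', '')
# ('W', '2-1', '(OT)')
```

The third group is empty for games decided in regulation, which is why the OT column is blank for those rows.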