Scrape data from a <span> within a <div> with BeautifulSoup, Requests and Pandas

Question

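I am trying to extract "T", "0-0" and "(2 OT)" from the HTML below. I started writing the code that follows, but I am too new to this to work out the rest. Thanks for your help.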


    <div class="sidearm-schedule-game-details flex item-1 columns">
        <div class="sidearm-schedule-game-result text-italic">
            <span></span>
            <span>T,</span>
            <span>0-0</span>
            <span>(2 OT)</span>
        </div>
    </div>


    import requests
    import pandas as pd
    from pandas import ExcelWriter
    from bs4 import BeautifulSoup


    url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
    school = requests.get(url).text
    soup = BeautifulSoup(school,'lxml')

    rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

    # Collect one record per game row, then build the DataFrame once.
    records = []
    for row in rows:
        # Text of the result <div>, e.g. the W/L/T letter, the score and any overtime note.
        result = row.find('div',class_="sidearm-schedule-game-result").text.strip()
        records.append({'result': result})

    sheet = pd.DataFrame(records)

Answer 1

I think you are looking for something like this:

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup


url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

# Collect the results in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
records = []
for row in rows:
    result = row.find('div',class_="sidearm-schedule-game-result").text.strip().replace('\n', ', ')
    records.append(result)

sheet = pd.DataFrame(records, columns=['result'])

This makes the contents of sheet look like:

           result
0          L, 1-2
1     L, 1-2 (OT)
2          W, 1-0
3          W, 1-0
4          L, 1-2
5   W, 1-0 (2 OT)
6   T, 0-0 (2 OT)
7          W, 3-0
8     L, 2-3 (OT)
9     W, 2-1 (OT)
10         W, 1-0
11         W, 1-0
12         L, 0-1
13  T, 0-0 (2 OT)
14         L, 0-1
15         W, 1-0
16         L, 0-1
17         W, 3-1
18         L, 1-2
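
If each piece is needed on its own rather than as one string, one option is to read the <span> elements individually. The following is only a minimal sketch: it runs against the HTML fragment from the question instead of the live page, it uses the built-in html.parser so lxml is not required, and the names SAMPLE_HTML and parse_result are hypothetical.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (not fetched from the live site).
SAMPLE_HTML = """
<div class="sidearm-schedule-game-details flex item-1 columns">
    <div class="sidearm-schedule-game-result text-italic">
        <span></span>
        <span>T,</span>
        <span>0-0</span>
        <span>(2 OT)</span>
    </div>
</div>
"""


def parse_result(html_text):
    """Return the non-empty <span> texts of the first result <div>."""
    soup = BeautifulSoup(html_text, "html.parser")
    result_div = soup.find("div", class_="sidearm-schedule-game-result")
    if result_div is None:
        return []
    texts = (span.get_text(strip=True) for span in result_div.find_all("span"))
    # Skip empty spans and drop the comma that follows the W/L/T letter.
    return [text.rstrip(",") for text in texts if text]


print(parse_result(SAMPLE_HTML))
```

Rows without an overtime note would simply return a shorter list, so pad the list before turning it into fixed DataFrame columns.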

Answer 2

Using XPath alone, I would do something like this:

    import requests
    from lxml import html

    page = requests.get('https://lehighsports.com/sports/mens-soccer/schedule/2018')
    tree = html.fromstring(page.content)

    # Select every <div> whose class attribute contains "sidearm-schedule-game-result".
    results = tree.xpath('//div[contains(@class, "sidearm-schedule-game-result")]')
    for each in results:
        # './/' searches only below the selected node; text() extracts each <span>'s text.
        spans = each.xpath('.//span/text()')
        print(spans)
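
To try the XPath expressions without hitting the site, here is a small sketch that applies them to the HTML fragment from the question. It assumes lxml is installed; SAMPLE_HTML and rows are illustrative names, not part of the original answer.

```python
from lxml import html

# HTML fragment copied from the question (not fetched from the live site).
SAMPLE_HTML = """
<div class="sidearm-schedule-game-details flex item-1 columns">
    <div class="sidearm-schedule-game-result text-italic">
        <span></span>
        <span>T,</span>
        <span>0-0</span>
        <span>(2 OT)</span>
    </div>
</div>
""".strip()

tree = html.fromstring(SAMPLE_HTML)

rows = []
# contains() is needed because the class attribute holds several class names.
for div in tree.xpath('//div[contains(@class, "sidearm-schedule-game-result")]'):
    # An empty <span> has no text node, so it contributes nothing here.
    texts = [t.strip() for t in div.xpath('.//span/text()') if t.strip()]
    rows.append(texts)

print(rows)
```

Each inner list holds the span texts of one result <div>, in document order.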

Answer 3

You can use the re module to parse the text inside the <span> elements and store each piece of information in its own column: Result, Score and OT.

For example:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')

rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

data = []
for row in rows:
    opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
    name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']

    result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0]

    data.append([opponent, *result, name_date])

df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
print(df)

Prints:

                            Name Result Score      OT                                             Info
0      University of Connecticut      L   1-2                                UConn on August 24 7 p.m.
1              Drexel University      L   1-2    (OT)                       Drexel on August 27 7 p.m.
2   George Washington University      W   1-0                  George Washington on September 1 4 p.m.
3          St. John's University      W   1-0                      St. John's on September 4 7:30 p.m.
4          Binghamton University      L   1-2                         Binghamton on September 7 8 p.m.
5               Rider University      W   1-0  (2 OT)                     Rider on September 11 7 p.m.
6     University of Pennsylvania      T   0-0  (2 OT)                      Penn on September 15 6 p.m.
7                           Army      W   3-0                              Army on September 22 7 p.m.
8             Cornell University      L   2-3    (OT)                   Cornell on September 25 7 p.m.
9              Boston University      W   2-1    (OT)                  Boston U on September 29 4 p.m.
10            Colgate University      W   1-0                              Colgate on October 3 7 p.m.
11   United States Naval Academy      W   1-0                                 Navy on October 6 6 p.m.
12             Lafayette College      L   0-1                          Lafayette on October 13 12 p.m.
13             Dartmouth College      T   0-0  (2 OT)                   Dartmouth on October 16 6 p.m.
14           American University      L   0-1                            American on October 20 6 p.m.
15           Bucknell University      W   1-0                            Bucknell on October 24 7 p.m.
16       Loyola University (Md.)      L   0-1                        Loyola (Md.) on October 27 3 p.m.
17                    Holy Cross      W   3-1                          Holy Cross on November 3 6 p.m.
18            Colgate University      L   1-2          No. 3 Colgate (Semifinals) on November 9 7 p.m.
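
To see what the regular expression does in isolation, here is a short sketch that applies the same pattern to a few result strings taken from the output above. The names RESULT_PATTERN and split_result are hypothetical helpers, and returning None for non-matching text is an assumption added for illustration (the code above would raise an IndexError instead).

```python
import re

# Same pattern as above: one capital letter, a comma, the score, then any trailing note.
RESULT_PATTERN = re.compile(r'([A-Z]),\s+([\d-]+)\s*(.*)')


def split_result(text):
    """Split 'T, 0-0 (2 OT)' into (result, score, overtime); None if it does not match."""
    match = RESULT_PATTERN.match(text)
    if match is None:
        return None
    return match.groups()


for sample in ["L, 1-2", "T, 0-0 (2 OT)", "W, 2-1 (OT)"]:
    print(sample, "->", split_result(sample))
```

The third group is an empty string when the game had no overtime, which is why the OT column is blank for those rows.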
