Scrape data from webpage with BeautifulSoup - How to append data to existing dataframe?

Question

With the following code I am trying to scrape data from a website (reference: https://towardsdatascience.com/web-scraping-scraping-table-data-1665b6b2271c):

df = pd.DataFrame(columns=headings)
for i in range (102,158):
    URL = 'http://BuLidata.xyz/'
    URL_ = URL + 'B100' +str(i+1) + '.html'
    r = urllib.request.urlopen(URL_).read()
    soup = BeautifulSoup(r,'lxml')
    table = soup.find('table' ,attrs={'class':'abschluss'})
    body = table.find_all("tr")
    head = body[0]
    body_rows = body[1:]
    headings = []
    for item in head.find_all('th'):
        item = (item.text).rstrip('\n')
        headings.append(item)
    all_rows = [] 
    for row_num in range(len(body_rows)):
        row = []
        for row_item in body_rows[row_num].find_all("td"):
            aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
            row.append(aa)
        all_rows.append(row)
    df1 = pd.DataFrame(data=all_rows,columns=headings)
    df.append(df1, ignore_index=True)

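I "initialized" the dataframe as an empty dataframe with only the correct column names, and then tried to use a loop to go through the data on the website. It seems to work in part, because df1 holds the data from the last website link. But df is still the initialized empty dataframe. What am I doing wrong here?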

Answer 1

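This is not the best strategy for appending to a dataframe. Use a Python data structure such as a list or dict instead, then concatenate the pieces at the end of the loop to get your dataframe: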

data = []
for i in range(102, 158):
    # do stuff here
    df1 = ...
    data.append(df1)
df = pd.concat(data, ignore_index=True)

Output:

>>> df
     Platz             Mannschaft Spiele    S-U-N        Tore Pkt.         Statistik
0       1.       TSV 1860 München     34  20-10-4  80:40(+40)   50  Saison 1965/1966
1       2.      Borussia Dortmund     34   19-9-6  70:36(+34)   47  Saison 1965/1966
2       3.         Bayern München     34   20-7-7  71:38(+33)   47  Saison 1965/1966
3       4.          Werder Bremen     34  21-3-10  76:40(+36)   45  Saison 1965/1966
4       5.             1. FC Köln     34   19-6-9  74:41(+33)   44  Saison 1965/1966
...    ...                    ...    ...      ...         ...  ...               ...
1005   14.             Hertha BSC     34  8-11-15  41:52(-11)   35  Saison 2020/2021
1006   15.  DSC Arminia Bielefeld     34   9-8-17  26:52(-26)   35  Saison 2020/2021
1007   16.             1. FC Köln     34   8-9-17  34:60(-26)   33  Saison 2020/2021
1008   17.       SV Werder Bremen     34  7-10-17  36:57(-21)   31  Saison 2020/2021
1009   18.          FC Schalke 04     34   3-7-24  25:86(-61)   16  Saison 2020/2021

[1010 rows x 7 columns]
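
As for why df stayed empty in the question: DataFrame.append returned a new dataframe instead of modifying df in place, so its result was discarded on every iteration (the method was also removed in pandas 2.0). Below is a minimal sketch of the collect-then-concat pattern that runs without network access. The `pages` list is a hypothetical stand-in for the scraping step, holding a few rows copied from the output above with only three of the seven columns; in real use each entry would come from parsing one page.

```python
import pandas as pd

# Hypothetical stand-in for the scraped pages: each entry is the
# (headings, rows) pair that the parsing loop would produce for one URL.
pages = [
    (
        ["Platz", "Mannschaft", "Pkt."],
        [["1.", "TSV 1860 München", "50"], ["2.", "Borussia Dortmund", "47"]],
    ),
    (
        ["Platz", "Mannschaft", "Pkt."],
        [["17.", "SV Werder Bremen", "31"], ["18.", "FC Schalke 04", "16"]],
    ),
]

frames = []  # plain Python list that collects one dataframe per page
for headings, rows in pages:
    df1 = pd.DataFrame(data=rows, columns=headings)
    frames.append(df1)  # list.append mutates the list in place

# Build the final dataframe once, after the loop has finished.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Concatenating once at the end also avoids copying the growing dataframe on every iteration, which is what repeated appends would do.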

python

for-loop

beautifulsoup

web-scraping

pandas