Saving a whole BeautifulSoup array to Excel using a DataFrame and XlsxWriter inside a for loop

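After reading through a lot of documentation and searching Stack Overflow for answers, I couldn't find a solution to my problem.

Basically, I'm using BeautifulSoup to scrape a list of data from a website and then store it in Excel. The scraping works fine.

When I run my script, it prints all of the items to the terminal. However, when I try to put the result into a DataFrame and save it to Excel, only the last row ends up in the Excel file.

I've tried moving the code that stores the data inside the loop, but the result is the same. I've also tried converting the list back into an array inside the for loop, with the same problem: only the last row is saved to Excel.

I think I'm missing a logical approach here. If someone could point me to what I should be looking for, I'd appreciate it.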

        soup = BeautifulSoup(html, features="lxml")
        soup.find_all("div", {"id":"tbl-lock"})

        for listing in soup.find_all('tr'):

            listing.attrs = {}

            assetTime = listing.find_all("td", {"class": "locked"})
            assetCell = listing.find_all("td", {"class": "assetCell"})
            assetValue = listing.find_all("td", {"class": "assetValue"})

            for data in assetCell:

                array = [data.get_text()]

                ### Excel Heading + data
                df = pd.DataFrame({'Cell': array
                                    })
                print(array)
                # In here it will print all of the data


        ### Now we need to save the data to excel
        ### Create a Pandas Excel writer using XlsxWriter as the Engine
        writer = pd.ExcelWriter(filename+'.xlsx', engine='xlsxwriter')

        ### Convert the dataframe to an XlsxWriter Excel object and skip first row for custom header
        df.to_excel(writer, sheet_name='SheetName', startrow=1, header=False)

        ### Get the xlsxwriter workbook and worksheet objects

        workbook = writer.book
        worksheet = writer.sheets['SheetName']

        ### Custom header for Excel
        header_format = workbook.add_format({
            'bold': True,
            'text_wrap': True,
            'valign': 'top',
            'fg_color': '#D7E4BC',
            'border': 1
        })

        ### Write the column headers with the defined add_format
        print(df) ### In here it will print only 1 line
        for col_num, value in enumerate(df):

            worksheet.write(0, col_num +1, value, header_format)

        ### Close the Pandas Excel writer and output the Excel file
        ### (close() also saves; writer.save() was removed in pandas 2.0)
        writer.close()

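This line is the problem: df = pd.DataFrame({'Cell': array}). It runs on every iteration and overwrites df each time, so only the last row is kept.

Instead, initialize df once before the loop with df = pd.DataFrame(columns=['Cell']) and add each row to it inside the loop (DataFrame.append, the older way to do this, was removed in pandas 2.0, so pd.concat is used):

    df = pd.concat([df, pd.DataFrame({'Cell': array})], ignore_index=True)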
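As an alternative to growing the DataFrame row by row, the scraped values can be collected in a plain Python list and turned into a DataFrame once, after the loop. The sketch below is only an illustration: scraped_cells is a hypothetical stand-in for the text that data.get_text() returns for each assetCell, so it runs without the original page.

```python
import pandas as pd

# Hypothetical stand-in for the text of each scraped "assetCell" <td>;
# in the real script these values come from data.get_text().
scraped_cells = ["AAA", "BBB", "CCC"]

rows = []  # created once, before the loop, so nothing is overwritten
for text in scraped_cells:
    # Accumulate one record per cell instead of rebuilding the DataFrame.
    rows.append({"Cell": text})

# Build the DataFrame a single time, from all collected rows.
df = pd.DataFrame(rows, columns=["Cell"])
print(df)
```

Building the DataFrame once at the end avoids copying the whole frame on every iteration, which is what repeated concatenation does.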

Edit:

Try this:

    soup = BeautifulSoup(html, features="lxml")
    soup.find_all("div", {"id":"tbl-lock"})

    # Create the DataFrame once, before the loops, so it is not overwritten
    df = pd.DataFrame(columns=['Cell'])
    for listing in soup.find_all('tr'):

        listing.attrs = {}

        assetTime = listing.find_all("td", {"class": "locked"})
        assetCell = listing.find_all("td", {"class": "assetCell"})
        assetValue = listing.find_all("td", {"class": "assetValue"})

        for data in assetCell:

            array = [data.get_text()]

            ### Excel Heading + data
            ### (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
            df = pd.concat([df, pd.DataFrame({'Cell': array})], ignore_index=True)
            ## Or this, which keeps each added frame's own index (always 0 here)
            #df = pd.concat([df, pd.DataFrame({'Cell': array})])

            print(array)
            # In here it will print all of the data

... rest of the code
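For the Excel-writing step, here is a rough sketch that uses a with block so the file is saved and closed automatically, without calling writer.save(). It assumes the XlsxWriter package is installed. The function name write_cells_to_excel, the sample data and the temporary output path are made up for illustration; the sheet name and header format are taken from the question.

```python
import os
import tempfile

import pandas as pd


def write_cells_to_excel(df, path, sheet_name="SheetName"):
    """Write df to an .xlsx file with a custom formatted header row."""
    # The with block saves and closes the workbook on exit.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        # Leave row 0 free for the custom header; skip the default header.
        df.to_excel(writer, sheet_name=sheet_name, startrow=1, header=False)

        workbook = writer.book
        worksheet = writer.sheets[sheet_name]

        header_format = workbook.add_format({
            "bold": True,
            "text_wrap": True,
            "valign": "top",
            "fg_color": "#D7E4BC",
            "border": 1,
        })

        # Column 0 holds the index, so the data headers start at column 1.
        for col_num, value in enumerate(df.columns):
            worksheet.write(0, col_num + 1, value, header_format)


# Hypothetical sample data standing in for the scraped cells.
df = pd.DataFrame({"Cell": ["AAA", "BBB", "CCC"]})
out_path = os.path.join(tempfile.mkdtemp(), "cells.xlsx")
write_cells_to_excel(df, out_path)
```

Because the header loop only writes cells and the with block handles saving, the workbook is written exactly once regardless of how many columns the DataFrame has.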