How can I scrape information from HTML elements and save it to Excel rows using BeautifulSoup and an Excel writer (Pandas)?
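I am new to Python and am doing this for a project. Can someone help me save the scraped data to an Excel file?
This has to be done for multiple site URLs, so each result needs to go into a new row in Excel. Sample HTML code is attached below. Please help me save it into Excel rows and columns, and show me how to iterate over it with a for loop.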
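I suggest using openpyxl directly rather than going through Pandas, since that gives you more control over the formatting of the Excel file.
Here is how to build up multiple rows in an Excel file: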
import requests
from bs4 import BeautifulSoup
import openpyxl
from openpyxl.styles import Alignment
from openpyxl.styles.borders import Border, Side
from openpyxl.utils import get_column_letter

website_url = "https://www.example.com/"

# Fetch the listing page and collect the href of every job title link.
# verify=False disables TLS certificate checks; drop it if the site's certificate is valid.
res = requests.get(website_url, verify=False)
soup = BeautifulSoup(res.text, 'lxml')
links = soup.find_all("a", class_="jobTitleLink")
urls = [tag.get('href') for tag in links]

wb = openpyxl.Workbook()
ws = wb.active

# Write a header row: (column title, column width)
columns = [
    ("SL No", 10),
    ("Job Title", 25),
    ("Company Name", 20),
    ("Posted on", 13),
    ("Closing on", 13),
    ("Location", 20),
    ("Description", 40),
    ("Skills", 70),
    ("Link Email", 30),
]

thin_border = Border(left=Side(style='thin'), right=Side(style='thin'),
                     top=Side(style='thin'), bottom=Side(style='thin'))

for col_number, (value, width) in enumerate(columns, start=1):
    ws.cell(column=col_number, row=1, value=value).border = thin_border
    ws.column_dimensions[get_column_letter(col_number)].width = width

row_number = 2  # data rows start below the header

# Visit the 2nd to 5th job links; widen the slice to scrape more pages.
for x in urls[1:5]:
    res = requests.get(f'https://www.example.com/{x}', verify=False)
    soup = BeautifulSoup(res.text, 'lxml')

    # Collect the text lines of each div with class "block"
    data = []
    for div_block in soup.find_all('div', class_='block', style=None):
        data.append([line.strip() for line in div_block.stripped_strings])

    li_fr = soup.find('li', class_="fr")
    company_name = li_fr.a.text
    location = list(li_fr.find_next_sibling('li').stripped_strings)[1]

    # Write a data row, in the same order as the header columns
    row = [
        '',                      # SL No (left blank)
        data[0][0],              # Job title
        company_name,            # Company name
        data[1][1],              # Posted on
        data[2][1],              # Closing on
        location,                # Location
        data[4][1],              # Description
        '\n'.join(data[5][1:]),  # Skills, one per line
        data[3][1],              # Link Email
    ]

    for col_number, value in enumerate(row, start=1):
        cell = ws.cell(column=col_number, row=row_number, value=value)
        cell.border = thin_border
        cell.alignment = Alignment(wrapText=True)

    row_number += 1

wb.save('output.xlsx')
print('Saved all the data')
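This produces an Excel sheet with a bordered header row followed by one row per job listing.
With some extra work, you can apply any styling you like.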
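One detail worth a closer look is how each job URL is built: the script pastes the href onto a fixed prefix, which can misfire when an href is already absolute or starts with a slash. A minimal sketch of a sturdier approach, using `urllib.parse.urljoin` from the standard library, might look like this. The href values below are made up for illustration and stand in for whatever `tag.get('href')` returns on the real site:

```python
from urllib.parse import urljoin

# Base page the links were scraped from (same placeholder as in the script).
website_url = "https://www.example.com/"

# Hypothetical href values, standing in for tag.get('href') results.
hrefs = ["jobs/123", "/jobs/456", "https://www.example.com/jobs/789", None]

# Resolve every href against the base URL, skipping tags without an href.
job_urls = [urljoin(website_url, href) for href in hrefs if href]

for job_url in job_urls:
    print(job_url)
```

Each resulting URL can then be passed to `requests.get(...)` inside the loop in place of the hand-built string.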
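Because the script indexes straight into the scraped blocks (for example `data[2][1]`), a page with a missing field would stop the whole run with an `IndexError`. The sketch below shows one way to guard against that and to fill in the SL No column at the same time. The `pick` helper and the sample `pages` data are hypothetical, introduced only for illustration, and assume each page yields a list of text blocks like the `data` list above:

```python
def pick(blocks, block_index, item_index, default=""):
    """Return blocks[block_index][item_index], or default if either index is missing."""
    try:
        return blocks[block_index][item_index]
    except IndexError:
        return default


# Hypothetical scraped blocks for two pages; the second page lacks a closing date.
pages = [
    [["Data Analyst"], ["Posted on", "01 Jan"], ["Closing on", "31 Jan"]],
    [["Web Developer"], ["Posted on", "05 Jan"], ["Closing on"]],
]

rows = []
for serial, data in enumerate(pages, start=1):
    rows.append([
        serial,            # SL No, filled in instead of left blank
        pick(data, 0, 0),  # Job title
        pick(data, 1, 1),  # Posted on
        pick(data, 2, 1),  # Closing on
    ])

for row in rows:
    print(row)
```

Each list in `rows` can then be written with the same `ws.cell(...)` loop as in the script, or with openpyxl's `ws.append(row)` if per-cell styling is applied afterwards.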