How do you append rows to an xlsx file when using BeautifulSoup and pandas to scrape?
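So, I've been looking around, but I can't figure out why the results of my scrape won't write to an xlsx file.
I'm running through a list of URLs taken from a .csv file. I put 10 URLs in there and BeautifulSoup scrapes them. If I just print the dataframe, every result comes out.
If I try to save the results as xlsx (preferred) or csv, I only get the result for the last URL.
If I run this, it prints out perfectly: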
import csv
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
with open('G-Sauce_Urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        # The first column of each csv row holds the URL to scrape
        r = requests.get(line[0]).text
        soup = BeautifulSoup(r, 'lxml')
        business = soup.find('title')
        companys = business.get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        Website = soup.select('head > link:nth-child(4)')
        profile = (Website[0].attrs['href'])
        data = {'Required': [companys], 'Required_no_Email': [phones], 'Business_Fax': [profile]}
        df = pd.DataFrame(data, columns=['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax'])
But I can't seem to append it to an xlsx file. I only get the last result, which I assume is because it is just "writing" rather than appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
and
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
    writer.save()
and
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
I also started reading about openpyxl, but at this point I'm confused and I don't understand it.
Any help is appreciated.
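You are iterating over the csv data line by line, but you recreate the dataframe on every iteration, so each pass throws away the values from the previous one. Create df outside the loop first, then add data to it inside the for loop.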
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
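Your assumption that you are writing rather than appending is correct, but you need to append to the dataframe and then write it to Excel, rather than append the data to the Excel file (if I understand you correctly).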
data = {'Required': [companys], 'Required_no_Email': [phones], 'Business_Fax': [profile]}
# Use this in place of the per-iteration pd.DataFrame(data, columns=[...]) call in your original code;
# that call is no longer required because df is already defined outside the loop.
df = df.append(data, ignore_index=True)
# Note: DataFrame.append was removed in pandas 2.0; on newer versions the equivalent is
# df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)
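As an alternative sketch of the same idea, you can collect one dict per URL in a plain list and build the dataframe once after the loop, which avoids growing the frame row by row. The `build_frame` helper and the sample values below are hypothetical stand-ins for what the scraping loop would produce, not part of the original code.

```python
import pandas as pd

COLUMNS = ['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax']


def build_frame(records):
    """Collect one dict per scraped page, then build the DataFrame once."""
    rows = []
    for record in records:
        # In the real loop this dict would come from the BeautifulSoup lookups.
        rows.append(record)
    # Columns missing from the dicts ('First', 'Last') are filled with NaN.
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical stand-ins for two scraped pages.
sample_records = [
    {'Required': 'Company A', 'Required_no_Email': 'Call 555-0100',
     'Business_Fax': 'https://example.com/a'},
    {'Required': 'Company B', 'Required_no_Email': 'Call 555-0101',
     'Business_Fax': 'https://example.com/b'},
]
df = build_frame(sample_records)
print(df.shape)
```

Because every row is already in `df` when the loop ends, a single write to xlsx or csv afterwards contains all URLs rather than only the last one.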
pd.ExcelWriter only produces output when you run:
writer.save()
I have similar code that opens the file with the following arguments, and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.save()
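A minimal sketch of that single write, assuming the full dataframe has already been built: using `pd.ExcelWriter` as a context manager saves and closes the file when the block exits, so no explicit save call is needed (as far as I know, `writer.save()` is no longer available in recent pandas versions). The `save_frame` name and its parameters are made up for illustration.

```python
import pandas as pd


def save_frame(df, path, sheet_name='Sheet1', startrow=0):
    """Write the whole DataFrame to a new xlsx file in one go."""
    # Leaving the with block saves and closes the workbook.
    with pd.ExcelWriter(path) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False, startrow=startrow)
```

To mirror the layout from the question you could call it as `save_frame(df, 'ProspectUploadSheetRob.xlsx', startrow=4)`, which leaves the first four rows of the sheet empty.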
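Note that, according to the documentation, 'w' (write) is the default mode, even when modifying an existing object. Although it is not explained very well, append mode is mentioned only for adding entirely new Excel objects (sheets, etc.) or for "extending" a document with another dataframe that has exactly the same structure and format.
To make this reproducible, you could add a template xlsx, but I hope this helps. Let me know.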
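If you really do need to add rows to a workbook that already exists, here is a rough sketch, under several assumptions: pandas 1.4 or newer (for `if_sheet_exists='overlay'`), the openpyxl engine (the xlsxwriter engine does not support append mode), and a target sheet that already exists and has at least a header row. `append_rows` is a hypothetical helper name.

```python
import pandas as pd


def append_rows(df, path, sheet_name='Sheet1'):
    """Append df below the existing rows of sheet_name in an existing xlsx file."""
    with pd.ExcelWriter(path, engine='openpyxl', mode='a',
                        if_sheet_exists='overlay') as writer:
        # max_row is the 1-based index of the last used row, which equals
        # the 0-based index of the first free row.
        first_free_row = writer.sheets[sheet_name].max_row
        # header=False so the column names are not written a second time.
        df.to_excel(writer, sheet_name=sheet_name, index=False,
                    header=False, startrow=first_free_row)
```

The new rows are written positionally, so `df` has to use the same column order as the sheet. For a scrape of ten URLs, building the dataframe in memory and writing it once is simpler than reopening the file for every row.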