How to fix scrape web table output csv with python and bs4
Please help me!
I want to extract two pieces of data from the "td" cells — the "Barcode" and the "nama produk" — but the data I get is a mess. What should I fix?
import csv
import requests
from bs4 import BeautifulSoup

outfile = open("dataaa.csv", "w", newline='')
writer = csv.writer(outfile)

page = 0
while page < 3:
    url = "http://ciumi.com/cspos/barcode-ritel.php?page={:d}".format(page)
    response = requests.get(url)
    tree = BeautifulSoup(response.text, 'html.parser')
    page += 1
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("tr")]
                for row_data in table_tag.select("td")]
    for data in tab_data:
        writer.writerow(data)
    print(table_tag)
    print(response, url, ' '.join(data))
import fileinput

seen = set()
for line in fileinput.FileInput('dataaa.csv', inplace=1):
    if line in seen:
        continue
    seen.add(line)
    print(line)
What do I need to improve to get clean results?
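The core problem in the snippet above is the inverted selection: `table_tag.select("td")` iterates over individual cells, and `row_data.select("tr")` then finds no rows inside a cell. A minimal sketch of row-first extraction, using a tiny made-up table standing in for the real page markup:

```python
from bs4 import BeautifulSoup

# A small stand-in for the page's table; the real markup may differ.
html = """
<table>
  <tr><td><center>8992694242533</center></td><td>ZWITSAL SOAP 80G PACK 4</td></tr>
  <tr><td><center>8992694247163</center></td><td>ZWITSAL SOAP 80G MILK&amp;HONEY</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Iterate rows (tr) first, then cells (td) within each row --
# the question's code does this the other way around, so each
# "row" it writes is really a single cell.
rows = [[td.get_text(strip=True) for td in tr.select("td")]
        for tr in soup.select("table tr")]
print(rows)
```

Each inner list is now one CSV row with the barcode and the product name side by side.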
You can simplify this with pandas. Pandas uses BeautifulSoup under the hood to parse tables, by the way:
import pandas as pd

frames = []
for page in range(1, 3):
    url = 'http://ciumi.com/cspos/barcode-ritel.php?page=%s' % page
    frames.append(pd.read_html(url)[0])

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
results_df = pd.concat(frames, sort=True)
results_df.columns = ['Barcode', 'Nama Produk']
results_df = results_df.reset_index(drop=True)
results_df.to_csv('dataaa.csv', index=False)
Output:
print (results_df)
Barcode Nama Produk
0 8992694242533 ZWITSAL SOAP 80G PACK 4
1 8992694247163 ZWITSAL SOAP 80G MILK&HONEY
2 8992694242502 ZWITSAL SOAP 80G CLASSIC
3 8992694245435 ZWITSAL SKIN GUARD LOT 100ML SPRAY
4 8992694246074 ZWITSAL SHP 600ML C&R
5 8992694242908 ZWITSAL SHP 50ML REBORN
6 8992694020025 ZWITSAL SHP 500ML REF AVKS
7 8992694246333 ZWITSAL SHP 500ML C&R REF
8 8992694246364 ZWITSAL SHP 300ML AVKS
9 8992694246319 ZWITSAL SHP 250ML REF CLEAN&R
10 8992694246357 ZWITSAL SHP 250ML REF AVKS
11 8992694242922 ZWITSAL SHP 200ML REBORN
12 8992694242915 ZWITSAL SHP 100ML CLASSIC
13 8992694246340 ZWITSAL SHP 100ML AVKS
14 8992694242601 ZWITSAL PWD 50G SOFTFLOWER
15 8992694244254 ZWITSAL PWD 50G FRESH
16 8992694242656 ZWITSAL PWD 500G SOFTFLORAL
17 8992694241055 ZWITSAL PWD 500G FRESH F
18 8992694244056 ZWITSAL PWD 300G SOFT FLORAL
19 8992694244513 ZWITSAL PWD 300G MILK&HONEY
It looks like the pages start at 1, so my range loop starts there. You can then use a Session object for the efficiency of re-using the connection. If you choose your css selectors judiciously, all the filtering can be done at that level, and you only handle the required elements you retrieve. You can also use the lighter-weight csv import rather than the heavier pandas one.

Requires bs4 4.7.1+ to leverage the :has pseudo selector.
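A minimal sketch of what `:has` buys you here (the tiny HTML fragment is made up to stand in for the real page):

```python
from bs4 import BeautifulSoup  # requires bs4 4.7.1+ for the :has pseudo selector

html = """
<table>
  <tr><td><center>123</center></td><td>PRODUCT A</td></tr>
  <tr><td>no center here</td><td>skipped row</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# td:has(center) matches only cells that contain a <center> child,
# and "+ td" then jumps to the cell immediately to its right.
names = [td.get_text(strip=True) for td in soup.select('td:has(center) + td')]
print(names)  # only the second cell of the row whose first cell has <center>
```

The second row is skipped entirely because its first cell has no `center` child, so no filtering code is needed after the selection.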
Quick explanation:

The following selects the first-column barcodes by targeting only the center elements, using the type selector center:

soup.select('center')

Then

soup.select('td:has(center) + td')

selects the second column by using an adjacent sibling combinator to get the table cell immediately to the right of a left-hand table cell (td) that has a center child element.

The retrieved tag lists have their .text extracted and stripped in list comprehensions; the two lists are then zipped, converted back to a list, and appended to the final list results, which is later looped over to write the csv.

The css selectors are kept minimal to allow faster matching.
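The strip-and-zip step in isolation, with made-up column lists standing in for the two CSS selections:

```python
# Two parallel column lists, as the two selections above would return
# (values here are invented for illustration).
barcodes = [' 8992694242533 ', ' 8992694247163 ']
names = [' ZWITSAL SOAP 80G PACK 4 ', ' ZWITSAL SOAP 80G MILK&HONEY ']

# Strip whitespace in list comprehensions, then zip the columns into rows.
rows = list(zip([b.strip() for b in barcodes],
                [n.strip() for n in names]))
print(rows)
# [('8992694242533', 'ZWITSAL SOAP 80G PACK 4'),
#  ('8992694247163', 'ZWITSAL SOAP 80G MILK&HONEY')]
```

Each tuple is one ready-made CSV row, which is why `writer.writerow(line)` needs no further processing.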
import requests, csv
from bs4 import BeautifulSoup as bs

results = []

with requests.Session() as s:
    for page in range(1, 4):  # pages start at 1, and assuming you actually want the first 3
        r = s.get(f'http://ciumi.com/cspos/barcode-ritel.php?page={page}')
        soup = bs(r.content, 'lxml')
        results += list(zip([i.text.strip() for i in soup.select('center')],
                            [i.text.strip() for i in soup.select('td:has(center) + td')]))

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Barcode', 'Nama Produk'])
    for line in results:
        w.writerow(line)
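To sanity-check the output, here is a self-contained round trip using the same csv settings (the row is a made-up sample, and the file name `check.csv` is just for illustration). Note that `utf-8-sig` writes a BOM so Excel auto-detects the encoding:

```python
import csv

rows = [('8992694242533', 'ZWITSAL SOAP 80G PACK 4')]

# utf-8-sig prefixes a BOM so Excel auto-detects UTF-8;
# reading back with the same codec strips the BOM again.
with open("check.csv", "w", encoding="utf-8-sig", newline='') as f:
    w = csv.writer(f)
    w.writerow(['Barcode', 'Nama Produk'])
    w.writerows(rows)

with open("check.csv", encoding="utf-8-sig", newline='') as f:
    back = list(csv.reader(f))
print(back)  # header plus data rows, round-tripped intact
```

If you read the file back with plain `utf-8` instead, the BOM would appear glued to the first header cell, which is a common source of "invisible" header mismatches.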