How to fix scrape web table output csv with python and bs4

Please help me. I want to get two pieces of data from the "td" cells, "Barcode" and "nama produk", but the data I am getting is very messy. What should I fix?

import csv
import requests
from bs4 import BeautifulSoup


outfile = open("dataaa.csv","w",newline='')
writer = csv.writer(outfile)


page = 0
while page < 3 :
    url = "http://ciumi.com/cspos/barcode-ritel.php?page={:d}".format(page)
    response = requests.get(url)
    tree = BeautifulSoup(response.text, 'html.parser')
    page += 1
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("tr")]
    for row_data in table_tag.select("td")]
    for data in tab_data:
        writer.writerow(data)
        print(table_tag)
        print(response, url, ' '.join(data))


import fileinput
seen = set() 
for line in fileinput.FileInput('dataaa.csv', inplace=1):
    if line in seen: continue

    seen.add(line)
    print (line)

What do I need to improve to get a clean result?

You can use pandas to simplify this. Pandas uses BeautifulSoup under the hood to parse the tables, by the way:

import pandas as pd

results_df = pd.DataFrame()
for page in range(1,3):
    url = 'http://ciumi.com/cspos/barcode-ritel.php?page=%s' %page
    results_df = results_df.append(pd.read_html(url)[0], sort=True)

results_df.columns = ['Barcode', 'Nama Produk']
results_df = results_df.reset_index(drop=True)

results_df.to_csv('dataaa.csv', index=False)
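Note that `DataFrame.append` was removed in pandas 2.0, so with a recent pandas the accumulation loop above needs `pd.concat` instead. A minimal sketch of the same pattern (the two small frames below stand in for the per-page `pd.read_html(url)[0]` results, so it runs without network access):

```python
import pandas as pd

# Stand-ins for the frames pd.read_html(url)[0] would return per page
page_frames = [
    pd.DataFrame([['8992694242533', 'ZWITSAL SOAP 80G PACK 4']]),
    pd.DataFrame([['8992694247163', 'ZWITSAL SOAP 80G MILK&HONEY']]),
]

# pd.concat replaces the removed DataFrame.append (pandas >= 2.0)
results_df = pd.concat(page_frames, ignore_index=True)
results_df.columns = ['Barcode', 'Nama Produk']

results_df.to_csv('dataaa.csv', index=False)
```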

Output:

print (results_df)
          Barcode                         Nama Produk
0   8992694242533             ZWITSAL SOAP 80G PACK 4
1   8992694247163         ZWITSAL SOAP 80G MILK&HONEY
2   8992694242502            ZWITSAL SOAP 80G CLASSIC
3   8992694245435  ZWITSAL SKIN GUARD LOT 100ML SPRAY
4   8992694246074               ZWITSAL SHP 600ML C&R
5   8992694242908             ZWITSAL SHP 50ML REBORN
6   8992694020025          ZWITSAL SHP 500ML REF AVKS
7   8992694246333           ZWITSAL SHP 500ML C&R REF
8   8992694246364              ZWITSAL SHP 300ML AVKS
9   8992694246319       ZWITSAL SHP 250ML REF CLEAN&R
10  8992694246357          ZWITSAL SHP 250ML REF AVKS
11  8992694242922            ZWITSAL SHP 200ML REBORN
12  8992694242915           ZWITSAL SHP 100ML CLASSIC
13  8992694246340              ZWITSAL SHP 100ML AVKS
14  8992694242601          ZWITSAL PWD 50G SOFTFLOWER
15  8992694244254               ZWITSAL PWD 50G FRESH
16  8992694242656         ZWITSAL PWD 500G SOFTFLORAL
17  8992694241055            ZWITSAL PWD 500G FRESH F
18  8992694244056        ZWITSAL PWD 300G SOFT FLORAL
19  8992694244513         ZWITSAL PWD 300G MILK&HONEY

It looks like the pages start at 1, so my range loop starts there. You can then use a Session object for the efficiency of re-using a connection. If you choose your css selectors wisely, all the filtering can be done at that level, and you then only handle the required elements you retrieved. You can also use the lighter csv module rather than the heavier pandas import.

Requires bs4 4.7.1+ in order to leverage the :has pseudo-class selector.


Quick explanation:

The following selects the first column (the barcodes) by targeting only the elements matched by the type selector center:

soup.select('center')

Then

soup.select('td:has(center) + td')

selects the second column by using the adjacent sibling combinator to get the table cell (td) immediately to the right of a table cell that has a center child element.

The .text of each retrieved tag is extracted and stripped in a list comprehension; the two resulting lists are zipped together, converted back to a list, and appended to the final results list, which is later looped over to write the csv.
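The selector logic can be seen on a small standalone snippet (hypothetical HTML mimicking the page's table structure; requires bs4 4.7.1+ for `:has`):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the scraped page's table markup
html = """
<table>
  <tr><td><center>8992694242533</center></td><td>ZWITSAL SOAP 80G PACK 4</td></tr>
  <tr><td><center>8992694247163</center></td><td>ZWITSAL SOAP 80G MILK&amp;HONEY</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# First column: only the center elements hold the barcodes
barcodes = [i.text.strip() for i in soup.select('center')]
# Second column: the td immediately to the right of a td with a center child
names = [i.text.strip() for i in soup.select('td:has(center) + td')]

rows = list(zip(barcodes, names))
```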

The css selectors are kept minimal to allow for faster matching.


import requests, csv
from bs4 import BeautifulSoup as bs

results = []

with requests.Session() as s:
    for page in range(1,4):   #pages start at 1 and assuming you actually want first 3
        r = s.get(f'http://ciumi.com/cspos/barcode-ritel.php?page={page}')
        soup = bs(r.content, 'lxml')
        results += list(zip([i.text.strip() for i in soup.select('center')] , [i.text.strip() for i in soup.select('td:has(center) + td')]))

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Barcode','Nama Produk'])
    for line in results:
        w.writerow(line)
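For completeness, the immediate bug in the question's code is that the list comprehension is inverted: it iterates over the td cells and then looks for tr rows inside them. Iterating the rows first, then the cells of each row, yields one clean csv row per tr. A minimal sketch, run against a hypothetical table snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the scraped page
html = ("<table><tr><td>8992694242533</td><td>ZWITSAL SOAP 80G PACK 4</td></tr>"
        "<tr><td>8992694247163</td><td>ZWITSAL SOAP 80G MILK&amp;HONEY</td></tr></table>")
table_tag = BeautifulSoup(html, 'html.parser').select("table")[0]

# Iterate rows first, then the cells of each row
tab_data = [[cell.get_text(strip=True) for cell in row.select("td")]
            for row in table_tag.select("tr")]
# Each inner list is now one csv row: [barcode, nama produk]
```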

Further reading:

  1. css selectors