Keep consistency between input and output in Python URL scraping from a CSV file
I need your help with the following problem.
I have a working Python script here:
from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]

        # fetch content from server
        html = requests.get(url).content

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            tags = divTag.findAll("a")
        else:
            continue

        for tag in tags:
            res = tag.get('href')
            if res != None:
                writer.writerow([res])
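Basically, what I need to change is how to keep the input and the output consistent, line by line. See below:
The idea behind all this is to get/print the redirected link: if the link works, print the link; if it does not work, print "error link" or something similar.
Sample urls.csv: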
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193; - non valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589; - valid
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093; - non valid
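One way to read the requirement above is "write exactly one output row for every input row, whether or not a link was found". A minimal sketch of that idea follows. The names `write_aligned` and `lookup` are hypothetical and not part of the original script; `lookup` stands in for the fetch-and-parse step and is assumed to return a link or `None`.

```python
import csv
import io


def write_aligned(urls, lookup, out_file):
    """Write exactly one CSV row per input URL, in input order.

    `lookup` is a hypothetical callable (not from the original script)
    that returns the scraped link for a URL, or None when nothing is found.
    """
    writer = csv.writer(out_file)
    for url in urls:
        link = lookup(url)
        # Always write a row, so line N of the output matches line N of the input.
        writer.writerow([url, link if link else 'error link'])


# Usage with an in-memory file and a stand-in lookup table.
found = {'https://example.com/a': 'https://example.com/product-a.html'}
buffer = io.StringIO()
write_aligned(['https://example.com/a', 'https://example.com/b'], found.get, buffer)
print(buffer.getvalue())
```

The point of the design is that the loop never uses `continue` to skip a row, which is what makes the original script's output drift out of step with its input.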
You just need to add more items to the list that you write with the csv writer's writerow() method:
from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        # get the url
        for url in row:
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                print(url)

                # fetch content from server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    writer.writerow([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    writer.writerow([url, '', 'missing http...'])
                    continue

                # soup fetched content
                soup = BeautifulSoup(html, 'html.parser')

                divTag = soup.find("div", {"class": "productsPicture"})

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that link is valid
                        try:
                            r = requests.get(url_sub)
                            writer.writerow([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            writer.writerow([url, url_sub, 'bad link'])
                else:
                    writer.writerow([url, '', 'no results'])
Giving you:
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193,https://www.tennis-point.com/asics-gel-game-6-all-court-shoe-men-white-silver-02013802643000.html,ok
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193,https://www.tennis-point.com/asics-gel-game-6-all-court-shoe-men-white-silver-02013802643000.html,no results
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589,https://www.tennis-point.com/asics-gel-resolution-7-clay-court-shoe-men-blue-lime-02014202831000.html,ok
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093,https://www.tennis-point.com/asics-gel-resolution-7-clay-court-shoe-men-blue-lime-02014202831000.html,no results
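The exception handling catches the case where a URL from the CSV file is invalid. You can also test whether the URL returned from a link on the page is valid. The third column gives you a status, i.e. ok, bad url, no results or bad link.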
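Note that `requests.get()` does not raise an exception for an HTTP error response such as 404, so a link test that only catches `ConnectionError` reports such a link as ok. A minimal sketch of a stricter check is below. The helper `link_status` and its `fetch_status` parameter are hypothetical names introduced here for illustration; `fetch_status` is assumed to return an HTTP status code and to raise an `OSError` subclass on a network failure.

```python
def link_status(url, fetch_status):
    """Return 'ok' or 'bad link' for a URL.

    `fetch_status` is a hypothetical callable (an assumption, not part of
    the original answer) that takes a URL and returns an HTTP status code.
    """
    try:
        code = fetch_status(url)
    except OSError:
        # Network-level failure: the link could not be fetched at all.
        return 'bad link'
    # Treat 2xx and 3xx as working; 4xx and 5xx as broken.
    return 'ok' if 200 <= code < 400 else 'bad link'


# Stand-in fetchers so the sketch runs without network access.
print(link_status('https://example.com/a', lambda u: 200))
print(link_status('https://example.com/b', lambda u: 404))
```

With requests, the callable could be something like `lambda u: requests.get(u, timeout=10).status_code`; this relies on the requests exception classes deriving from `IOError`, which is worth confirming against the installed version.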
This assumes that all of the columns in your CSV file contain URLs that need to be tested.
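If that assumption does not hold, for example when only the first `;`-separated field is a URL, as in the question's script and sample file, the rows could be reduced to their first column before testing. The following is a rough sketch only; `first_column_urls` is a hypothetical helper name and the sample data uses placeholder URLs.

```python
import csv
import io


def first_column_urls(csv_file, delimiter=';'):
    """Yield the stripped, non-empty first field of each row."""
    for row in csv.reader(csv_file, delimiter=delimiter):
        # csv.reader returns an empty list for a blank line, so guard for that.
        if row and row[0].strip():
            yield row[0].strip()


# In-memory sample: URL first, then a note after the ';' delimiter.
sample = io.StringIO(
    'https://example.com/a; - valid\n'
    '\n'
    'https://example.com/b; - non valid\n'
)
print(list(first_column_urls(sample)))
```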