How to avoid an error while running a scheduler for web scraping?

Question

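I need to collect some data by scraping (legally), so for now I am testing the script on my own landing page. The goal is to fetch specific text from a tag (in my example it is just one sentence) once every 3 hours. For testing I run the code every 1 second (so I expect to see 5 lines of "СОСТАВЛЯЕМ СМЕТЫ" after 5 seconds). But the code writes the phrase only once and then fails with an error.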

import schedule
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

mf = open(r"C:\Users\Admin\Desktop\huyandex.txt", 'a')  # raw string so \U is not read as an escape


def job():
    html = urlopen("https://smeta-spb.com/")
    #print(html.read())
    bsObj = BeautifulSoup(html)
    nameList = bsObj.findAll("h1")
    #print(len(nameList))
    for name in nameList:
        mf.write(name.get_text())
        mf.write('\n')
    mf.close()
    
schedule.every(5).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

But I get this error:

I/O operation on closed file.

How can I change the code so that I can write the content to the file?

Answer 1

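The file is opened once at the top of the script, but job() closes it at the end of its first run, so every later run tries to write to a handle that is already closed. Remove the open statement from the top of the script and move it inside the function, so that each run opens the file again. Your code could then look like this: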

import schedule
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup


def job():
    # Moved inside the function: each run opens the file again,
    # and the with block closes it when the run finishes.
    with open("huyandex.txt", 'a', encoding="utf-8") as mf:
        html = urlopen("https://smeta-spb.com/")
        # Name the parser explicitly so bs4 does not have to guess one.
        bsObj = BeautifulSoup(html, "html.parser")
        nameList = bsObj.find_all("h1")
        for name in nameList:
            mf.write(name.get_text())
            mf.write('\n')


# Run the job every 5 seconds.
schedule.every(5).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
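
If it helps to see the cause in isolation, the failure can be reproduced without the network or the schedule package. The sketch below is only an illustration using the standard library: `broken_job`, `append_lines`, and the temporary `demo.txt` file are hypothetical names that are not part of the original code, and plain function calls stand in for the scheduler.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Same shape as the question's code: one handle opened at module level.
shared = open(path, "a", encoding="utf-8")


def broken_job():
    shared.write("СОСТАВЛЯЕМ СМЕТЫ\n")
    shared.close()  # the handle stays closed for every later call


def append_lines(file_path, lines):
    """Open the file, append each line, and close it again on exit."""
    with open(file_path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


broken_job()  # first run succeeds and closes the file
try:
    broken_job()  # second run writes to the closed handle
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Opening the file on every run works no matter how often it is called.
for _ in range(3):
    append_lines(path, ["СОСТАВЛЯЕМ СМЕТЫ"])

with open(path, encoding="utf-8") as f:
    line_count = len(f.read().splitlines())

print(error_message)
print(line_count)
```

The second call to `broken_job` raises the same ValueError as in the question, while the per-call `open` keeps appending lines.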

python

scheduler

web-scraping
