I am trying to extract data from a website in Python

Question

def convert():
    for url in url_list:
        news=Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title':news.title, 'Text': news.text.replace('\n',' '), 'Date':news.publish_date, 'Author':news.authors}
        )

convert()
df = pd.DataFrame.from_dict(l)
df.to_csv('Amazon_try2'+'.csv',encoding='utf-8', index=False)

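The function convert() iterates over a list of URLs and processes each one. Each URL is a link to an article. I fetch the important attributes of each article, such as the author and the text, and store them in a data frame. Afterwards I convert the data frame to a CSV file. The script ran for about 5 hours because url_list contains 589 URLs, but I still do not get the CSV file. Can anyone spot where I am going wrong?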

Answer 1

Assuming this is your whole program, you need to return l from convert.

import time

import pandas as pd
from newspaper import Article

def convert():
    l = []  # collect the rows locally so the function can return them
    for url in url_list:  # url_list is defined elsewhere, as in the question
        news = Article(url)
        news.download()
        while news.download_state != 2:
            time.sleep(1)
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame(l)
df.to_csv('Amazon_try2.csv', encoding='utf-8', index=False)
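To show the "return the list" idea on its own, here is a minimal sketch that runs without any network access. The `fetch` parameter and the `fake_fetch` helper are hypothetical names I made up for illustration; in the real script the fetch step would be the download-and-parse code from the question.

```python
from typing import Callable, Dict, Iterable, List


def convert(urls: Iterable[str], fetch: Callable[[str], Dict]) -> List[Dict]:
    """Build and return a fresh list of rows, one per URL."""
    rows = []  # local to this call, so nothing depends on a global list
    for url in urls:
        rows.append(fetch(url))
    return rows


def fake_fetch(url: str) -> Dict:
    """Hypothetical stand-in for the download-and-parse step."""
    return {'Title': 'title for ' + url, 'Text': 'body', 'Date': None, 'Author': []}


rows = convert(['https://example.com/a', 'https://example.com/b'], fake_fetch)
print(len(rows))
```

Passing the URLs in and getting the rows back means the caller decides what to do with the result, for example handing it to pd.DataFrame as in the code above.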

Answer 2

Your function probably gets stuck here:

    while news.download_state != 2:
        time.sleep(1)

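It is waiting for the download state to change, but that never happens (for example, when a download fails, the state never reaches 2). Your function should also return a list.

Something like this should work: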

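import pandas as pd
from newspaper import Article

def convert():
    l = []  # collect the rows locally so the function can return them
    for url in url_list:  # url_list is defined elsewhere, as in the question
        news = Article(url)
        news.download()  # no polling loop: download, then parse
        news.parse()
        l.append(
            {'Title': news.title, 'Text': news.text.replace('\n', ' '), 'Date': news.publish_date, 'Author': news.authors}
        )
    return l

l = convert()
df = pd.DataFrame(l)
df.to_csv('Amazon_try2.csv', encoding='utf-8', index=False)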
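One caveat with dropping the loop: as far as I know, newspaper's parse() raises an exception when the download did not succeed, so a single bad URL would abort the whole run. Below is a rough sketch, not a definitive implementation, that skips failing URLs and writes each row to the CSV as soon as it is available, so a crash after several hours does not lose everything. It uses the standard csv module instead of pandas to keep the example self-contained. The names `fetch_article`, `convert_to_csv`, `stub_fetch` and `demo_path` are my own, and the demo at the bottom uses an offline stub rather than real downloads.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, Iterable, List, Tuple

FIELDS = ['Title', 'Text', 'Date', 'Author']


def fetch_article(url: str) -> Dict:
    """Download and parse one article with newspaper (imported lazily)."""
    from newspaper import Article  # third-party; only needed for real runs

    news = Article(url)
    news.download()
    news.parse()  # expected to raise if the download failed
    return {
        'Title': news.title,
        'Text': news.text.replace('\n', ' '),
        'Date': news.publish_date,
        'Author': news.authors,
    }


def convert_to_csv(
    urls: Iterable[str],
    path: str,
    fetch: Callable[[str], Dict] = fetch_article,
) -> List[Tuple[str, str]]:
    """Write one CSV row per article as it arrives; return (url, error) pairs."""
    failed = []
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for url in urls:
            try:
                row = fetch(url)
            except Exception as exc:  # broad on purpose: one bad URL should not stop the run
                failed.append((url, str(exc)))
                continue
            writer.writerow(row)
            handle.flush()  # push the row out so partial results survive a crash
    return failed


def stub_fetch(url: str) -> Dict:
    """Hypothetical offline stand-in used only to show the control flow."""
    if url.endswith('/broken'):
        raise RuntimeError('download failed')
    return {'Title': url, 'Text': 'body', 'Date': '', 'Author': 'someone'}


demo_path = os.path.join(tempfile.mkdtemp(), 'Amazon_try2.csv')
failed = convert_to_csv(
    ['https://example.com/ok', 'https://example.com/broken'],
    demo_path,
    fetch=stub_fetch,
)
print(failed)
```

For the real run you would call convert_to_csv(url_list, 'Amazon_try2.csv') and inspect the returned list to see which URLs were skipped and why.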


Tags: python, dataframe, web-scraping, python-newspaper