How to iterate over csv rows to extract text from URLs using pandas

Question

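I have a csv of a bunch of news articles, and I'm hoping to use the newspaper3k package to extract the body text from those articles and save them as txt files. I want to create a script that iterates over every row in the csv, extracts the URL, extracts the text from the URL, and then saves it as a uniquely named txt file. Does anyone know how I might do this? I'm a journalist who is new to Python, so apologies if this is simple.

I only have the code below. Before figuring out how to save each body text as a txt file, I figured I should try to get the script to print the text from each row in the csv.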

import newspaper as newspaper
from newspaper import Article
import sys as sys
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

data.head()

for index,row in data.iterrows():
    article_name = Article(url=['link'], language='en')
    article_name.download()
    article_name.parse()
    print(article_name.text)

Answer 1

Since all the URLs are in the same column, it is easier to access that column directly with a for loop. I will go through some explanation here:

# access your specific URL column directly
from newspaper import Article
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

for x in data['url_column_name']:  # replace 'url_column_name' with the actual column name in your df
    article = Article(x, language='en')  # x is the URL in each row of the column
    article.download()
    article.parse()
    # open a file named after the article title (could be long, and may contain
    # characters such as '/' that are not valid in file names)
    f = open(article.title, 'w')
    f.write(article.text)
    f.close()

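I have not tried this package before, but from reading the posted tutorial, this seems like it should work. Generally, you access the URL column in your dataframe with the line for x in data['url_column_name']: and you replace 'url_column_name' with the actual name of the column.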
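To make the column access concrete, here is a small sketch that compares looping over a column with looping over rows. It assumes the URL column is called 'link' (the name that appears in the question's code), and the two-row CSV is made up for illustration so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for candidate_coverage.csv; the column names are assumptions.
csv_text = (
    "headline,link\n"
    "First story,https://example.com/a\n"
    "Second story,https://example.com/b\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Looping over the column yields one URL string per row.
urls_from_column = [url for url in data["link"]]

# iterrows() also works, but the URL has to be looked up on each row.
urls_from_rows = [row["link"] for index, row in data.iterrows()]

print(urls_from_column)
print(urls_from_rows)
```

Both loops visit the same URLs in the same order. The column version is shorter, and it avoids passing anything other than the URL string to Article.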

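Then x will be the URL in the first row, so you pass it to Article (according to the tutorial, you do not need to put brackets around x). It will first download and parse the article, then open a file named after the article's title, write the text to that file, and close the file.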
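Since the question asks for uniquely named txt files, the download, parse and write steps could be wrapped roughly as follows. This is only a sketch under a few assumptions: safe_filename, fetch_with_newspaper and save_articles are hypothetical helper names, the file name is built from the row position plus a cleaned-up title, and a URL that fails to download is skipped instead of stopping the loop.

```python
import re
from pathlib import Path


def safe_filename(title, index):
    """Build a unique, filesystem-friendly .txt name from a title and a row position."""
    # Replace every run of characters that are not letters or digits with "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title or "").strip("_")[:80]
    return f"{index:04d}_{slug or 'untitled'}.txt"


def fetch_with_newspaper(url):
    """Download and parse one article with newspaper3k; returns (title, text)."""
    # Imported here so the other helpers can be used without newspaper3k installed.
    from newspaper import Article

    article = Article(url, language="en")
    article.download()
    article.parse()
    return article.title, article.text


def save_articles(urls, out_dir, fetch=fetch_with_newspaper):
    """Write one txt file per URL into out_dir and return the paths that were written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        try:
            title, text = fetch(url)
        except Exception as error:  # one bad URL should not end the whole run
            print(f"Skipping {url}: {error}")
            continue
        target = out_path / safe_filename(title, index)
        target.write_text(text, encoding="utf-8")
        saved.append(target)
    return saved
```

With the dataframe from the answer, a call might look like save_articles(data['url_column_name'], 'articles'), where 'articles' is an arbitrary output folder. Prefixing the row position keeps names unique even when two articles share a title.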

It will then do the same for the second x and the third x, all the way until it runs out of URLs.

Hope this helps!

python

pandas

python-newspaper