Scrape data from multiple webpages using a .txt file that contains the URLs with Python and beautiful soup

I have a .txt file that contains the full URLs to a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I try to add a loop and read the URLs in from the .txt file, I get the following error:

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here is my code:

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()

I checked my .txt file and all the entries look fine. They begin with http: and end with .html, and there are no apostrophes or quotes around them. Is my for loop coded incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following:

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html

And so on for 100 lines. Only the first line has the question marks. My .txt file contains exactly those URLs, with only the state and party abbreviations changing.

You can't read the entire file into a single string with 'f.read()' and then iterate over that string — that iterates character by character, not line by line. See the changes below for the fix. I also removed your last line: when you use a 'with' statement, the file is closed automatically when the block finishes.
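The difference is easy to see in isolation: iterating over the string returned by f.read() visits one character at a time, while iterating over the open file (or over readlines()) visits one line at a time. A minimal sketch, using an in-memory string with placeholder example.com URLs in place of urls.txt:

```python
# A string standing in for the full contents of urls.txt.
data = "http://example.com/a.html\nhttp://example.com/b.html\n"

# Iterating the whole-file string yields single characters --
# so urlopen() was being handed one character, not a URL.
first_item = next(iter(data))
print(first_item)            # 'h'

# Splitting into lines (like iterating the open file object)
# yields whole URLs instead.
for line in data.splitlines():
    print(line)
```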

The code from Greg Hewgill below (for Python 2) shows whether the url string's type is 'str' or 'unicode'.

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

Running this code against a text file containing the URLs listed above produces the following output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill
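As an aside on the '??' you saw on the first line only: that pattern is typical of a byte-order mark (BOM) left at the start of a file saved as "UTF-8 with BOM", and it would also explain the 'unknown url type: ?' error. A hedged sketch (in Python 3 notation, assuming the marker really is a UTF-8 BOM) of stripping it before calling urlopen:

```python
# Simulated first line of a file saved with a UTF-8 BOM ("\ufeff").
raw = "\ufeffhttp://www.thegreenpapers.com/PCC/AL-D.html\n"

# Strip the BOM and the trailing newline before handing the URL on.
clean = raw.lstrip("\ufeff").strip()
print(clean)                 # the bare URL, safe to pass to urlopen()
```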

The approach you originally tried can also be made to work by adjusting two different lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()   #make sure this line is properly indented.
for url in urls:
    uClient = urlopen(url.strip())