Scrape data from multiple webpages using a .txt file that contains the URLs, with Python and Beautiful Soup
I have a .txt file that contains the full URLs of many pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I add a loop and read the URLs from the .txt file, I get the following error:
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?
Here is my code:
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()
    for url in urls:
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)
f.close()
I checked my .txt file and all the entries are fine. They start with HTTP: and end with .html, and there are no apostrophes or quotation marks around them. Am I coding the for loop incorrectly?
Using
with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
I get the following:
??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html
And so on for 100 lines. Only the first line has the question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.
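(A quick way to pin down those leading question marks is to inspect the file's first raw bytes; this diagnostic snippet is an illustration added here, not part of the original question. If the file was saved with a UTF-8 byte-order mark, it would show up as '\xef\xbb\xbf' before the 'http'.)

# Read the first line in binary mode and show its raw bytes;
# a UTF-8 BOM appears as '\xef\xbb\xbf' at the very start.
with open('urls.txt', 'rb') as f:
    print(repr(f.readline()))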
You can't use 'f.read()' to read the entire file into one string and then iterate over that string: iterating a string yields single characters, not lines, so urlopen() receives one character at a time. To fix this, see the changes below. I also removed your last line; when you use a 'with' statement, it closes the file when the block completes.
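As a minimal illustration of the difference (an addition for clarity, using made-up example URLs):

# Iterating over a string visits it one character at a time, so each
# pass of the loop would hand urlopen() a single character like 'h' or '?'.
text = "http://example.com/a.html\nhttp://example.com/b.html\n"
for ch in text[:4]:
    print(ch)        # prints h, t, t, p on separate lines

# Iterating over the open file object (or over f.readlines()) yields lines.
for line in text.splitlines():
    print(line)      # prints one full URL per pass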
Code from Greg Hewgill (for Python 2) shows whether the url string's type is 'str' or 'unicode':
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)
Running the code against a text file containing the URLs listed above produces this output:
http://www.thegreenpapers.com/PCC/AL-D.html
ordinary string
Gore, Al
54. 84%
Uncommitted
10. 16%
LaRouche, Lyndon
http://www.thegreenpapers.com/PCC/AL-R.html
ordinary string
Bush, George W.
44. 100%
Keyes, Alan
Uncommitted
http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13. 68%
Uncommitted
6. 32%
Bradley, Bill
The approach you tried can be fixed by adjusting two different lines in your code.
Try this:
with open('urls.txt', 'r') as f:
    urls = f.readlines()    # make sure this line is properly indented
for url in urls:
    uClient = urlopen(url.strip())
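For reference, a minimal sketch that puts both changes together (Python 2, matching the code above). The 'utf-8-sig' codec is an addition beyond this answer: it is one way to strip the UTF-8 byte-order mark that most likely produced the '??' in front of the first URL, assuming that is indeed what those characters were.

import io
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# 'utf-8-sig' strips a leading UTF-8 byte-order mark if one is present
# (an assumption about the source of the '??' on the first line).
with io.open('urls.txt', 'r', encoding='utf-8-sig') as f:
    urls = f.readlines()

for url in urls:
    uClient = urlopen(url.strip())   # strip() removes the trailing newline
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    for container in page_soup.findAll("tr", {"class": "data"}):
        name = container.findAll("th", {"width": "30%"})[0].text.strip()
        delegate = container.findAll("td", {"id": "y000"})[0].text.strip()
        print(name)
        print(delegate)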