Error accessing an FTP site with BeautifulSoup and ftplib
I am trying to access a page to download some data like this:
import requests
from bs4 import BeautifulSoup

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
s = requests.session()
page = BeautifulSoup(s.get(download_url).text, "lxml")
But this returns:
Traceback (most recent call last):
File "<ipython-input-271-59c5b15a7e34>", line 1, in <module>
r = requests.get(download_url)
File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 612, in send
adapter = self.get_adapter(url=request.url)
File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 703, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found for 'ftp://nomads.ncdc.noaa.gov/NARR_monthly/'
even though the site is up.
Normally, if this worked, I would loop through each link:
for a in page.find_all('a', href=True):
    file = a['href']
    print(file)
I also tried this:
import ftplib
ftp = ftplib.FTP(download_url)
But this returns:
File "<ipython-input-284-60bd19e600fe>", line 1, in <module>
ftp = ftplib.FTP(download_url)
File "/anaconda3/lib/python3.6/ftplib.py", line 117, in __init__
self.connect(host)
File "/anaconda3/lib/python3.6/ftplib.py", line 152, in connect
source_address=self.source_address)
File "/anaconda3/lib/python3.6/socket.py", line 704, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/anaconda3/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
gaierror: [Errno 8] nodename nor servname provided, or not known
Unfortunately requests does not support FTP links, but you can use the built-in urllib module:
import urllib.request

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
with urllib.request.urlopen(download_url) as r:
    data = r.read()
print(data)
The response is not HTML, so you cannot parse it with BeautifulSoup, but you can use regular expressions or string manipulation:
links = [
    download_url + line.split()[-1]
    for line in data.decode().splitlines()
]
for link in links:
    print(link)
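The list comprehension above relies on each line of the FTP directory listing ending with the file name. A minimal sketch of that parsing against a single sample LIST line (the permissions, size, date, and file name below are made up for illustration):

```python
# A typical Unix-style FTP LIST line: permissions, link count, owner,
# group, size, date, and finally the file name.
sample = "-rw-r--r--   1 ftp  ftp   1048576 Jan 01  2018 NARR197901"

download_url = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"

# str.split() with no arguments collapses runs of whitespace,
# so the last field is always the file name.
name = sample.split()[-1]
link = download_url + name
print(link)  # ftp://nomads.ncdc.noaa.gov/NARR_monthly/NARR197901
```

This breaks if a file name contains spaces, but that is not the case for the NARR_monthly files.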
You can also use ftplib if you prefer, but you have to pass only the hostname. Then you can cd into 'NARR_monthly' and get the data:
from ftplib import FTP

with FTP('nomads.ncdc.noaa.gov') as ftp:
    ftp.login()
    ftp.cwd('NARR_monthly')
    data = ftp.nlst()

path = "ftp://nomads.ncdc.noaa.gov/NARR_monthly/"
links = [path + i for i in data]
Sometimes the host refuses connections when there are too many clients, so you may need a try-except block.
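For example, a small retry helper like this could wrap the connection (the attempt count and delay are arbitrary choices, not part of the original answer):

```python
import time
import ftplib

def with_retries(func, attempts=3, delay=1, exceptions=ftplib.all_errors):
    """Call func(), retrying when it raises one of `exceptions`.

    ftplib.all_errors covers OSError (e.g. connection refused) as well
    as temporary FTP errors such as "421 Too many users".
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts, re-raise the last error
            time.sleep(delay)

# usage with the host from the answer:
# ftp = with_retries(lambda: ftplib.FTP('nomads.ncdc.noaa.gov'))
```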