I want to get the hostname and path separately from an href tag in Python
I have Python code from which I want to get the hostname and path separately. For example, for www.whosebug.com/questions/ask I want a result like "host name is: www.whosebug.com and path is: /questions/ask".
Here is my Python code:
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
import socket
import errno
import io
from nyt4 import articalText
url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False)
htmltext = br.open(url)
#htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)
maindiv = soup.findAll('section', attrs={'class':'health-collection collection'})
for links in maindiv:
    atags = soup.findAll('a',href=True)
    for link in atags:
        alinks= link.get('href')
        print alinks.hostname
        print alinks.path
But this code gives me this error:
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    execfile("nytimes/test2.py")
  File "nytimes/test2.py", line 21, in <module>
    print alinks.hostname
AttributeError: 'unicode' object has no attribute 'hostname'
alinks = link.get('href')
sets alinks to a plain string, which has no hostname or path attribute. You can use urlparse to get the path and hostname:
import mechanize
from bs4 import BeautifulSoup
from urlparse import urlparse  # Python 2; in Python 3 use: from urllib.parse import urlparse

url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False)
htmltext = br.open(url)
#htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext, "html.parser")
maindiv = soup.find_all('section', attrs={'class':'health-collection collection'})
for links in maindiv:
    # Search inside the matched section only; soup.find_all would rescan the whole page
    atags = links.find_all('a', href=True)
    for link in atags:
        # urlparse returns a result object that exposes .hostname and .path
        alinks = urlparse(link.get('href'))
        print alinks.hostname
        print alinks.path
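One caveat with the approach above: urlparse only fills in hostname when the string has a scheme (or starts with //), so a relative href or a bare string like www.whosebug.com/questions/ask comes back with hostname None. Here is a minimal sketch of one way to handle both cases. The helper names split_href and split_bare_url and the sample href "/section/science" are made up for illustration, and the import fallback is only there so the snippet runs on either Python 2 or Python 3.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def split_href(href, base_url):
    """Resolve href against the page URL, then return (hostname, path)."""
    parts = urlparse(urljoin(base_url, href))
    return parts.hostname, parts.path


def split_bare_url(text):
    """Split a scheme-less string such as 'www.example.com/a/b'.

    Assumption: the text is a host followed by a path, not a relative link.
    """
    if "://" not in text and not text.startswith("//"):
        text = "//" + text  # lets urlparse treat the first segment as the host
    parts = urlparse(text)
    return parts.hostname, parts.path


base = "http://www.nytimes.com/section/health"  # page URL from the question
# "/section/science" is a hypothetical relative href used only as sample input
for href in ["/section/science", "http://www.whosebug.com/questions/ask"]:
    host, path = split_href(href, base)
    print("host name is: %s and path is: %s" % (host, path))

host, path = split_bare_url("www.whosebug.com/questions/ask")
print("host name is: %s and path is: %s" % (host, path))
```

In the scraping loop you would probably call split_href(link.get('href'), url) so that relative links on the page still report the site's hostname.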