Beautifulsoup turning <link> into <link/>
I am trying to download and parse the text of some RSS feeds, for example http://rss.sciencedirect.com/publication/science/03043878. Here is a simple example:
import urllib.request
import urllib.parse
import requests
from bs4 import BeautifulSoup

def main():
    soup = BeautifulSoup(urllib.request.urlopen('http://rss.sciencedirect.com/publication/science/03043878'), "html.parser").encode("ascii")
    print(soup)

if __name__ == '__main__':
    main()
In the original HTML (if you look at the site directly), each link is preceded by <link> and followed by </link>. But in what BeautifulSoup prints out, <link> is replaced with <link/> and </link> is removed entirely. Any idea what I might be doing wrong, or is this a bug?
PS I tried changing the encoding to utf-8, but it still happens.
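For reference, the difference between the two parsers can be reproduced on a small inline snippet without touching the network; the sample string below is made up for illustration, and the xml parser requires lxml to be installed:

from bs4 import BeautifulSoup

sample = "<item><link>http://example.com/article</link></item>"

# html.parser treats <link> as a void HTML element: the tag is self-closed,
# the closing tag is dropped, and the URL becomes loose text.
print(BeautifulSoup(sample, "html.parser"))
# -> <item><link/>http://example.com/article</item>

# The xml parser keeps the element and its text intact.
print(BeautifulSoup(sample, "xml"))
# -> <?xml version="1.0" encoding="utf-8"?> <item><link>http://example.com/article</link></item>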
The parser cannot evaluate the links correctly: html.parser treats <link> as a void (self-closing) HTML tag, so it drops the closing tag and leaves the text outside the element. For this problem you should use xml as your parser instead of html.parser.
soup = BeautifulSoup(urllib.request.urlopen('http://rss.sciencedirect.com/publication/science/03043878'),"xml")
print(len(soup.find_all("link")))
This outputs 52 links.
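If you want the URLs themselves rather than the whole tags, a small follow-up sketch using the same soup object; get_text(strip=True) just trims the whitespace the feed puts around the URL:

# Print the URL contained in each <link> element.
for link in soup.find_all("link"):
    print(link.get_text(strip=True))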
You are parsing RSS, and RSS is XML, so pass features="xml" to the BeautifulSoup constructor.
import urllib.request
from bs4 import BeautifulSoup

def main():
    doc = BeautifulSoup(urllib.request.urlopen('http://rss.sciencedirect.com/publication/science/03043878'), "xml")

    # If you want to print it as ascii (as per your original post).
    print(doc.prettify('ascii'))

    # To write it to a file as ascii (as per your original post).
    with open("ascii.txt", "wb") as file:
        file.write(doc.prettify('ascii'))

    # To write it to a file as utf-8 (as the original RSS).
    with open("utf-8.txt", "wb") as file:
        file.write(doc.prettify('utf-8'))

    # If you want to print the links.
    for item in doc.find_all('link'):
        print(item)

if __name__ == '__main__':
    main()
Output in the files and in the terminal:
... <link>
http://rss.sciencedirect.com/action/redirectFile?&zone=main&currentActivity=feed&usageType=outward&url=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%3F_ob%3DGatewayURL%26_origin%3DIRSSSEARCH%26_method%3DcitationSearch%26_piikey%3DS0304387817300512%26_version%3D1%26md5%3D16ed8e2672e8048590d3c41993306b0f
</link> ...
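If the goal is to collect the entries rather than just print them, here is a hedged sketch along the same lines that pairs each item's title with its link; the child tag names (title and link inside item) are assumed from the standard RSS 2.0 layout:

def extract_entries(doc):
    """Return (title, url) pairs for every <item> in the parsed feed."""
    entries = []
    for item in doc.find_all('item'):
        title = item.find('title')
        link = item.find('link')
        if title is not None and link is not None:
            entries.append((title.get_text(strip=True), link.get_text(strip=True)))
    return entries

# Example usage with the doc object from the answer above:
# for title, url in extract_entries(doc):
#     print(title, '->', url)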