Weird encoding file format outputted by BeautifulSoup
I want to access and scrape the data from this link, where:
new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'
Accessing new_url via the following line
req = Request(new_url, headers={'User-Agent': 'Mozilla/5.9'})
produces the error:
Webscraping: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
So I drafted a new set of lines:
import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup as soup

req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())
Although no error is thrown, print(page_soup.prettify()) outputs text in an unrecognizable format:
6�>�.�t1k�e�LH�.��]WO�?m�^@�
څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2�
�7�F8�XA��W\�ɚ��^8w��38�@'
SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra�
;o�Q�a@ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$�
n�&9u%ҵ*b���w�j�V��P�D�'z[��������)
along with the warning:
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I suspected this could be resolved by decoding as utf-8, as follows:
req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
    print(page_soup.prettify())
But running it returns the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
May I know where the problem lies? Any insight is appreciated.
Try using the requests library:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.get_text())
You can still use cookies here; a requests.Session keeps them across requests automatically.
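For instance, a quick check of what the session collected (run inside the with block above):

# inspect the cookies the session gathered during the GET
print(s.cookies.get_dict())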
Edit: updated the code to show the use of headers, which tells the website you are a browser rather than a program. For anything beyond that, such as logging in, I would suggest selenium instead of requests; a sketch follows.
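A minimal sketch of that selenium route, assuming a matching ChromeDriver is installed on PATH (the login steps themselves are site-specific, so they are left as a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes ChromeDriver is available on PATH
driver.get(new_url)          # a real browser session: follows redirects, runs JS, keeps cookies
# ...perform any site-specific login steps here...
page_soup = BeautifulSoup(driver.page_source, 'lxml')
print(page_soup.get_text())
driver.quit()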
If you want to use the urllib library, remove Accept-Encoding from the headers (and, for simplicity, specify only utf-8 in Accept-Charset):
req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'utf-8;q=0.7,*;q=0.3','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
The result is:
<!DOCTYPE html>
<!-- Form Name: START -->
<html lang="en">
<!-- Template_Component_Name: id.start.vm -->
<head>
<meta charset="utf-8"/>
...etc.
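As an aside, the 0x8b in your UnicodeDecodeError is the giveaway: gzip streams start with the magic bytes 0x1f 0x8b, so what you fed to BeautifulSoup was a still-compressed response body, which urllib does not decompress for you. If you prefer to keep Accept-Encoding in the headers, a sketch of decompressing manually (reusing req, opener, and the soup alias from your snippet):

import gzip

raw = opener.open(req).read()
# gzip payloads begin with the magic bytes 0x1f 0x8b, the same 0x8b
# that appeared at position 1 in the UnicodeDecodeError
if raw[:2] == b'\x1f\x8b':
    raw = gzip.decompress(raw)
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())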