UnicodeEncodeError in Python when crawling a website with Selenium
I am trying to use Selenium to scrape paper titles from this site: http://www.ncbi.nlm.nih.gov/pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)
#coding="utf-8"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail
browser = webdriver.Firefox()
browser.get(url)
time.sleep(5)

def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]

page_start = 1
page_end = 10
f = open('titles.txt', 'a')
for page in range(page_start, page_end):
    print "page %d" % page
    page_jump_box = browser.find_element_by_class_name("num").clear()
    page_jump_box_cleared = browser.find_element_by_class_name("num")
    page_jump_box_cleared.send_keys(str(page) + Keys.RETURN)
    time.sleep(15)
    f = open('titles.txt', 'a')
    for line in extract_data(browser):
        f.write(line + '\n')
    f.close()
When I run it, I get this:
gao@gao:~/crawler$ python crawler3.0.py
page 1
page 2
page 3
page 4
Traceback (most recent call last):
  File "crawler3.0.py", line 33, in <module>
    f.write(line + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 36: ordinal not in range(128)
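When I searched on Stack Overflow, I found a similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128).
From it I learned that using str() can cause this Unicode problem, but in my code I only use str() to turn the page number into a string. So how do I fix the code?
I also have a second question. I read that to use PhantomJS with Selenium I only need to change browser = webdriver.Firefox() to browser = webdriver.PhantomJS(), but when I do that, the scraped content is repeated (only the titles from page 1 are scraped).
English is not my native language, so please let me know if there are any grammar mistakes or other errors.
Thanks in advance.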
You need to encode each line before writing it to the file:
for line in extract_data(browser):
    f.write(line.encode('utf-8') + '\n')
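The reason this helps: in Python 2, title.text is a unicode object, and writing it to a file opened with plain open() triggers an implicit encode with the ASCII codec, which fails on the first non-ASCII character (here the Greek alpha, u'\u03b1'). A minimal sketch that reproduces this without a browser; the sample title is made up for illustration and is not real PubMed output:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample title containing a Greek alpha (U+03B1).
title = u"Role of TNF-\u03b1 in inflammation"

try:
    # Equivalent to the implicit conversion Python 2 performs
    # when a unicode string is passed to f.write().
    title.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 yields bytes that can be written safely.
encoded = title.encode('utf-8')
```

The ASCII attempt fails while the UTF-8 one succeeds, which is why adding .encode('utf-8') before the write removes the error.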
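As for your second question, I would suggest the following improvements (which will also make it work):
- use Explicit Waits instead of time.sleep() calls; this will also noticeably improve performance
- don't type the page number; click the "Next" button instead
- open the file in "append" mode before the loop and use the with context manager
- close() the browser when you are done
Code: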
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail
browser = webdriver.PhantomJS()
browser.get(url)

def extract_data(browser):
    # Collect the text of every result title on the current page.
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]

page_start, page_end = 1, 10
with open('titles.txt', 'a') as f:
    for page in range(page_start, page_end):
        # Wait up to 10 seconds for the results to be present.
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.rprt p.title"))
        )
        for line in extract_data(browser):
            # Encode explicitly to avoid the implicit ASCII conversion.
            f.write(line.encode('utf-8') + '\n')
        print "page %d" % page
        # Move to the next results page.
        browser.find_element_by_css_selector("div.pagination a.next").click()
browser.close()
This produces titles.txt containing the titles from result pages 1-9:
Robotic-assisted tubal anastomosis with one-stitch technique.
Effectiveness of pictorial health warnings on cigarette packs among Lebanese school and university students.
...
Importance and globalization status of good manufacturing practice (GMP) requirements for pharmaceutical excipients.
Systemic review on drug related hospital admissions - A pubmed based search.
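As an alternative to calling .encode('utf-8') on every line, the file can be opened with io.open, which accepts an encoding argument on Python 2.6+ as well as Python 3 and takes unicode strings directly. A sketch under that assumption; write_titles and the sample titles are hypothetical names used only for illustration, and real code would pass the result of extract_data(browser) instead:

```python
import io
import os
import tempfile

def write_titles(path, titles):
    """Append unicode titles to path, one per line, encoded as UTF-8."""
    with io.open(path, 'a', encoding='utf-8') as f:
        for title in titles:
            f.write(title + u'\n')

# Hypothetical sample data standing in for scraped titles.
sample_titles = [u"Role of TNF-\u03b1 in inflammation", u"A plain ASCII title."]
out_path = os.path.join(tempfile.mkdtemp(), 'titles.txt')
write_titles(out_path, sample_titles)

# Read the file back to check that the titles round-trip unchanged.
with io.open(out_path, encoding='utf-8') as f:
    written = f.read().splitlines()
```

With this approach the encoding is declared once where the file is opened, so no individual write can fall back to the ASCII codec.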