Problems crawling wordreference

Question

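I'm trying to scrape wordreference, but without success.

The first issue I ran into is that a large part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the page source.

For example, I want to extract the first two meanings of a given word, so for this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.

This is my code: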

import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[@class="ToWrd"]/text()')

for i in trans:
    print i

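The result is that I get an empty list.

I also tried crawling it with scrapy, with no success. I'm not sure what is going on. The only way I've been able to scrape it is with curl, but that is clumsy and I'd like to do it in an elegant way with Python.

Thank you very much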

Answer 1

It looks like you need to send a User-Agent header; see Changing user agent on urllib2.urlopen.
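Setting the header can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3, where urllib2's functionality lives in urllib.request. The User-Agent string and the helper names build_request and fetch are illustrative assumptions, not values the site is known to require.

```python
import urllib.request

# Hypothetical User-Agent value; any browser-like string could be tried here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes, sending the custom User-Agent."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

The bytes returned by fetch(url) could then be passed to lh.fromstring in place of the urllib2 call from the question.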

Alternatively, simply switching to requests does the trick (by default it automatically sends a python-requests/version User-Agent):

import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'

# requests sends its own python-requests/<version> User-Agent by default
response = requests.get(url)
doc = lh.fromstring(response.content)

# text of every translation cell (td elements with class "ToWrd")
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)

Prints:

grulla 
grúa 
plataforma 
...
grulla blanca 
grulla trompetera
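
The question asks for only the first two meanings, and the printed strings carry trailing whitespace. A minimal sketch of trimming and slicing the XPath result follows, assuming trans is the list of strings returned by doc.xpath above. The helper name first_meanings is hypothetical, and the sample list only mirrors the first lines of the output.

```python
def first_meanings(texts, n=2):
    """Return the first n non-empty strings, stripped of surrounding whitespace."""
    cleaned = [t.strip() for t in texts]
    # Drop entries that were whitespace only, then keep the first n
    return [t for t in cleaned if t][:n]


# Stand-in for the list produced by doc.xpath('//td[@class="ToWrd"]/text()')
sample = ["grulla ", "grúa ", "plataforma "]
print(first_meanings(sample))  # ['grulla', 'grúa']
```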

python

xpath

lxml

web-crawler

web-scraping