尝试在 ruby 中使用 open-uri,一些 HTML 内容作为 "Loading..." 进入
Trying to use open-uri in ruby, some HTML contents are coming in as "Loading..."
我正在尝试创建一个程序来比较网页上的特定内容,然后再进行比较,我目前正在努力获取将要更改的信息。但是,如果我检查页面中的元素,会出现会改变的文本,但如果我使用 open-uri 则不会,它以 "Loading..." 的形式出现(见图),有没有办法获得所有 HTML 文字?
Picture here.
这是我目前的代码
contents = open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)
File.open("testing.txt", "w") do |line|
line.puts "\r" + "#{contents}"
end
任何帮助获取正在加载...以更改为实际 HTML 代码的方法都很棒。
谢谢
您的网页仅包含 ajax request
和 open-uri
returns 服务器端页面,它不等待 ajax 请求
您可以使用下面的代码等待页面加载
#load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
注意:您需要 chromedriver
才能使用脚本 http://chromedriver.chromium.org/downloads
如果你不想在浏览器中打开 url 那么你可以使用 headless-WebKit
问题
因此,open uri 只会发出 HTTP 请求并让您访问正文。在这种情况下,正文是 html。 html 具有此数据的占位符,这就是您所看到的。然后 html 表示加载一些 javascript ,这将向服务器发出另一个请求以获取数据,当数据进来时,它将用真实数据替换占位符。因此,要处理此问题,您最终需要从 javascript 发出的请求返回的任何内容。
三个解决方案
从我最不喜欢到我最喜欢的顺序排列。
- 您可以尝试计算 JavaScript 以使其在 html 上运行。这会很痛苦,所以我不推荐它,但如果你想走那条路,我认为有一个叫做 "the ruby racer" 的 gem 或其他东西(IIRC,它包装了 v8)。
- 您可以启动网络浏览器,让浏览器处理所有的 cray cray,然后在更新后向浏览器请求 html。这就是 Rahul 的解决方案所做的,它是一个非常好的解决方案。这不是我最喜欢的,因为它很重,而且您只能查看 html 中显示的信息。这称为 "scraping",它非常脆弱(某些设计人员在页面周围移动了一些东西,您的脚本就会中断),并且信息采用人工呈现格式,这意味着您通常必须做很多小的解析工作。
- 您可以打开浏览器的开发工具,转到网络选项卡,过滤 XHR 请求,然后重新加载页面。其中之一请求获取用于填充占位符的数据。找出它是哪一个,然后您可以自己提出该请求。这在某些方面也可能很脆弱,例如,有时您必须拥有正确的 cookie,并且您经常必须试验浏览器发送的内容以确定您需要多少(通常比发送的少得多,这是适用于您的情况)。 提示:当你这样做时,将请求数据与解析和探索数据分开(即将其保存到文件中,然后在查看数据时从文件中获取数据而不是制作数据)每次都有一个新的请求...这样它就不会改变你,你也不会受到速率限制)
解决方案 #3
所以,我很好奇,于是自己尝试了第 3 种解决方案,效果非常好,请检查一下:
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => ",972"
listing['priceString'] # => ",890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]
我正在尝试创建一个程序来比较网页上的特定内容,然后再进行比较,我目前正在努力获取将要更改的信息。但是,如果我检查页面中的元素,会出现会改变的文本,但如果我使用 open-uri 则不会,它以 "Loading..." 的形式出现(见图),有没有办法获得所有 HTML 文字?
Picture here.
这是我目前的代码
contents = open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)
File.open("testing.txt", "w") do |line|
line.puts "\r" + "#{contents}"
end
任何帮助获取正在加载...以更改为实际 HTML 代码的方法都很棒。
谢谢
您的网页仅包含 ajax request
和 open-uri
returns 服务器端页面,它不等待 ajax 请求
您可以使用下面的代码等待页面加载
#load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
注意:您需要 chromedriver
才能使用脚本 http://chromedriver.chromium.org/downloads
如果你不想在浏览器中打开 url 那么你可以使用 headless-WebKit
问题
因此,open uri 只会发出 HTTP 请求并让您访问正文。在这种情况下,正文是 html。 html 具有此数据的占位符,这就是您所看到的。然后 html 表示加载一些 javascript ,这将向服务器发出另一个请求以获取数据,当数据进来时,它将用真实数据替换占位符。因此,要处理此问题,您最终需要从 javascript 发出的请求返回的任何内容。
三个解决方案
从我最不喜欢到我最喜欢的顺序排列。
- 您可以尝试计算 JavaScript 以使其在 html 上运行。这会很痛苦,所以我不推荐它,但如果你想走那条路,我认为有一个叫做 "the ruby racer" 的 gem 或其他东西(IIRC,它包装了 v8)。
- 您可以启动网络浏览器,让浏览器处理所有的 cray cray,然后在更新后向浏览器请求 html。这就是 Rahul 的解决方案所做的,它是一个非常好的解决方案。这不是我最喜欢的,因为它很重,而且您只能查看 html 中显示的信息。这称为 "scraping",它非常脆弱(某些设计人员在页面周围移动了一些东西,您的脚本就会中断),并且信息采用人工呈现格式,这意味着您通常必须做很多小的解析工作。
- 您可以打开浏览器的开发工具,转到网络选项卡,过滤 XHR 请求,然后重新加载页面。其中之一请求获取用于填充占位符的数据。找出它是哪一个,然后您可以自己提出该请求。这在某些方面也可能很脆弱,例如,有时您必须拥有正确的 cookie,并且您经常必须试验浏览器发送的内容以确定您需要多少(通常比发送的少得多,这是适用于您的情况)。 提示:当你这样做时,将请求数据与解析和探索数据分开(即将其保存到文件中,然后在查看数据时从文件中获取数据而不是制作数据)每次都有一个新的请求...这样它就不会改变你,你也不会受到速率限制)
解决方案 #3
所以,我很好奇,于是自己尝试了第 3 种解决方案,效果非常好,请检查一下:
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => ",972"
listing['priceString'] # => ",890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]