How to assign a Nokogiri Element to hash key
I'm scraping Techcrunch.com and grabbing the title, URL, and preview text of each article.
I have:
require 'nokogiri'
require 'open-uri'

class TestScraper::Scraper
  def initialize(url)
    # Fetch and parse the page once per scraper instance
    @doc = Nokogiri::HTML(URI.open(url))
  end

  def scrape_tech_crunch
    articles = @doc.css("h2.post-block__title").css("a")
    top_stories = articles.each do |story|
      stories = {
        :title => story.children.text.strip,
        :url => story.attribute("href").value,
        :preview => @doc.css("div.post-block__content").children.first.text
      }
      TestScraper::Article.new(stories)
    end
  end
end
TestScraper::Article.new(stories)
takes a hash as an argument and uses it to initialize the Article class:
class TestScraper::Article
  attr_accessor :title, :url, :preview

  @@all = []

  def initialize(hash)
    # Call the matching setter for each key, e.g. :title => title=
    hash.each do |k, v|
      self.send "#{k}=", v
    end
    @@all << self
  end

  def self.all
    @@all
  end
end
When I run TestScraper::Scraper.new("https://techcrunch.com").scrape_tech_crunch
I get:
[#<TestScraper::Article:0x00000000015f69e0
@preview=
"\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
@title=
"Millions downloaded dozens of Android apps on Google Play infected with adware",
@url=
"https://techcrunch.com/2019/10/24/millions-dozens-android-apps-adware/">,
#<TestScraper::Article:0x00000000015f5658
@preview=
"\n\t\tSecurity researchers have found dozens of Android apps in the Google Play store serving ads to unsuspecting victims as part of a money-making scheme. ESET researchers found 42 apps containing adware, \t",
@title="Netflix launches mobile-only monthly plan in Malaysia",
@url=
"https://techcrunch.com/2019/10/24/netflix-malaysia-mobile-only-cheap-plan/">
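It creates an object with the correct title and URL for each instance of the Article class, but it assigns the same preview text to every Article instance. There should be 20 articles, each with its own "preview" (the short excerpt you see before clicking the link to read the full article).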
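The problem you're running into is that
@doc.css("div.post-block__content").children.first.text
selects the same node for every story, because you call it on the global document @doc. The search covers the whole page, and .children.first always returns the first match, whichever story the loop is on.
Instead, find the closest common parent node of each story and navigate down from there: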
@doc.css('.post-block').map do |story|
  # navigate down from the selected node
  title = story.at_css('h2.post-block__title a')
  preview = story.at_css('div.post-block__content')
  TestScraper::Article.new(
    title: title.content.strip,
    url: title['href'], # key must match the :url accessor on Article
    preview: preview.content.strip
  )
end
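If any of the methods used here raise questions, have a look at the Nokogiri cheat sheet. If you still have questions after that, don't be afraid to ask in the comments.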