Using Nokogiri to read and count word output: Undefined Method

I'm using Nokogiri to output a movie script, and I'd like to be able to run a word count on that output.

I've adapted the answer from "Getting viewable text words via Nokogiri", but when I run it I get an ActionController::RoutingError (undefined method 'frequencies') on this line:

puts frequencies(content)

Here is the code I'm running. I'm still very new to Rails, but I've done my best to clean up the code for readability:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'

class NokogiriController < ApplicationController
  page = 'http://www.imsdb.com/scripts/Authors-Anonymous.html'
  doc = Nokogiri::HTML(open(page))

  text = doc.css('b').remove
  text = doc.css('pre')

  content = text.to_s.scan(/\w+/)
  puts content.length, content.uniq.length, content.uniq.sort[0..8]

  def frequencies(content)
    Hash[
      content.group_by(&:downcase).map{ |word, instances|
        [word,instances.length]
        }.sort_by(&:last).reverse
      ]
  end

  puts frequencies(content)
end

Let's look at what you're doing:

require 'nokogiri'
require 'open-uri'

# URI.open is used because Ruby 3.0+ no longer lets a bare open() fetch URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))

doc.css('b').remove
text = doc.css('pre')
text 
# => [#<Nokogiri::XML::Element:0x3ff6686df65c name="pre" children=[#<Nokogiri::XML::Text:0x3ff6686df440 "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686def7c "\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686deb1c "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de694 "\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de20c ...

text.to_s 
# => "<pre>\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where...

text.to_s.scan(/\w+/) 
# => ["pre", "Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "H...

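You're capturing the tags and their attributes along with the embedded text, because calling to_s on a NodeSet (essentially an array of nodes) serializes the markup too. That's why "pre" shows up as the first "word". I don't think you want that.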

Instead, I'd do this:

require 'nokogiri'
require 'open-uri'

def frequencies(content)
  Hash[
    content.group_by(&:downcase).map{ |word, instances|
      [word,instances.length]
      }.sort_by(&:last).reverse
    ]
end

# URI.open is used because Ruby 3.0+ no longer lets a bare open() fetch URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))

doc.css('b').remove
text = doc.css('pre').map(&:text)
text 
# => ["\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where whe...

text.join(' ')
# => "\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where wher...

content = text.join(' ').scan(/\w+/) 
# => ["Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "His", "w...

frequencies(content)
# => {"the"=>827, "to"=>486, "i"=>398, "a"=>397, "s"=>284, "and"=>279, "in"=>273, "of"=>238, "hannah"=>234, "you"=>232, "henry"=>223, "it"=>214, "on"=>207, "her"=>200, "is"=>192, "his"=>178, "he"=>165, "for"=>162, "t"=>152, "that"=>151, "colette"=>148, "she"=>142, "at"=>137, "john"=>133, "alan"=>118, "this"=>112, "my"=>109, "up"=>105, "all"=>88, "william"=>88, "as"=>85, "what"=>84, "with"=>84, "but"=>83, "be"=>76, "camera"=>76, "not"=>74, "one"=>74, "can"=>73, "out"=>70, "m"=>69, "from"=>...

I inserted some extra steps so you can more easily see what's being returned at each stage. You can ignore those.
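The rewrite above also defines frequencies at the top level, before it is called, instead of inside the controller class. That difference is probably what makes the undefined-method error go away: in the original, frequencies is an instance method, but the call sits directly in the class body, where self is the class itself. Here is a minimal sketch of that behavior in plain Ruby; the Demo class is hypothetical and stands in for the controller:

```ruby
# Hypothetical stand-in for the controller; not part of the original code.
class Demo
  # An instance method: available on Demo.new, not on Demo itself.
  def frequencies(words)
    words.group_by(&:downcase).transform_values(&:length)
  end
end

# A bare call written in the class body is sent to the class (self == Demo),
# so it behaves like Demo.frequencies and fails.
error = begin
  Demo.frequencies(%w[a b a])
  nil
rescue NoMethodError => e
  e
end

# Calling it on an instance works as expected.
instance_result = Demo.new.frequencies(%w[a b a])

puts error.class      # NoMethodError
puts instance_result  # {"a"=>2, "b"=>1}
```

In a Rails app the usual arrangement would be to move the fetching and counting into a controller action (or a plain Ruby class), so that nothing runs while the class is being loaded.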

The idea is to ignore the tags, except for using them to grab the text content, which is what map(&:text) does.

Caveats:

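  • \w doesn't mean [a-z0-9]; it means [a-zA-Z0-9_], which matches variable names rather than what we'd think of as typical words.
  • Purely numeric values such as "14" and "2012" clutter the results unnecessarily. Using reject to remove all-digit entries would probably be a good idea, since those entries usually aren't very useful when determining keywords and the like.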
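Both caveats can be handled with small changes to the tokenizing step. The sketch below shows one possible approach on a made-up sample sentence. The letters-only pattern is an assumption about what should count as a word, not something the original code uses. Note also that \w does not match apostrophes, which is why fragments like "s" and "t" rank so high in the frequency list above.

```ruby
# Made-up sample text, not taken from the script.
sample = "July 14 2012: Henry's script, the_draft, and the script."

# The original tokenizer: digits and underscores count, apostrophes split words.
tokens = sample.scan(/\w+/)

# Caveat 2: drop entries that consist only of digits.
no_numbers = tokens.reject { |word| word.match?(/\A\d+\z/) }

# Caveat 1 (one possible alternative): match runs of letters, allowing an
# internal apostrophe, so "Henry's" stays whole and "_" separates words.
words = sample.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/)

puts tokens.inspect
# ["July", "14", "2012", "Henry", "s", "script", "the_draft", "and", "the", "script"]
puts no_numbers.inspect
# ["July", "Henry", "s", "script", "the_draft", "and", "the", "script"]
puts words.inspect
# ["July", "Henry's", "script", "the", "draft", "and", "the", "script"]
```

Either result can be passed to frequencies unchanged, since it only expects an array of strings.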