Using Nokogiri to read and count word output: Undefined Method
I'm using Nokogiri to output a movie script, and I'd like to run a word count on that output.
I adapted the answer from "Getting viewable text words via Nokogiri", but when I run it I get ActionController::RoutingError (undefined method 'frequencies') on this line:
puts frequencies(content)
Here is the code I'm running. I'm still very new to Rails, but I've done my best to clean up the code for readability:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'
class NokogiriController < ApplicationController
  page = 'http://www.imsdb.com/scripts/Authors-Anonymous.html'
  doc = Nokogiri::HTML(open(page))
  text = doc.css('b').remove
  text = doc.css('pre')
  content = text.to_s.scan(/\w+/)
  puts content.length, content.uniq.length, content.uniq.sort[0..8]
  def frequencies(content)
    Hash[
      content.group_by(&:downcase).map{ |word, instances|
        [word,instances.length]
      }.sort_by(&:last).reverse
    ]
  end
  puts frequencies(content)
end
Let's look at what you're doing:
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))
doc.css('b').remove
text = doc.css('pre')
text
# => [#<Nokogiri::XML::Element:0x3ff6686df65c name="pre" children=[#<Nokogiri::XML::Text:0x3ff6686df440 "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686def7c "\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686deb1c "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de694 "\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de20c ...
text.to_s
# => "<pre>\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where...
text.to_s.scan(/\w+/)
# => ["pre", "Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "H...
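You're capturing the tags and the parameters for those tags, along with the embedded text, as a NodeSet (in other words, an array of nodes). I don't think you want to do that.
Instead, I'd do this: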
require 'nokogiri'
require 'open-uri'
def frequencies(content)
  Hash[
    content.group_by(&:downcase).map{ |word, instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))
doc.css('b').remove
text = doc.css('pre').map(&:text)
text
# => ["\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where whe...
text.join(' ')
# => "\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where wher...
content = text.join(' ').scan(/\w+/)
# => ["Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "His", "w...
frequencies(content)
# => {"the"=>827, "to"=>486, "i"=>398, "a"=>397, "s"=>284, "and"=>279, "in"=>273, "of"=>238, "hannah"=>234, "you"=>232, "henry"=>223, "it"=>214, "on"=>207, "her"=>200, "is"=>192, "his"=>178, "he"=>165, "for"=>162, "t"=>152, "that"=>151, "colette"=>148, "she"=>142, "at"=>137, "john"=>133, "alan"=>118, "this"=>112, "my"=>109, "up"=>105, "all"=>88, "william"=>88, "as"=>85, "what"=>84, "with"=>84, "but"=>83, "be"=>76, "camera"=>76, "not"=>74, "one"=>74, "can"=>73, "out"=>70, "m"=>69, "from"=>...
I inserted some extra steps so you can more easily see what is being returned. You can ignore those.
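In the same spirit, the frequencies helper can be unpacked into its intermediate values. This is only a sketch: `frequency_steps` and the sample word list are hypothetical names introduced for illustration, and the logic is the same group, count, and sort chain used above.

```ruby
# Hypothetical helper: returns each intermediate value of the frequency count.
def frequency_steps(words)
  # Group the words case-insensitively, keyed by the downcased form.
  grouped = words.group_by(&:downcase)
  # Turn each group into a [word, count] pair.
  counted = grouped.map { |word, instances| [word, instances.length] }
  # Sort ascending by count, then reverse so the most frequent word comes first.
  sorted = counted.sort_by(&:last).reverse
  { grouped: grouped, counted: counted, sorted: sorted, result: Hash[sorted] }
end

steps = frequency_steps(%w[The the cat Cat the dog]) # made-up sample input
steps[:grouped]
# => {"the"=>["The", "the", "the"], "cat"=>["cat", "Cat"], "dog"=>["dog"]}
steps[:result]
# => {"the"=>3, "cat"=>2, "dog"=>1}
```

Note that `sort_by` is not guaranteed to be stable, so words with equal counts may come out in any order relative to each other.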
The idea is to ignore the tags, except for using them to grab the text content, which is what map(&:text) does.
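As for the undefined method error itself, the most likely cause is scope rather than Nokogiri. In the question's code the scraping statements sit directly in the class body, where `self` is the class, while `def frequencies` defines an instance method. The sketch below reproduces that situation without Rails; `ScriptStats` and `CLASS_BODY_ERROR` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the controller; no Rails required.
class ScriptStats
  # `def` inside a class body defines an instance method.
  def frequencies(words)
    words.group_by(&:downcase)
         .map { |word, instances| [word, instances.length] }
         .sort_by(&:last)
         .reverse
         .to_h
  end

  # Statements in the class body run with `self` set to the class itself,
  # so the instance method is not visible here and the call raises.
  begin
    frequencies(%w[a b a])
  rescue NoMethodError => e
    CLASS_BODY_ERROR = e.message
  end
end

# Calling the method on an instance works as expected.
ScriptStats.new.frequencies(%w[a b a])
# => {"a"=>2, "b"=>1}
```

In a Rails controller the usual arrangement would be to move the scraping and the call into an action method (or a separate model or service object) so that it runs on an instance. Rails presumably reports the failure as a RoutingError because it happens while the controller class is being loaded.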
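Things to note:
- \w doesn't mean [a-zA-Z0-9]; it means [a-zA-Z0-9_], which matches variable names rather than what we'd think of as typical words.
- Purely numeric values (for example "14" and "2012") clutter the results unnecessarily. Using reject to remove all-numeric entries could be good, since those usually aren't very useful when determining keywords and the like.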
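Both notes can be tried out on a small string. This is a minimal sketch: the sample text, the method names, and the letters-plus-apostrophe pattern are assumptions made for illustration, not part of the original answer.

```ruby
# Hypothetical sample text standing in for the scraped script.
SAMPLE = "July 14 2012. HENRY can't find his_notes; it's 1940 again."

# \w+ treats digits and underscores as word characters and splits on apostrophes.
def naive_words(text)
  text.scan(/\w+/)
end

# Drop entries that consist entirely of digits.
def without_numbers(words)
  words.reject { |word| word.match?(/\A\d+\z/) }
end

# Assumed alternative pattern: letters only, allowing internal apostrophes.
def letter_words(text)
  text.scan(/[a-z]+(?:'[a-z]+)*/i)
end

naive_words(SAMPLE)
# => ["July", "14", "2012", "HENRY", "can", "t", "find", "his_notes", "it", "s", "1940", "again"]
without_numbers(naive_words(SAMPLE))
# => ["July", "HENRY", "can", "t", "find", "his_notes", "it", "s", "again"]
letter_words(SAMPLE)
# => ["July", "HENRY", "can't", "find", "his", "notes", "it's", "again"]
```

The stray "t" and "s" entries from the naive scan are consistent with the high counts for "s" and "t" in the frequency output above, which probably come from contractions and possessives. Whether to keep apostrophes together depends on what you count as a word.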