Create text excerpt from HTML paragraphs in Rails
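I'm trying to extract an excerpt from an article (Markdown parsed to HTML) that contains only the plain text from its paragraphs. All HTML needs to be stripped, and line breaks, tabs and sequential whitespace need to be replaced with a single space.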
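As a minimal sketch of just the whitespace rule (not the HTML stripping), plain Ruby can handle it with one regular expression. The helper name `squish_whitespace` is hypothetical and only here for illustration; inside Rails, ActiveSupport's `String#squish` performs a similar normalization.

```ruby
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into a single space and trim both ends of the string.
def squish_whitespace(text)
  text.gsub(/\s+/, " ").strip
end

puts squish_whitespace("The spice\textends   life.\n\nThe spice expands consciousness. ")
```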
My first step was to create a simple test:
describe "#from_html" do
  it "creates an excerpt from given HTML" do
    html = "<p>The spice extends <b>life</b>.<br>The spice expands consciousness.</p>\n
    <ul><li>Skip me</li></ul>\n
    <p>The <i>spice</i> is vital to space travel.</p>"
    text = "The spice extends life. The spice expands consciousness. The spice is vital to space travel."
    expect(R::ExcerptHelper.from_html(html)).to eq(text)
  end
end
Then I started fiddling around and came up with this:
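def from_html(html)
  # Only <p> elements contribute to the excerpt; lists and other blocks are skipped.
  Nokogiri::HTML.parse(html).css("p").map { |node|
    node.children.map { |child|
      # A <br> becomes a space; every other child contributes its text, so inline tags are dropped.
      child.name == "br" ? " " : child.text
    } << " "
  }.join.strip.gsub(/\s+/, " ")
end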
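I'm a bit rusty with Rails, and this could probably be done more efficiently and elegantly. I'm hoping to get some pointers here.
Thanks in advance!
Approach 2
Moved on to the sanitize method (thanks @max) and writing a custom scrubber based on Rails::Html::PermitScrubber.
Approach 3
Realizing that my source documents are written in Markdown, I started exploring a custom Redcarpet renderer.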
See my answer for a full example.
I ended up writing a custom Redcarpet renderer (inspired by Redcarpet::Render::StripDown). This seems to be the cleanest approach, with the least parsing and conversion between formats.
module R::Markdown
  class ExcerptRenderer < Redcarpet::Render::Base
    # Methods where the first argument is the text content
    [
      # block-level calls
      :paragraph,
      # span-level calls
      :codespan, :double_emphasis,
      :emphasis, :underline, :raw_html,
      :triple_emphasis, :strikethrough,
      :superscript, :highlight, :quote,
      # footnotes
      :footnotes, :footnote_def, :footnote_ref,
      # low level rendering
      :entity, :normal_text
    ].each do |method|
      define_method method do |*args|
        args.first
      end
    end

    # Methods where content is replaced with an empty space
    [
      :autolink, :block_html
    ].each do |method|
      define_method method do |*|
        " "
      end
    end

    # Methods we are going to [snip]
    [
      :list, :image, :table, :block_code
    ].each do |method|
      define_method method do |*|
        " [#{method}] "
      end
    end

    # Other methods
    def link(link, title, content)
      content
    end

    def header(text, header_level)
      " #{text} "
    end

    def block_quote(quote)
      " “#{quote}” "
    end

    # Replace all whitespace with single space
    def postprocess(document)
      document.gsub(/\s+/, " ").strip
    end
  end
end
And to parse it:
extensions = {
  autolink: true,
  disable_indented_code_blocks: true,
  fenced_code_blocks: true,
  lax_spacing: true,
  no_intra_emphasis: true,
  strikethrough: true,
  superscript: true,
  tables: true
}
markdown = Redcarpet::Markdown.new(R::Markdown::ExcerptRenderer, extensions)
markdown.render(md).html_safe