How to linkify internal text document cross-references?
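I'm making a web version of a collection of plain-text documents that contain passages like this:
...as found in article 6, depending on...
I'm writing code to add relative URL anchors (linkify):
...as found in <a href="article_6">article 6</a>, depending on...
I'm open to any programming language, and currently have Ruby + regex code that handles this simple case: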
with_single_article_links = html.gsub(/(article \d+)/i) do
  # Regexp.last_match(0) is the matched text, e.g. "article 6"
  last_match = Regexp.last_match(0)
  "<a href=\"#{last_match.gsub(' ', '_')}\">#{last_match}</a>"
end
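But I'm looking for ideas on how to handle more complex cases like these, where there are multiple references:
- ...as found in article 6 or 7, depending on...
- ...as found in article 6, 7 or 8, depending on...
- ...as found in article 6, 7 or 8 bis, depending on...
If I continue with my current code, I'll probably end up with two levels of regex: a first level that matches article \d+, and a second level that checks for one of these complex cases.
But is there another approach I could take? I'm open to any programming language and technique. This is basically a reality check on whether I'm using a decent approach.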
Update: an extended regex that works for now:
article (\d+)((, \d+)* or (\d+))?
Live demo: https://regex101.com/r/WHtM5C/1
The second group just needs some simple parsing of the comma-separated list.
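As a rough sketch of that two-step idea (match the whole reference first, then walk the numbers inside it), something like the following might work. The names `NUMBER`, `REFERENCE` and `linkify` are made up for illustration, and both the optional "bis" suffix and the `article_8_bis` href scheme are assumptions rather than anything fixed by the documents. Like the regex above, it only recognises lists that end in "or".

```ruby
# One article number, optionally followed by "bis" (the "bis" part is an assumption).
NUMBER = /\d+(?: bis\b)?/
# The word "article", then a single number or a list such as "6, 7 or 8".
REFERENCE = /(article) (#{NUMBER}(?:(?:, #{NUMBER})* or #{NUMBER})?)/i

def linkify(text)
  text.gsub(REFERENCE) do
    # Capture both groups before the inner gsub overwrites the match data.
    word = Regexp.last_match(1)
    list = Regexp.last_match(2)
    first = true
    # Second level: replace each number in the list with its own link.
    list.gsub(NUMBER) do |number|
      # Only the first link carries the word "article" in its text.
      label = first ? "#{word} #{number}" : number
      first = false
      # Hypothetical href scheme: spaces become underscores.
      %(<a href="article_#{number.tr(' ', '_')}">#{label}</a>)
    end
  end
end

puts linkify("as found in article 6, 7 or 8 bis, depending on")
# as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8_bis">8 bis</a>, depending on
```

The joiners (", " and " or ") are left untouched because the inner `gsub` only replaces the numbers themselves.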
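I know this looks like overkill and is very verbose, but the first thing that comes to mind is the builder pattern: split your input into tokens, then convert each token in the stream based on where you are in it.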
input = "as found in article 6 or 7, depending on\nas found in article 6, 7 or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

# Splits the text into tokens and hands each one to the builder.
class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse
    # The lookahead splits before every whitespace character or comma,
    # so each token keeps its leading separator.
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_number(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

# Builds the HTML, tracking how much of "as found in article" has been seen.
class HTMLBuilder
  attr_reader :html

  def initialize
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    html << token
  end

  def convert_other(token)
    # Any unrecognised word ends the current reference.
    @as = @found = @in = @article = @joiner = false
    html << token
  end

  def convert_number(token)
    token =~ /^\s*(\d+)/
    if @article
      if @joiner
        # Second or later number in the list: link the bare number.
        html << " <a href=\"article_#{$1}\">#{$1}</a>"
      else
        # First number: the link text includes the word "article".
        html << " <a href=\"article_#{$1}\">article #{$1}</a>"
      end
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    if @in
      # Swallow the word here; convert_number re-emits it inside the first link.
      @article = true
      @joiner = false
    else
      html << token
    end
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html
=>
output:
as found in <a href="article_6">article 6</a> or <a href="article_7">7</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8">8</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8">8</a> bis, depending on
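I'm adding a second answer because I didn't want to make any major changes to the first one after it had been upvoted.
As you can see, this is a kind of state machine, so you can start "building" a number the first time you see digits, then finish it when you reach a token that marks the end of the number definition. If building numbers gets complicated, you could even start a nested builder, i.e. a NumberBuilder, send it tokens until you reach the end of the number definition, and then ask that builder for the number.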
input = "as found in article 6 or 7, depending on\nas found in article 6, 7 bis or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

# Splits the text into tokens and hands each one to the builder.
class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse
    # The lookahead splits before every whitespace character or comma,
    # so each token keeps its leading separator.
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_digits(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      when /^\s*bis$/
        builder.convert_bis(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

# Builds the HTML. A number is held in @number until the next token shows
# whether it is complete (joiner, other word) or continues ("bis").
class HTMLBuilder
  attr_reader :html

  def initialize
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    # Flush the pending number before marking the joiner, so the first
    # number in a list is still rendered with the word "article".
    process_number if @number
    @joiner = true
    html << token
  end

  def convert_other(token)
    process_number if @number
    @as = @found = @in = @article = @joiner = @number = false
    html << token
  end

  def convert_digits(token)
    process_number if @number
    @number = token
  end

  def convert_bis(token)
    if @number
      @number << token
      process_number
    else
      html << token
    end
  end

  def process_number
    token = @number
    @number = false
    token =~ /^\s*(\d+)(.*)/
    digits, suffix = $1, $2
    if @article
      # Spaces become underscores in the href, mirroring the question's gsub(' ', '_').
      href = "article_#{digits}#{suffix.tr(' ', '_')}"
      label = "#{digits}#{suffix}"
      label = "article #{label}" unless @joiner
      html << " <a href=\"#{href}\">#{label}</a>"
    else
      html << token
    end
  end

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    if @in
      # Swallow the word here; process_number re-emits it inside the first link.
      @article = true
      @joiner = false
    else
      html << token
    end
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html
=>
output:
as found in <a href="article_6">article 6</a> or <a href="article_7">7</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7_bis">7 bis</a> or <a href="article_8">8</a>, depending on
as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8_bis">8 bis</a>, depending on
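The nested-builder idea mentioned above could look roughly like this. It is only a sketch: the `NumberBuilder` interface (`accept`, `label`, `href`), the `SUFFIXES` list and the underscore-joined href are all assumptions made for illustration, not part of the code in this answer.

```ruby
# Hypothetical nested builder: collects the tokens that make up one article
# number (digits plus an optional suffix) and reports the finished number.
class NumberBuilder
  SUFFIXES = %w[bis].freeze # assumption: "bis" is the only suffix handled

  def initialize
    @digits = nil
    @suffix = nil
  end

  # Returns true if the token belongs to the number being built,
  # false if the token marks the end of the number definition.
  def accept(token)
    word = token.strip
    if @digits.nil? && word.match?(/\A\d+\z/)
      @digits = word
      true
    elsif @digits && @suffix.nil? && SUFFIXES.include?(word)
      @suffix = word
      true
    else
      false
    end
  end

  # Link text, e.g. "7 bis".
  def label
    [@digits, @suffix].compact.join(" ")
  end

  # Link target, e.g. "article_7_bis" (hypothetical scheme).
  def href
    "article_" + [@digits, @suffix].compact.join("_")
  end
end

# Tokens in the shape the reader's split produces for " 7 bis or 8".
tokens = [" 7", " bis", " or", " 8"]
number = NumberBuilder.new
# Feed tokens until the builder rejects one; the rest go back to the outer builder.
rest = tokens.drop_while { |token| number.accept(token) }

puts number.label  # 7 bis
puts number.href   # article_7_bis
puts rest.inspect  # [" or", " 8"]
```

The outer builder would create a fresh NumberBuilder each time it sees digits, so supporting another suffix only means touching this one class.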