How to linkify internal text document cross-references?

Question

I am building a web version of a collection of plain-text documents that contain cross-references like this:

...as found in article 6, depending on...

I am writing code to add relative URL anchors (linkify) to them:

...as found in <a href="article_6">article 6</a>, depending on...

I am open to any programming language. Currently I have Ruby + regex code that handles this simple case:

    with_single_article_links = html.gsub(/(article \d+)/i) do
      # Text of the current match, e.g. "article 6"
      reference = Regexp.last_match(1)
      "<a href=\"#{reference.gsub(' ', '_')}\">#{reference}</a>"
    end

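But I am looking for ideas on handling more complex cases like these, where several articles are referenced at once:

...as found in article 6 or 7, depending on...
...as found in article 6, 7 or 8, depending on...
...as found in article 6, 7 or 8 bis, depending on...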

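If I continue with my current code, I will probably end up with two levels of regex: the first matches article \d+, and the second checks for one of these complex cases.

But is there another approach I could take? I am open to any programming language and technique. This is basically a reality check for me that I am using a decent approach.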

Update: an extended regex that works for now:

article (\d+)((, \d+)* or (\d+))?

Live demo: https://regex101.com/r/WHtM5C/1

The second group then only needs some simple parsing of the comma-separated list.
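A minimal sketch of how that extended regex could drive the linkification, assuming the href scheme article_N from the first example. The helper name linkify_articles and the constant ARTICLE_LIST are hypothetical, and the grouping is adjusted slightly from the regex above (the word "article" is captured, the inner list groups are non-capturing). Like the regex in the update, it does not handle "bis".

```ruby
# Sketch only: linkify lists such as "article 6, 7 or 8".
# ARTICLE_LIST and linkify_articles are illustrative names, not from the question.
ARTICLE_LIST = /(article) (\d+)((?:, \d+)* or \d+)?/i

def linkify_articles(text)
  text.gsub(ARTICLE_LIST) do
    match = Regexp.last_match
    word  = match[1]      # "article", with the casing found in the text
    first = match[2]      # first number in the list
    rest  = match[3].to_s # e.g. ", 7 or 8"; empty for a single reference

    # The first link keeps the word "article" in its visible text.
    head = "<a href=\"article_#{first}\">#{word} #{first}</a>"
    # Second pass: every further number becomes its own link,
    # while the separators (", " and " or ") are left untouched.
    tail = rest.gsub(/\d+/) { |n| "<a href=\"article_#{n}\">#{n}</a>" }
    head + tail
  end
end

puts linkify_articles("as found in article 6, 7 or 8, depending on")
# prints: as found in <a href="article_6">article 6</a>, <a href="article_7">7</a> or <a href="article_8">8</a>, depending on
```

This is effectively the two-level approach: the outer regex finds the whole reference phrase, and the inner gsub handles the list.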

Answer 1

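I know this looks like overkill and is very verbose, but the first thing that comes to mind is the builder pattern: split your input into tokens, then convert the token stream one token at a time, depending on where you are in it.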

input = "as found in article 6 or 7, depending on\nas found in article 6, 7 or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse()
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_number(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

class HTMLBuilder
  attr_reader :html

  def initialize()
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    html << token
  end

  def convert_other(token)
    @as = @found = @in = @article = @joiner = false
    html << token
  end

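  def convert_number(token)
    # Digits of the article number, without the leading whitespace
    number = token[/\d+/]
    if @article
      if @joiner
        html << " <a href=\"article_#{number}\" #{number}>"
      else
        # First number after "article": the word itself was swallowed by
        # convert_article, so it is written out here.
        html << " <a href=\"article_#{number}\" article #{number}>"
      end
    else
      html << token
    end
  end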

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    @article = true if @in
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html


=>
output:
as found in <a href="article_6" article 6> or <a href="article_7" 7>, depending on
as found in <a href="article_6" article 6>, <a href="article_7" 7> or <a href="article_8" 8>, depending on
as found in <a href="article_6" article 6>, <a href="article_7" 7> or <a href="article_8" 8> bis, depending on

Answer 2

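I am adding a second answer because I did not want to make any major changes to the first one after it had been voted on.

As you can see, this is a kind of state machine, so you can start "building" a number when you first see digits and finish it when you reach a token that signals the end of the number definition. If number building gets complicated, you could even start a nested builder, i.e. a NumberBuilder, and send tokens to it until you reach the end of the number definition, then ask that builder for the number.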

input = "as found in article 6 or 7, depending on\nas found in article 6, 7 bis or 8, depending on\nas found in article 6, 7 or 8 bis, depending on"

class TextReader
  attr_reader :builder, :text

  def initialize(text, builder)
    @text = text
    @builder = builder
  end

  def parse()
    stream = text.split(/(?=\s|,)/)
    stream.each do |token|
      case token
      when /^\s+$/
        builder.convert_space(token)
      when /^\s*,$/, /^\s+or$/
        builder.convert_joiner(token)
      when /^\s*\d+$/
        builder.convert_digits(token)
      when /^\s*as$/
        builder.convert_as(token)
      when /^\s*found$/
        builder.convert_found(token)
      when /^\s*in$/
        builder.convert_in(token)
      when /^\s*article$/
        builder.convert_article(token)
      when /^\s*bis$/
        builder.convert_bis(token)
      else
        builder.convert_other(token)
      end
    end
  end
end

class HTMLBuilder
  attr_reader :html

  def initialize()
    @html = ""
  end

  def convert_space(token)
    html << token
  end

  def convert_joiner(token)
    @joiner = true
    process_number if @number
    html << token
  end

  def convert_other(token)
    process_number if @number
    @as = @found = @in = @article = @joiner = @number = false
    html << token
  end

  def convert_digits(token)
    @number = token
  end

  def convert_bis(token)
    if @number
      @number << token
      process_number
    else
      html << token
    end
  end

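  def process_number()
    token = @number
    @number = false
    # digits: the article number; suffix: anything appended to it, e.g. " bis"
    match = token.match(/^\s*(\d+)(.*)/)
    digits, suffix = match[1], match[2]
    if @article
      if @joiner
        html << " <a href=\"article_#{digits}#{suffix}\" #{digits}#{suffix}>"
      else
        html << " <a href=\"article_#{digits}#{suffix}\" article #{digits}#{suffix}>"
      end
    else
      html << token
    end
  end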

  def convert_as(token)
    @as = true
    html << token
  end

  def convert_found(token)
    @found = true if @as
    html << token
  end

  def convert_in(token)
    @in = true if @found
    html << token
  end

  def convert_article(token)
    @article = true if @in
  end
end

builder = HTMLBuilder.new
reader = TextReader.new(input, builder)
reader.parse
puts "output:"
puts builder.html

=>
output:
as found in <a href="article_6" 6> or <a href="article_7" 7>, depending on
as found in <a href="article_6" 6>, <a href="article_7 bis" 7 bis> or <a href="article_8" 8>, depending on
as found in <a href="article_6" 6>, <a href="article_7" 7> or <a href="article_8 bis" 8 bis>, depending on
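The nested-builder idea mentioned above is not part of the code in this answer. As a rough sketch of what it might look like, here is a hypothetical NumberBuilder that accepts tokens until the number definition ends. The method names (accept, href, label), the SUFFIXES list, and the underscore-joined href are all assumptions made for illustration, and unlike the output above it emits a complete anchor element.

```ruby
# Hypothetical nested builder: collects the tokens that make up one
# article number (digits plus an optional suffix such as "bis").
class NumberBuilder
  # Assumption: "bis" is the only suffix, as in the sample input.
  SUFFIXES = %w[bis].freeze

  def initialize
    @digits = nil
    @suffix = nil
  end

  # Returns true if the token was consumed as part of the number,
  # false if it signals the end of the number definition.
  def accept(token)
    word = token.strip
    if @digits.nil? && word.match?(/\A\d+\z/)
      @digits = word
      true
    elsif @digits && @suffix.nil? && SUFFIXES.include?(word)
      @suffix = word
      true
    else
      false
    end
  end

  # Link target, e.g. "article_7_bis" (underscore join is an assumption).
  def href
    ["article", @digits, @suffix].compact.join("_")
  end

  # Visible link text, e.g. "7 bis".
  def label
    [@digits, @suffix].compact.join(" ")
  end
end

number = NumberBuilder.new
[" 7", " bis", " or"].each do |token|
  break unless number.accept(token) # " or" ends the number definition
end
puts %(<a href="#{number.href}">#{number.label}</a>)
# prints: <a href="article_7_bis">7 bis</a>
```

The outer builder would create one NumberBuilder when it first sees digits, forward tokens to it, and ask it for href and label once accept returns false.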
