C# regular expression for finding a certain pattern in a text

Question

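I am trying to write a program that can replace the Bible verses in a document with any desired translation. This is useful for older books that contain a lot of KJV-referenced verses. The hardest part of the process is coming up with a way to extract the verses from a document.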

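I have found that most books that place Bible verses within the text use a structure like "N" (BookName chapter#:verse#s), where N is the verse text, the quotation marks are literal, and the parentheses are also literal. I have been having trouble coming up with a regular expression that matches these in text.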

The latest regular expression I have tried is: \"(.+)\"\s*\(([\w. ]+[0-9\s]+[:][\s0-9\-]+.*)\). The problem is that it does not find all of the matches.

Here is a regex101 example of it: https://regex101.com/r/eS5oT8/1

Is there a way to solve this with a regular expression? Any help or advice would be greatly appreciated.

Answer 1

Use the "g" modifier.

g modifier: global. All matches (don't return on first match)

See the Regex Demo.

Answer 2

You can try the example given on MSDN; here is the link:

https://msdn.microsoft.com/en-us/library/0z2heewz(v=vs.110).aspx

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string input = "ablaze beagle choral dozen elementary fanatic " +
                     "glaze hunger inept jazz kitchen lemon minus " +
                     "night optical pizza quiz restoration stamina " +
                     "train unrest vertical whiz xray yellow zealous";
      string pattern = @"\b\w*z+\w*\b";
      Match m = Regex.Match(input, pattern);
      while (m.Success) {
         Console.WriteLine("'{0}' found at position {1}", m.Value, m.Index);
         m = m.NextMatch();
      }   
   }
}
// The example displays the following output:
//    'ablaze' found at position 0
//    'dozen' found at position 21
//    'glaze' found at position 46
//    'jazz' found at position 65
//    'pizza' found at position 104
//    'quiz' found at position 110
//    'whiz' found at position 157
//    'zealous' found at position 174

Answer 3

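After adding "g", also be careful if several verses appear with no '\n' characters between them, because "(.*)" will treat them as one long match instead of several verses. You will want something like "([^"]*)" to prevent that from happening.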
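To make the difference concrete, here is a minimal sketch that counts matches for both forms against one line holding two quotations. The sample text, the class name GreedyDemo, and the helper CountMatches are made up for illustration, and the patterns are trimmed down to the quote plus the opening parenthesis.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: counts how many times the pattern matches the text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        // Made-up sample: two quotations on a single line, no '\n' between them.
        string text = "\"First quote\" (John 3:16) and \"second quote\" (Mark 12:31)";

        // Greedy: .* runs from the first opening quote to the last closing quote.
        string greedy = @"""(.*)""\s*\(";

        // Negated class: [^""]* cannot cross a quotation mark, so each quote stays separate.
        string negated = @"""([^""]*)""\s*\(";

        Console.WriteLine(CountMatches(text, greedy));   // prints 1
        Console.WriteLine(CountMatches(text, negated));  // prints 2
    }
}
```

The greedy form swallows everything between the first and the last quotation mark, which is why it reports a single match here.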

Answer 4

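It's worth mentioning that the site you are testing with relies on JavaScript regular expressions, which need the g modifier to be set explicitly, unlike C#, where Regex.Matches() returns every match by default.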

You can tweak the expression a little and make sure the double quotes are escaped properly:

// Updated expression with escaped double-quotes and other minor changes
var regex = new Regex(@"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");

Then use the Regex.Matches() method to find all of the matches in the string:

// Find each of the matches and output them
foreach(Match m in regex.Matches(input))
{
     // Output each match here (using Console Example)
     Console.WriteLine(m.Value);
}

You can see it in action in this working example.
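For reference, the two snippets above can be combined into one self-contained program. This is only a sketch: the class name VerseExtractor, the method Extract, and the sample input are assumptions added for illustration, while the pattern is the one given above. Group 1 holds the quoted text and group 2 the reference inside the parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerseExtractor
{
    // Pattern taken from the answer above: a quoted passage followed by a parenthesized reference.
    private static readonly Regex VersePattern = new Regex(
        @"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");

    // Hypothetical helper: returns (quote, reference) pairs for every match in the input.
    public static List<(string Quote, string Reference)> Extract(string input)
    {
        var results = new List<(string Quote, string Reference)>();
        foreach (Match m in VersePattern.Matches(input))
        {
            results.Add((m.Groups[1].Value, m.Groups[2].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Made-up sample input with two quotations on one line.
        string input = "He said, \"Love one another\" (John 13:34). Also \"Jesus wept\" (John 11:35).";

        foreach (var (quote, reference) in Extract(input))
        {
            Console.WriteLine("{0} => {1}", reference, quote);
        }
        // Prints:
        //   John 13:34 => Love one another
        //   John 11:35 => Jesus wept
    }
}
```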

Answer 5

How about this as a guide to get started:

(?<quote>"".+"")          # a series of any characters in quotes 
\s+                       # followed by whitespace
\(                        # followed by a parenthetical expression
   (?<book>\d*[a-z.\s]*)  # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
   (?<chapter>\d+)        # chapter e.g. the '1' in 1:2
   :                      # colon
   (?<verse>\d+)          # verse e.g. the '2' in 1:2
\)

With the options:

RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase

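The expression above gives you a named capture for each element of the match, which makes parsing easy (e.g. you can pick out the verse by looking at match.Groups["verse"]).

Full code: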

var input = @"Jesus said, ""'Love your neighbor as yourself.' 
            There is no commandment greater than these"" (Mark 12:31).";

var bibleQuotesRegex =
    @"(?<quote>"".+"")              # a series of any characters in quotes 
    \s+                             # followed by whitespace
    \(                              # followed by a parenthetical expression
            (?<book>\d*[a-z.\s]*)   # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
            (?<chapter>\d+)         # chapter e.g. the '1' in 1:2
            :                       # colon
            (?<verse>\d+)           # verse e.g. the '2' in 1:2
    \)";
foreach (Match match in Regex.Matches(input, bibleQuotesRegex, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase))
{
    var bibleQuote = new
    {
        Quote = match.Groups["quote"].Value,
        Book = match.Groups["book"].Value,
        Chapter = int.Parse(match.Groups["chapter"].Value),
        Verse = int.Parse(match.Groups["verse"].Value)
    };

    //do something with it.
}
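Since the goal in the question is to swap each quotation for another translation, the named groups can also drive Regex.Replace with a match evaluator. The following is a rough sketch under several assumptions: the class VerseReplacer, the method ReplaceQuotes, the lookup dictionary, and its "Book chapter:verse" key format are all hypothetical, and the quote group uses [^""]+ instead of .+ so that neighbouring quotations stay separate.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerseReplacer
{
    // Same shape as the pattern above; the quote group excludes the quotation marks
    // and uses [^""]+ rather than .+ (an assumption of this sketch).
    private static readonly Regex BibleQuote = new Regex(
        @"""(?<quote>[^""]+)""        # quoted verse text
          \s+                         # whitespace
          \(
            (?<book>\d*[a-z.\s]*)     # book name, e.g. '1 Cor.'
            (?<chapter>\d+)           # chapter
            :                         # colon
            (?<verse>\d+)             # verse
          \)",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Hypothetical helper: 'translation' maps keys such as "John 11:35" to replacement text.
    public static string ReplaceQuotes(string text, IDictionary<string, string> translation)
    {
        return BibleQuote.Replace(text, m =>
        {
            string key = string.Format("{0} {1}:{2}",
                m.Groups["book"].Value.Trim(),
                m.Groups["chapter"].Value,
                m.Groups["verse"].Value);

            string replacement;
            if (!translation.TryGetValue(key, out replacement))
            {
                return m.Value; // unknown reference: leave the original text alone
            }
            return "\"" + replacement + "\" (" + key + ")";
        });
    }
}
```

A call such as ReplaceQuotes(text, new Dictionary<string, string> { { "John 11:35", "Jesus began to weep." } }) would rewrite only the quotations whose reference appears in the dictionary. Verse ranges such as 3:16-17 are not handled by this pattern.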

c#

regex

text-extraction