Remove all lines (comments) starting with "**" by using Regex (.NET Framework, C#)

Question

I am developing an application that reads and processes text files. These text files have the following structure:

** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6

I read the entire text file into memory with string text = File.ReadAllText(file); however, I want to remove all lines that are comments, i.e. all lines starting with "**".

This can be achieved with the following method:

// this method also removes any white-spaces (this is intended)
string RemoveComments(string textWithComments)
{
    string textWithoutComments = null;

    string[] split = Regex.Split(textWithComments.Replace(" ", null), "\r\n|\r|\n");
    foreach (string line in split)
        if (line.Length >= 2 && line[0] == '*' && line[1] == '*') continue;
        else textWithoutComments += line + "\r\n";

    return textWithoutComments;
}

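However, this is actually very slow for large files. I also think the whole method could be replaced by a single line of code (possibly by using a regular expression). How can I do that? (I have never used regular expressions before.)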

PS: I would also like to avoid StreamReaders.

Edit

A sample file looks like this:

** Initial comment
*Command-0
** Some Comment: Header: Text
** Some text: text
*Command-1
**
** Some comment or text
**
*Command-2
*Command-3
      1,            2,            3
      2,            2,            4
      3,            2,            5
** END COMMENT

Answer 1

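Concatenating strings with += allocates a new string and copies the old contents every time the string grows, so the total work increases roughly quadratically with the size of the file.

A StringBuilder reallocates far less often and should significantly reduce the run time: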

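string RemoveComments(string textWithComments)
{
    StringBuilder textWithoutComments = new StringBuilder(textWithComments.Length);

    // Split on whole line breaks; splitting on '\r' and '\n' separately
    // would produce an extra empty entry for every "\r\n" pair.
    string[] split = textWithComments.Replace(" ", string.Empty)
        .Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    foreach (string line in split)
    {
        if (line.StartsWith("**", StringComparison.Ordinal)) continue; // skip comment lines
        textWithoutComments.Append(line).Append("\r\n");
    }

    return textWithoutComments.ToString();
}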

Edited at 阿鸾's suggestion.

Answer 2

Why not just:

var text = @"** A comment
* A command
Data, data, data
** Some other comment
* Another command
1, 2, 3
4, 5, 6";

// Version 1: leaves a \n at the beginning of the string if the text starts with a comment.
var textWithoutComments = Regex.Replace(text, @"(^|\n)\*\*.*(?=\n)", string.Empty);

// Version 2: deals with that problem, at the cost of a longer regex that treats the first line
// differently from the other lines (it consumes the line break rather than leaving it in the text).
// It expects \r\n line endings. Named differently here so that both statements compile side by side.
var textWithoutComments2 = Regex.Replace(text, @"(^\*\*.*\r\n)|((\r\n)\*\*.*($|(?=\r\n)))", string.Empty);
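
Editor's sketch, not part of the original answer: the second pattern above anchors `^` only at the very start of the text, so when a file opens with two comment lines in a row (as the sample file in the question does), the second one is left in place. A possible alternative is to let `RegexOptions.Multiline` anchor `^` at every line start. The class and method names below (`CommentRegex`, `RemoveCommentLines`) are made up for illustration, and the pattern assumes `\r\n` or `\n` line endings; it does not strip spaces, which would still need a separate `Replace(" ", string.Empty)`.

```csharp
using System.Text.RegularExpressions;

static class CommentRegex
{
    // Multiline makes ^ match at the start of every line, not only at the start of the text.
    // [^\r\n]* stays on the current line; (?:\r?\n|$) consumes the line break,
    // or stops at the end of the text when the last line is a comment.
    private static readonly Regex CommentLine =
        new Regex(@"^\*\*[^\r\n]*(?:\r?\n|$)", RegexOptions.Multiline | RegexOptions.Compiled);

    // Hypothetical helper: returns the text with every line starting with "**" removed.
    public static string RemoveCommentLines(string text)
    {
        return CommentLine.Replace(text, string.Empty);
    }
}
```

Called as `CommentRegex.RemoveCommentLines(File.ReadAllText(file))`, this keeps the "one line at the call site" shape the question asks for. Whether it beats the StringBuilder loop on large files would have to be measured.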

I don't know about the performance; I have no test data at hand...

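PS: I also tend to believe that if you want the best performance, some kind of streaming would probably be ideal; you can always return a string from the method if that makes later processing easier. I think most people in this thread are suggesting a StreamReader for the iteration/reading/interpreting part, regardless of which return type you decide to build.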
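
Editor's sketch, not part of the original answer: one way to combine line-by-line reading with a string return value, while still starting from the text that `File.ReadAllText` already loaded, is a `StringReader` feeding a `StringBuilder`. The names `CommentFilter` and `RemoveComments` are illustrative only. Like the method in the question, this strips all spaces before testing for `**` and writes `\r\n` after every kept line.

```csharp
using System;
using System.IO;
using System.Text;

static class CommentFilter
{
    // Hypothetical helper: reads the in-memory text line by line and
    // returns it without spaces and without lines starting with "**".
    public static string RemoveComments(string textWithComments)
    {
        var result = new StringBuilder(textWithComments.Length);
        using (var reader = new StringReader(textWithComments))
        {
            string line;
            // StringReader.ReadLine treats "\r\n", "\r" and "\n" as line breaks.
            while ((line = reader.ReadLine()) != null)
            {
                string compact = line.Replace(" ", string.Empty);
                if (compact.StartsWith("**", StringComparison.Ordinal))
                {
                    continue; // skip comment lines
                }

                result.Append(compact).Append("\r\n");
            }
        }

        return result.ToString();
    }
}
```

Because the spaces are removed before the check, a line such as `* *x` would also be treated as a comment; that mirrors the behavior of the code in the question rather than fixing it.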

Answer 3

I know you said you don't want to use a StreamReader, but the code below processes 400,000 lines in under half a second on my machine. It is simple, straightforward, and fast.

static void RemoveCommentsAndWhitespace(string filePath)
{
    if (!File.Exists(filePath))
    {
        Console.WriteLine($"ERR: The file '{filePath}' does not exist.");
        return; // nothing to process
    }

    string outfile = filePath + ".out";

    // Classic using statements: using declarations require C# 8,
    // which is not the default language version for .NET Framework 4.8.
    using (StreamReader sr = new StreamReader(filePath))
    using (StreamWriter sw = new StreamWriter(outfile))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string tmp = line.Replace(" ", string.Empty);
            if (tmp.StartsWith("**", StringComparison.Ordinal))
            {
                continue; // skip comment lines
            }

            sw.WriteLine(tmp);
        }
    }

    Console.WriteLine($"Wrote to {outfile}.");
}

Tags: .net, c#, regex, .net-4.8