Place every sentence from a text file into an array, but detect headers/titles

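I need to put every sentence from a text document/string into an array.

The problem is how to handle headers, titles, and other pieces of text that are not part of a sentence and do not end with a period (".") by which they could be detected. If they go undetected, they end up stuck to the front of the sentence that follows them (assuming I use "." to separate sentences), which I can't allow to happen.

Initially I was going to use: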

contentRefined = content.Replace(" \n", ". ");

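I thought this would remove all blank lines and line breaks and put a period at the end of each header, so that headers would be detected and treated as sentences. That would produce ". ." sequences, but I could then replace those with nothing.

It didn't work, though: it left the blank lines in place and merely put a "." at the start of each blank line, as well as a "." at the start of every paragraph.

I have also tried: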

contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);

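This removes the blank lines completely, but it doesn't get me any closer to adding a period at the end of the headers.

I need to get both the sentences and the headers/titles into one array, and I'm not sure whether there is a way to do that without splitting the string on something like ".".

Edit: the full current code, showing how I get the text from the file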

 public void sentenceSplit()
    {
        content = File.ReadAllText(@"I:\Project\TLDR\Test Text.txt");
        contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
        //contentRefined = content.Replace("\n", ". ");
    }

I'm assuming that a 'Header' and a 'Title' each sit on their own line and do not end with a period.

If that's the case, then this may work for you:

var filePath = @"C:\Temp\temp.txt";
var sentences = new List<string>();

using (TextReader reader = new StreamReader(filePath))
{
    while (reader.Peek() >= 0)
    {
        var line = reader.ReadLine();

        if (line.Trim().EndsWith("."))
        {
            line.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries)
                .ToList()
                .ForEach(l => sentences.Add(l.Trim() + "."));
        }
    }
}

// Output sentences to console
sentences.ForEach(Console.WriteLine);

Update

Another approach uses the File.ReadAllLines() method and displays the sentences in a RichTextBox:

private void Form1_Load(object sender, EventArgs e)
{
    var filePath = @"C:\Temp\temp.txt";

    var sentences = File.ReadAllLines(filePath)
        // Only select lines that end in a period
        .Where(l => l.Trim().EndsWith("."))
        // Split each line into sentences (one line may have many sentences)
        .SelectMany(s => s.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries))
        // Trim any whitespace off the ends of the sentence and add a period to the end
        .Select(s => s.Trim() + ".")
        // And finally convert it to a List (or you could do 'ToArray()')
        .ToList();

    // To show each sentence in the list on its own line in the rtb:
    richTextBox1.Text = string.Join("\n", sentences);

    // Or, to show them all one after another, use this line instead:
    // richTextBox1.Text = string.Join(" ", sentences);
}
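
One caveat with splitting on '.': it also cuts a number such as 3.50 in two, and it ignores sentences that end in '?' or '!'. Below is a minimal sketch of a regex-based alternative, assuming a sentence ends with '.', '!' or '?' followed by whitespace. The names SentenceSplitter and SplitSentences are hypothetical, and abbreviations such as "e.g. " would still be split incorrectly.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Splits on whitespace that directly follows '.', '!' or '?'.
    // The lookbehind keeps the terminator attached to its sentence,
    // and a decimal point inside a number is not followed by
    // whitespace, so it does not trigger a split.
    public static string[] SplitSentences(string line)
    {
        return Regex.Split(line.Trim(), @"(?<=[.!?])\s+")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }
}
```

For example, SentenceSplitter.SplitSentences("It costs 3.50 now. Really? Yes!") yields the three entries "It costs 3.50 now.", "Really?" and "Yes!".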

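Update

Now that I think I understand your question, here's what I would do. First, I would create some classes to manage all of this. If you break the document down into its parts, you get something like this: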

HEADER

Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".

Header Over an Empty Section

Header over multiple paragraphs

Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".

Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".

Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".

So I would create the following classes. First, one that represents a 'Section'. A section is defined by a Header and zero or more paragraphs:

private class Section
{
    public string Header { get; set; }

    public List<Paragraph> Paragraphs { get; set; }

    public Section()
    {
        Paragraphs = new List<Paragraph>();
    }
}

Then I would define a Paragraph, which contains one or more sentences:

private class Paragraph
{
    public List<string> Sentences { get; set; }

    public Paragraph()
    {
        Sentences = new List<string>();
    }
}

Now I can populate a list of sections to represent the document:

var filePath = @"C:\Temp\temp.txt";

var sections = new List<Section>();
var currentSection = new Section();
var currentParagraph = new Paragraph();

using (TextReader reader = new StreamReader(filePath))
{
    while (reader.Peek() >= 0)
    {
        var line = reader.ReadLine().Trim();

        // Ignore blank lines
        if (string.IsNullOrWhiteSpace(line)) continue;

        if (line.EndsWith("."))
        {
            // This line is a paragraph, so add all the sentences
            // it contains to the current paragraph
            line.Split(new[] {". "}, StringSplitOptions.RemoveEmptyEntries)
                .Select(l => l.Trim().EndsWith(".") ? l.Trim() : l.Trim() + ".")
                .ToList()
                .ForEach(l => currentParagraph.Sentences.Add(l));

            // Now add this paragraph to the current section
            currentSection.Paragraphs.Add(currentParagraph);

            // And set it to a new paragraph for the next loop
            currentParagraph = new Paragraph();
        }
        else if (line.Length > 0)
        {
            // This line is a header, so we're starting a new section.
            // Add the current section to our list (unless it is still
            // empty) and create a new one with this line as the header.
            if (!string.IsNullOrEmpty(currentSection.Header) || currentSection.Paragraphs.Any())
            {
                sections.Add(currentSection);
            }
            currentSection = new Section {Header = line};
        }
    }

    // Finally, if the current section contains any data, add it to the list.
    // Header is null when the file has no header lines, so avoid calling .Length on it.
    if (!string.IsNullOrEmpty(currentSection.Header) || currentSection.Paragraphs.Any())
    {
        sections.Add(currentSection);
    }
}

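Now we have the whole document in a list of sections, and we know the order, the headers, the paragraphs, and the sentences they contain. As an example of how to work with it, here is one way to write it back out to a RichTextBox:
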
// We can build the document section by section
var documentText = new StringBuilder();

foreach (var section in sections)
{
    // Here we can display headers and paragraphs in a custom way.
    // For example, we can separate all sections with a blank line:
    documentText.AppendLine();

    // If there is a header, we can underline it
    if (!string.IsNullOrWhiteSpace(section.Header))
    {
        documentText.AppendLine(section.Header);
        documentText.AppendLine(new string('-', section.Header.Length));
    }

    // We can mark each paragraph with an arrow (--> )
    foreach (var paragraph in section.Paragraphs)
    {
        documentText.Append("--> ");

        // And write out each sentence, separated by a space
        documentText.AppendLine(string.Join(" ", paragraph.Sentences));
    }
}

// To make the underline approach above look
// half-way decent, we need a fixed-width font
richTextBox1.Font = new Font(FontFamily.GenericMonospace, 9);

// Now set the RichTextBox Text equal to the StringBuilder Text
richTextBox1.Text = documentText.ToString();
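
Since the question asks for the sentences and the headers/titles in one array, the section list can also be flattened back into a single sequence. This is only a sketch: DocumentFlattener and Flatten are hypothetical names, and Section and Paragraph are redeclared here as public classes purely so that the snippet stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Paragraph
{
    public List<string> Sentences { get; } = new List<string>();
}

public class Section
{
    public string Header { get; set; }
    public List<Paragraph> Paragraphs { get; } = new List<Paragraph>();
}

public static class DocumentFlattener
{
    // Returns headers and sentences in document order, one entry each.
    // Sections without a header (for example, text that appears before
    // the first header) contribute only their sentences.
    public static string[] Flatten(IEnumerable<Section> sections)
    {
        var items = new List<string>();

        foreach (var section in sections)
        {
            if (!string.IsNullOrWhiteSpace(section.Header))
            {
                items.Add(section.Header);
            }

            foreach (var paragraph in section.Paragraphs)
            {
                items.AddRange(paragraph.Sentences);
            }
        }

        return items.ToArray();
    }
}
```

Calling DocumentFlattener.Flatten(sections) then gives an array in which each header is its own entry, followed by the sentences of that section, so a header never ends up glued to the front of the next sentence.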