Place every sentence from a text file into an array, but detect headers/titles
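I need to put every sentence from a text document/string into an array.
The problem is how to handle headers, titles, and other pieces of text that are not part of a sentence and do not end with a period "." that could be used to detect them.
If they go undetected, they end up stuck to the front of the sentence that follows (if I use "." to separate sentences), which I can't allow to happen.
Initially I planned to use: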
contentRefined = content.Replace(" \n", ". ");
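I thought this would remove all empty lines and newlines and put a period at the end of each header, so that headers would be detected and treated as sentences. That would produce ". ." in places, but I could then replace those with nothing.
But it didn't work. It left the empty lines in place and just put a "." at the start of each empty line, and a "." at the start of every paragraph.
I have since tried: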
contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
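This removes the completely empty lines, but it gets me no closer to adding a period at the end of the headers.
I need the sentences and the headers/titles in one array, and I'm not sure whether there is a way to do this without splitting the string on something like ".".
Edit: showing the complete current code for how I get the text from the file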
public void sentenceSplit()
{
    content = File.ReadAllText(@"I:\Project\TLDR\Test Text.txt");
    contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    //contentRefined = content.Replace("\n", ". ");
}
I'm assuming that a 'Header' or 'Title' sits on its own line and does not end with a period.
If so, this may work for you:
var filePath = @"C:\Temp\temp.txt";
var sentences = new List<string>();

using (TextReader reader = new StreamReader(filePath))
{
    while (reader.Peek() >= 0)
    {
        var line = reader.ReadLine();

        if (line.Trim().EndsWith("."))
        {
            line.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries)
                .ToList()
                .ForEach(l => sentences.Add(l.Trim() + "."));
        }
    }
}

// Output sentences to console
sentences.ForEach(Console.WriteLine);
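One caveat with splitting on '.' alone is that it also cuts inside numbers such as 3.50. Here is a minimal sketch of one way around that, assuming sentences end in '.', '!' or '?' followed by whitespace. The names SentenceSplitter and Split are made up for illustration, and abbreviations such as "e.g." would still be split, so treat it as a starting point only:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Split on whitespace that directly follows '.', '!' or '?'.
    // A period inside a number ("3.50") is not followed by whitespace,
    // so it is left alone. The punctuation stays attached to its sentence
    // because the lookbehind does not consume it.
    private static readonly Regex Boundary = new Regex(@"(?<=[.!?])\s+");

    public static string[] Split(string paragraph)
    {
        return Boundary.Split(paragraph.Trim())
                       .Where(s => s.Length > 0)
                       .ToArray();
    }
}
```

For example, SentenceSplitter.Split("It costs 3.50 today. That is fine.") should give two entries, "It costs 3.50 today." and "That is fine.".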
Update
Another approach uses the File.ReadAllLines() method and displays the sentences in a RichTextBox:
private void Form1_Load(object sender, EventArgs e)
{
    var filePath = @"C:\Temp\temp.txt";

    var sentences = File.ReadAllLines(filePath)
        // Only select lines that end in a period
        .Where(l => l.Trim().EndsWith("."))
        // Split each line into sentences (one line may have many sentences)
        .SelectMany(s => s.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries))
        // Trim any whitespace off the ends of the sentence and add a period to the end
        .Select(s => s.Trim() + ".")
        // And finally convert it to a List (or you could do 'ToArray()')
        .ToList();

    // To show each sentence in the list on its own line in the rtb:
    richTextBox1.Text = string.Join("\n", sentences);

    // Or, to show them all one after another, use this line instead
    // (left commented out so it does not overwrite the assignment above):
    // richTextBox1.Text = string.Join(" ", sentences);
}
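If the headers should end up in the same list as the sentences rather than being dropped, which is how I read the question, something along these lines might do. It is only a sketch under the same assumption (a header sits on its own line and doesn't end with a period); FlatSplitter and SplitKeepingHeaders are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlatSplitter
{
    public static List<string> SplitKeepingHeaders(IEnumerable<string> lines)
    {
        var items = new List<string>();

        foreach (var raw in lines)
        {
            var line = raw.Trim();

            // Skip blank lines entirely
            if (line.Length == 0) continue;

            if (!line.EndsWith(".", StringComparison.Ordinal))
            {
                // Assumed to be a header: keep the whole line as one entry
                items.Add(line);
                continue;
            }

            // Paragraph line: split on whitespace that follows a period,
            // which keeps the period attached to each sentence
            items.AddRange(Regex.Split(line, @"(?<=\.)\s+"));
        }

        return items;
    }
}
```

It could be fed from the file with FlatSplitter.SplitKeepingHeaders(File.ReadAllLines(filePath)), and ToArray() on the result gives the array.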
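Update
Now I think I understand your problem, and this is what I would do. First, I would create a few classes to manage all of this. If you break the document down into its parts, you get something like this: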
HEADER
Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".
Header Over an Empty Section
Header over multiple paragraphs
Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".
Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".
Paragraph sentence one. Paragraph sentence two. Paragraph sentence three with a number, like in this quote: ".00 doesn't go as far as it used to".
So I would create the following classes. First, one that represents a 'Section'. A section is defined by a header and zero or more paragraphs:
private class Section
{
    public string Header { get; set; }
    public List<Paragraph> Paragraphs { get; set; }

    public Section()
    {
        Paragraphs = new List<Paragraph>();
    }
}
Then I would define a Paragraph, which contains one or more sentences:
private class Paragraph
{
    public List<string> Sentences { get; set; }

    public Paragraph()
    {
        Sentences = new List<string>();
    }
}
Now I can populate a list of sections to represent the document:
var filePath = @"C:\Temp\temp.txt";
var sections = new List<Section>();
var currentSection = new Section();
var currentParagraph = new Paragraph();

using (TextReader reader = new StreamReader(filePath))
{
    while (reader.Peek() >= 0)
    {
        var line = reader.ReadLine().Trim();

        // Ignore blank lines
        if (string.IsNullOrWhiteSpace(line)) continue;

        if (line.EndsWith("."))
        {
            // This line is a paragraph, so add all the sentences
            // it contains to the current paragraph
            line.Split(new[] {". "}, StringSplitOptions.RemoveEmptyEntries)
                .Select(l => l.Trim().EndsWith(".") ? l.Trim() : l.Trim() + ".")
                .ToList()
                .ForEach(l => currentParagraph.Sentences.Add(l));

            // Now add this paragraph to the current section
            currentSection.Paragraphs.Add(currentParagraph);

            // And set it to a new paragraph for the next loop
            currentParagraph = new Paragraph();
        }
        else if (line.Length > 0)
        {
            // This line is a header, so we're starting a new section.
            // Add the current section to our list (skipping the empty
            // placeholder that exists before the first header) and
            // create a new one, setting this line as the header.
            if (!string.IsNullOrEmpty(currentSection.Header) || currentSection.Paragraphs.Any())
            {
                sections.Add(currentSection);
            }

            currentSection = new Section {Header = line};
        }
    }

    // Finally, if the current section contains any data, add it to the list.
    // Header is null until a header line has been read, so test it null-safely.
    if (!string.IsNullOrEmpty(currentSection.Header) || currentSection.Paragraphs.Any())
    {
        sections.Add(currentSection);
    }
}
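Now we have the whole document in a list of sections, and we know the order, the headers, the paragraphs, and the sentences they contain. As an example of how you can work with it, here is one way to write it back out to a RichTextBox: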
// We can build the document section by section
var documentText = new StringBuilder();

foreach (var section in sections)
{
    // Here we can display headers and paragraphs in a custom way.
    // For example, we can separate all sections with a blank line:
    documentText.AppendLine();

    // If there is a header, we can underline it
    if (!string.IsNullOrWhiteSpace(section.Header))
    {
        documentText.AppendLine(section.Header);
        documentText.AppendLine(new string('-', section.Header.Length));
    }

    // We can mark each paragraph with an arrow (--> )
    foreach (var paragraph in section.Paragraphs)
    {
        documentText.Append("--> ");

        // And write out each sentence, separated by a space
        documentText.AppendLine(string.Join(" ", paragraph.Sentences));
    }
}

// To make the underline approach above look
// half-way decent, we need a fixed-width font
richTextBox1.Font = new Font(FontFamily.GenericMonospace, 9);

// Now set the RichTextBox Text equal to the StringBuilder Text
richTextBox1.Text = documentText.ToString();
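Since the question asks for a single array, the section list can also be flattened back into one. This is a rough sketch, assuming each header should come right before the sentences of its section. The Section and Paragraph classes are repeated here in simplified form only so the snippet stands alone, and SectionFlattener is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;

public class Paragraph
{
    public List<string> Sentences { get; } = new List<string>();
}

public class Section
{
    public string Header { get; set; }
    public List<Paragraph> Paragraphs { get; } = new List<Paragraph>();
}

public static class SectionFlattener
{
    public static string[] Flatten(IEnumerable<Section> sections)
    {
        var items = new List<string>();

        foreach (var section in sections)
        {
            // A section may have no header (text before the first header)
            if (!string.IsNullOrWhiteSpace(section.Header))
            {
                items.Add(section.Header);
            }

            // Then every sentence of every paragraph, in document order
            foreach (var paragraph in section.Paragraphs)
            {
                items.AddRange(paragraph.Sentences);
            }
        }

        return items.ToArray();
    }
}
```

Keeping the structured list around and flattening only when needed means you can still tell headers from sentences later on.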