Implementing Asynchronous Search of a Text File
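I am creating a Windows Forms application that lets the user specify a text file as a data source, dynamically creates form controls based on the number of columns in the file, and lets the user enter search parameters that are used to search the file when the search button is clicked. All results are written to a new text file.
The files this program will search are typically very large (up to 12 GB). My current search method (read a line, search it, and add it to the results file if it is a hit) works well for files of a reasonable size (a few MB or so). With my "large" test file (~2.5 GB), searching the file takes about 12 minutes.
So my question is: what is the best way to improve performance? After a lot of searching and reading, I know I have the following options:
- Async methods
- Tasks
- TPL Dataflow
- Some combination of these approaches
Since my program logic is more like a stream, I am leaning toward Dataflow, but I am not sure how to implement it properly, or whether there is a better solution. Below is the code for the search button's click event and the functions related to the search.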
    'Searches the loaded file
    Private Sub searchBtn_Click(sender As Object, e As EventArgs) Handles searchBtn.Click
        Dim strFileName As String
        Dim didWork As Integer
        Dim searchHits As Integer
        Dim watch As Stopwatch = Stopwatch.StartNew()
        'Prompts user to enter title of file to be created
        exportFD.Title = "Save as. . ."
        exportFD.Filter = "Text Files(*.txt)|*.txt" 'Limits user to only saving as .txt file
        didWork = exportFD.ShowDialog() 'Keep the dialog result so the Cancel check below can work
        If didWork = DialogResult.Cancel Then 'Handles if Cancel Button is clicked
            Return
        Else
            strFileName = exportFD.FileName
            Dim writer As New IO.StreamWriter(strFileName, False)
            Dim reader As New IO.StreamReader(filepath)
            Dim currentLine As String
            'Skip first line of SOURCE text file for search, but use it to write column headers to file
            currentLine = reader.ReadLine()
            Dim columnLine = currentLine.Split(vbTab)
            'First: Insert column names into NEW text file
            For col As Integer = 0 To colCount - 1
                writer.Write(columnLine(col) & vbTab)
            Next
            writer.Write(vbNewLine)
            'Search whole file, line by line (Peek returns -1 at the end of the stream)
            Do While reader.Peek() >= 0
                'next line
                currentLine = reader.ReadLine()
                'new function:
                If validChromosome(currentLine) Then
                    writer.WriteLine(currentLine)
                    searchHits += 1
                End If
            Loop
            'Close out writer and reader and tell user file was saved
            writer.Close()
            reader.Close()
            searchTxtB.Text = searchHits.ToString()
            watch.Stop()
            MsgBox("Searched in: " + watch.Elapsed.ToString() + " and saved to: " + strFileName)
        End If
    End Sub
    'This function searches through the current line and checks if it follows what the user has searched for
    Private Function validChromosome(chromString As String) As Boolean
        'Split line by delimiter
        Dim readRow() As String = Split(chromString, vbTab)
        validChromosome = True 'Start off as true
        Dim rowLength As Integer = readRow.Length - 1
        'Iterate through string tokens and compare
        For token As Integer = 0 To rowLength
            Try
                Dim currentGroupBox As GroupBox = criteriaPanel.Controls.Item(token)
                Dim checkedParameter As CheckBox = currentGroupBox.Controls("CheckBox")
                'User wants to search this parameter
                If checkedParameter.Checked = True Then
                    Dim numericRadio As RadioButton = currentGroupBox.Controls("NumericRadio")
                    'Searching by number
                    If numericRadio.Checked = True Then
                        Dim value As Decimal
                        Dim lowerBox As NumericUpDown = currentGroupBox.Controls("NumericBoxLower")
                        Dim upperBox As NumericUpDown = currentGroupBox.Controls("NumericBoxUpper")
                        Dim lowerInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveLowerCheckBox")
                        Dim upperInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveUpperCheckBox")
                        'Try to convert the text to a decimal.
                        If Not Decimal.TryParse(readRow(token), value) Then
                            validChromosome = False
                            Exit For
                        End If
                        'Not within the given range user inputted for numeric search
                        If Not withinRange(value, lowerBox.Value, upperBox.Value, lowerInclusiveCheck.Checked, upperInclusiveCheck.Checked) Then
                            validChromosome = False
                            Exit For
                        End If
                    Else 'Searching by text
                        Dim textBox As TextBox = currentGroupBox.Controls("TextBox")
                        'If the comparison failed, then this chromosome is not valid. Break out of loop and return false.
                        If Not [String].Equals(readRow(token), textBox.Text.ToString(), StringComparison.OrdinalIgnoreCase) Then
                            validChromosome = False
                            Exit For
                        End If
                    End If
                End If
            Catch ex As Exception
                'Simple error checking.
                MsgBox(ex.ToString)
                validChromosome = False
                Exit For
            End Try
        Next
    End Function
    'Function to check if value is safely in between two values
    Private Function withinRange(value As Decimal, lower As Decimal, upper As Decimal, inclusiveLower As Boolean, inclusiveUpper As Boolean) As Boolean
        withinRange = False
        Dim lowerCheck As Boolean = False
        Dim upperCheck As Boolean = False
        If inclusiveLower Then
            lowerCheck = value >= lower
        Else
            lowerCheck = value > lower
        End If
        If inclusiveUpper Then
            upperCheck = value <= upper
        Else
            upperCheck = value < upper
        End If
        withinRange = lowerCheck And upperCheck
    End Function
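My current theory is that I should create a TransformBlock that contains my file-reading method and builds a small buffer (~10 lines). That buffer would be passed to another TransformBlock that searches the lines and puts the results in a list, which would then be passed to another TransformBlock that writes to the export file.
It is quite possible that my search function (validChromosome) is not very good, so any suggestions for improving it are welcome too. This is my first program, and I know VB.NET may not be the best language for processing text files, but I was required to use it. Thanks in advance for any help, and let me know if more information is needed.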
TPL Dataflow seems like a good fit, especially since it easily supports async.
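I would keep the reading sequential, since hard drives perform poorly under concurrent reads. You don't need a block for that; just read a buffer in a while loop and post it to a TDF block. Then you can have a TransformBlock that searches that buffer and moves the results on to the next block, which saves them to the file.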
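The layout described above could be sketched roughly as follows. This is only a minimal sketch, not a drop-in replacement: `SearchPipelineSketch`, `SearchFileAsync`, `isMatch`, and `batchSize` are hypothetical names, the batch size of 1000 lines is an arbitrary guess, and the `System.Threading.Tasks.Dataflow` package is assumed to be referenced. It also assumes `isMatch` works on plain values captured before the search starts rather than reading form controls, because the blocks run it on thread-pool threads.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module SearchPipelineSketch
    ' Reads sourcePath sequentially, filters batches of lines in a TransformBlock,
    ' and writes the hits to resultPath in an ActionBlock. Returns the hit count.
    Public Async Function SearchFileAsync(sourcePath As String,
                                          resultPath As String,
                                          isMatch As Func(Of String, Boolean),
                                          Optional batchSize As Integer = 1000) As Task(Of Integer)
        Dim hits As Integer = 0

        Using reader As New StreamReader(sourcePath), writer As New StreamWriter(resultPath, False)
            ' Copy the header line straight through without searching it.
            Dim header As String = reader.ReadLine()
            If header IsNot Nothing Then writer.WriteLine(header)

            ' Search stage: keep only the lines of a batch that satisfy the predicate.
            Dim searchBlock As New TransformBlock(Of List(Of String), List(Of String))(
                Function(batch) batch.Where(isMatch).ToList())

            ' Write stage: a single consumer, so the writer is never used concurrently.
            Dim writeBlock As New ActionBlock(Of List(Of String))(
                Sub(matches)
                    For Each hit As String In matches
                        writer.WriteLine(hit)
                    Next
                    hits += matches.Count
                End Sub)

            searchBlock.LinkTo(writeBlock, New DataflowLinkOptions With {.PropagateCompletion = True})

            ' Sequential read loop: fill a buffer, then hand it to the search block.
            Dim buffer As New List(Of String)(batchSize)
            Dim current As String = reader.ReadLine()
            While current IsNot Nothing
                buffer.Add(current)
                If buffer.Count = batchSize Then
                    Await searchBlock.SendAsync(buffer)
                    buffer = New List(Of String)(batchSize)
                End If
                current = reader.ReadLine()
            End While
            If buffer.Count > 0 Then Await searchBlock.SendAsync(buffer)

            ' Signal the end of input and wait until every batch has been written.
            searchBlock.Complete()
            Await writeBlock.Completion
        End Using

        Return hits
    End Function
End Module
```

From an `Async Sub` click handler this could be called as `Dim hits = Await SearchFileAsync(filepath, strFileName, AddressOf someLineFilter)`, where `someLineFilter` stands in for a version of the search function that does not touch controls. With default options both blocks are unbounded and handle one batch at a time.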
A TransformBlock can run in parallel, so you should set an appropriate MaxDegreeOfParallelism (probably Environment.ProcessorCount).
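As a rough illustration of that setting, the search stage could be created with an `ExecutionDataflowBlockOptions` instance. `ParallelSearchBlockSketch`, `CreateSearchBlock`, and `isMatch` are hypothetical names, and the `BoundedCapacity` value is an assumption added here (not part of the advice above) so that a fast reader cannot queue an entire multi-gigabyte file in memory.

```vbnet
Imports System
Imports System.Linq
Imports System.Threading.Tasks.Dataflow

Module ParallelSearchBlockSketch
    ' Builds a search stage that filters batches of lines on several threads at once.
    Public Function CreateSearchBlock(isMatch As Func(Of String, Boolean)) As TransformBlock(Of String(), String())
        Dim options As New ExecutionDataflowBlockOptions With {
            .MaxDegreeOfParallelism = Environment.ProcessorCount,
            .BoundedCapacity = Environment.ProcessorCount * 4 ' assumed limit on queued batches
        }

        Return New TransformBlock(Of String(), String())(
            Function(batch) batch.Where(isMatch).ToArray(),
            options)
    End Function
End Module
```

With a bounded block, `Post` returns False when the block is full, so the read loop would normally use `Await block.SendAsync(batch)` instead. By default a TransformBlock still emits results in the order the inputs arrived, even when several batches are processed at once, so the output file keeps the source order.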