Implementing Asynchronous Search of a Text File

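I'm creating a Windows Forms application that lets the user specify a text file as a data source, dynamically creates form controls based on the number of columns in the file, and lets the user enter search parameters that are used to search the file when the search button is clicked. All results are written to a new text file.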

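The files this program will search are typically very large (up to 12 GB). My current search approach (read a line, search it, and add it to the results file if it is a hit) works well for reasonably sized files (a few MB or so). With my "large" test file (~2.5 GB), searching the file takes about 12 minutes.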

So my question is: what is the best way to improve performance? After a lot of searching and reading, I know I have several options.

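Since my program logic works more like a stream, I'm leaning towards Dataflow, but I'm not sure how to implement it correctly, or whether there is a better solution. Below is the code for the search button's click event and the search-related functions.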

    'Searches the loaded file
    Private Sub searchBtn_Click(sender As Object, e As EventArgs) Handles searchBtn.Click
        Dim strFileName As String
        Dim didWork As Integer
        Dim searchHits As Integer
        Dim watch As Stopwatch = Stopwatch.StartNew()

        'Prompts user to enter title of file to be created
        exportFD.Title = "Save as. . ."
        exportFD.Filter = "Text Files(*.txt)|*.txt" 'Limits user to only saving as .txt file
        didWork = exportFD.ShowDialog() 'Capture the dialog result so the Cancel check below works

        If didWork = DialogResult.Cancel Then 'Handles if Cancel Button is clicked
            Return
        Else
            strFileName = exportFD.FileName
            Dim writer As New IO.StreamWriter(strFileName, False) 
            Dim reader As New IO.StreamReader(filepath)
            Dim currentLine As String

            'Skip first line of SOURCE text file for search, but use it to write column headers to file
            currentLine = reader.ReadLine()
            Dim columnLine = currentLine.Split(vbTab)

            'First: Insert column names into NEW text file
            For col As Integer = 0 To colCount - 1
                writer.Write(columnLine(col) & vbTab)
            Next
            writer.Write(vbNewLine)

            'Search whole file, line by line
            Do While reader.Peek() >= 0 'Peek returns -1 at end of stream
                'next line
                currentLine = reader.ReadLine()

                'new function:
                If validChromosome(currentLine) Then
                    writer.WriteLine(currentLine)
                    searchHits += 1
                End If
            Loop

            'Close out writer and reader and tell user file was saved
            writer.Close()
            reader.Close()
            searchTxtB.Text = searchHits.ToString()
            watch.Stop()
            MsgBox("Searched in: " + watch.Elapsed.ToString() + " and saved to: " + strFileName)
        End If

    End Sub

    'This function searches through the current line and checks if it follows what the user has searched for
    Private Function validChromosome(chromString As String) As Boolean

        'Split line by delimiter
        Dim readRow() As String = Split(chromString, vbTab)
        validChromosome = True 'Start off as true

        Dim rowLength As Integer = readRow.Length - 1

        'Iterate through string tokens and compare 
        For token As Integer = 0 To rowLength
            Try
                Dim currentGroupBox As GroupBox = criteriaPanel.Controls.Item(token)
                Dim checkedParameter As CheckBox = currentGroupBox.Controls("CheckBox")

                'User wants to search this parameter
                If checkedParameter.Checked = True Then
                    Dim numericRadio As RadioButton = currentGroupBox.Controls("NumericRadio")

                    'Searching by number
                    If numericRadio.Checked = True Then
                        Dim value As Decimal
                        Dim lowerBox As NumericUpDown = currentGroupBox.Controls("NumericBoxLower")
                        Dim upperBox As NumericUpDown = currentGroupBox.Controls("NumericBoxUpper")

                        Dim lowerInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveLowerCheckBox")
                        Dim upperInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveUpperCheckBox")

                        'Try to convert the text to a decimal. 
                        If Not Decimal.TryParse(readRow(token), value) Then
                            validChromosome = False
                            Exit For
                        End If

                        'Not within the given range user inputted for numeric search
                        If Not withinRange(value, lowerBox.Value, upperBox.Value, lowerInclusiveCheck.Checked, upperInclusiveCheck.Checked) Then
                            validChromosome = False
                            Exit For
                        End If

                    Else 'Searching by text
                        Dim textBox As TextBox = currentGroupBox.Controls("TextBox")

                        'If the comparison failed, then this chromosome is not valid. Break out of loop and return false.
                        If Not [String].Equals(readRow(token), textBox.Text.ToString(), StringComparison.OrdinalIgnoreCase) Then

                            validChromosome = False
                            Exit For

                        End If
                    End If

                End If


            Catch ex As Exception

                'Simple error checking.
                MsgBox(ex.ToString)
                validChromosome = False
                Exit For

            End Try
        Next

    End Function

    'Function to check if value is safely in between two values
    Private Function withinRange(value As Decimal, lower As Decimal, upper As Decimal, inclusiveLower As Boolean, inclusiveUpper As Boolean) As Boolean
        withinRange = False
        Dim lowerCheck As Boolean = False
        Dim upperCheck As Boolean = False

        If inclusiveLower Then
            lowerCheck = value >= lower
        Else
            lowerCheck = value > lower
        End If

        If inclusiveUpper Then
            upperCheck = value <= upper
        Else
            upperCheck = value < upper
        End If

        withinRange = lowerCheck And upperCheck

    End Function

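My current theory is that I should create a TransformBlock containing my file-reading method that fills a small buffer (~10 lines), pass that buffer to another TransformBlock that searches the lines and puts the results in a list, and then pass that list to another TransformBlock that writes to the export file.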

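It's quite possible that my search function (validChromosome) isn't very good, so any suggestions for improving it are welcome too. This is my first program, and I know VB.NET may not be the best language for processing text files, but I have to use it. Thanks in advance for your help, and let me know if more information is needed.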

TPL Dataflow seems like a good fit, especially since it easily supports async.

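I would keep the reading sequential, since hard drives don't perform well with concurrent reads. That means there's no need for a block for reading; just read buffers in a while loop and post them to a TDF block. Then you can have a TransformBlock that searches each buffer and passes the results to the next block, which saves them to the file.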

The TransformBlock can run in parallel, so you should set an appropriate MaxDegreeOfParallelism (probably Environment.ProcessorCount).
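
As a rough illustration of that layout, here is a minimal sketch rather than a tuned implementation. It assumes the System.Threading.Tasks.Dataflow NuGet package is referenced. The names SearchFileAsync, isMatch and BatchSize are hypothetical, and the batch size and capacity values are guesses to be measured. isMatch stands in for a test like validChromosome, but because it runs on worker threads it should work from a snapshot of the search criteria taken beforehand instead of reading the form's controls.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module DataflowSearchSketch

    ' Hypothetical batch size; tune it by measuring on real files.
    Private Const BatchSize As Integer = 10000

    ' Reads sourcePath sequentially, filters batches of lines in parallel and
    ' writes matching lines to resultPath. Returns the number of hits.
    Public Async Function SearchFileAsync(sourcePath As String,
                                          resultPath As String,
                                          isMatch As Func(Of String, Boolean)) As Task(Of Integer)
        Dim hits As Integer = 0

        Using reader As New StreamReader(sourcePath),
              writer As New StreamWriter(resultPath, False)

            ' BoundedCapacity keeps the reader from running far ahead of the search.
            Dim searchOptions As New ExecutionDataflowBlockOptions With {
                .MaxDegreeOfParallelism = Environment.ProcessorCount,
                .BoundedCapacity = Environment.ProcessorCount * 2
            }

            ' Search stage: several batches are filtered at the same time.
            ' Output order follows input order (EnsureOrdered defaults to True).
            Dim searchBlock As New TransformBlock(Of List(Of String), List(Of String))(
                Function(lines As List(Of String)) lines.FindAll(Function(candidate) isMatch(candidate)),
                searchOptions)

            ' Write stage: default parallelism is 1, so a single thread at a
            ' time touches the writer and the hit counter.
            Dim writeBlock As New ActionBlock(Of List(Of String))(
                Sub(matches As List(Of String))
                    For Each matchedLine As String In matches
                        writer.WriteLine(matchedLine)
                    Next
                    hits += matches.Count
                End Sub)

            searchBlock.LinkTo(writeBlock, New DataflowLinkOptions With {.PropagateCompletion = True})

            ' Copy the header line straight to the output before the results.
            Dim header As String = reader.ReadLine()
            If header IsNot Nothing Then
                writer.WriteLine(header)
            End If

            ' Sequential read: no block needed, just a loop that sends batches.
            Dim batch As New List(Of String)(BatchSize)
            Dim current As String = reader.ReadLine()
            While current IsNot Nothing
                batch.Add(current)
                If batch.Count = BatchSize Then
                    ' SendAsync waits while the search block is at capacity.
                    Await searchBlock.SendAsync(batch)
                    batch = New List(Of String)(BatchSize)
                End If
                current = reader.ReadLine()
            End While
            If batch.Count > 0 Then
                Await searchBlock.SendAsync(batch)
            End If

            ' Signal the end of input and wait until everything is written.
            searchBlock.Complete()
            Await writeBlock.Completion
        End Using

        Return hits
    End Function

End Module
```

The read loop uses the synchronous ReadLine, so in a Windows Forms application it is probably best to start this function from a background task (for example with Task.Run) rather than directly on the UI thread. Batches much larger than ~10 lines are used here on the assumption that per-message overhead would otherwise outweigh the cost of searching a line.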