CSV file splitting with a specific size

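Hi all, I have a function that creates multiple CSV files from a DataTable, in smaller chunks based on a size passed in through an app.config key/value pair.

The code below has two problems:

  1. The base unit is hard-coded as 1 KB, so when I pass the value 20 it should create 20 KB CSV files. Currently it creates files of about 5 KB for that value.
  2. It does not create any file for the records left over at the end.

Please help me fix this. Thanks!

Code: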

public static void CreateCSVFile(DataTable dt, string CSVFileName)
    {

        int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
        size *= 1024; //1 KB size
        string CSVPath = ConfigurationManager.AppSettings["CSVPath"];

        StringBuilder FirstLine = new StringBuilder();
        StringBuilder records = new StringBuilder();

        int num = 0;
        int length = 0;

        IEnumerable<string> columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
        FirstLine.AppendLine(string.Join(",", columnNames));
        records.AppendLine(FirstLine.ToString());

        length += records.ToString().Length;

        foreach (DataRow row in dt.Rows)
        {
            //Putting field values in double quotes
            IEnumerable<string> fields = row.ItemArray.Select(field =>
                string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));

            records.AppendLine(string.Join(",", fields));
            length += records.ToString().Length;

            if (length > size)
            {
                //Create a new file
                num++;
                File.WriteAllText(CSVPath + CSVFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv", records.ToString());
                records.Clear();
                length = 0;
                records.AppendLine(FirstLine.ToString());
            }

        }            
    }  
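Both symptoms can be traced in the code above. `length += records.ToString().Length` adds the size of the whole buffer on every row, so the counter grows much faster than the buffer itself and the threshold is hit early. And nothing is written after the loop ends, so the final partial buffer is lost. The following is a minimal sketch of one way to address both, not a drop-in replacement: the names `CsvChunkWriter`, `CreateCsvFiles`, `maxChars` and `outputDir` are hypothetical, the limit and path are passed as parameters instead of being read from app.config, and sizes are counted in characters rather than bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;

public static class CsvChunkWriter
{
    // Hypothetical variant of CreateCSVFile; returns the paths of the files it wrote.
    public static List<string> CreateCsvFiles(DataTable dt, string outputDir, string baseName, int maxChars)
    {
        var written = new List<string>();
        // Header kept without a trailing newline, so AppendLine adds exactly one
        string header = string.Join(",", dt.Columns.Cast<DataColumn>().Select(c => c.ColumnName));
        string stamp = DateTime.Now.ToString("yyyyMMddHHmmss");

        var records = new StringBuilder();
        records.AppendLine(header);
        int rowsInBuffer = 0;

        foreach (DataRow row in dt.Rows)
        {
            // Put field values in double quotes, doubling embedded quotes
            IEnumerable<string> fields = row.ItemArray.Select(f =>
                string.Concat("\"", f.ToString().Replace("\"", "\"\""), "\""));
            records.AppendLine(string.Join(",", fields));
            rowsInBuffer++;

            // records.Length is already the total buffered size; no running sum is needed
            if (records.Length >= maxChars)
            {
                written.Add(Flush(records, outputDir, baseName, stamp, written.Count + 1));
                records.AppendLine(header);
                rowsInBuffer = 0;
            }
        }

        // Write the rows left over after the last full chunk
        if (rowsInBuffer > 0)
        {
            written.Add(Flush(records, outputDir, baseName, stamp, written.Count + 1));
        }
        return written;
    }

    // Writes the buffer to a numbered file and empties it
    private static string Flush(StringBuilder records, string outputDir, string baseName, string stamp, int num)
    {
        string path = Path.Combine(outputDir, baseName + stamp + num.ToString("_000") + ".csv");
        File.WriteAllText(path, records.ToString());
        records.Clear();
        return path;
    }
}
```

With a setting of 20, the caller would pass `20 * 1024` as `maxChars`. Each file can exceed the limit by up to one row, because the check runs after the row is appended.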

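The solution is quite simple... you don't need to hold all the lines in memory (as is done with string[] arr = File.ReadAllLines(FilePath);).

Instead, create a StreamReader on the input file and read it line by line into a line buffer. When the buffer exceeds your "threshold size", write it to disk as a single CSV file. The code should look something like this: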

using (var sr = new System.IO.StreamReader(filePath))
{
    var linesBuffer = new List<string>();
    while (sr.Peek() >= 0)
    {
        linesBuffer.Add(sr.ReadLine());
        // Note: yourThreshold is a number of lines here, not a size in bytes
        if (linesBuffer.Count > yourThreshold)
        {
            // TODO: implement function WriteLinesToPartialCsv
            WriteLinesToPartialCsv(linesBuffer);
            // Clear the buffer:
            linesBuffer.Clear();
        }
    }
    // Write whatever is left after the last full batch:
    if (linesBuffer.Count > 0)
    {
        WriteLinesToPartialCsv(linesBuffer);
    }
}
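The snippet leaves `WriteLinesToPartialCsv` as a TODO. One possible shape for it is sketched below, assuming each batch goes to a sequentially numbered file. The class name `PartialCsvWriter`, the constructor parameters and the file naming pattern are all hypothetical, and this version does not repeat the header line in each part.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: keeps a part counter so each batch gets its own file.
public class PartialCsvWriter
{
    private readonly string _outputDir;
    private readonly string _baseName;
    private int _part;

    public PartialCsvWriter(string outputDir, string baseName)
    {
        _outputDir = outputDir;
        _baseName = baseName;
    }

    // Writes one batch of lines to the next numbered file and returns its path.
    public string WriteLinesToPartialCsv(IReadOnlyCollection<string> lines)
    {
        _part++;
        string path = Path.Combine(_outputDir, $"{_baseName}_{_part:000}.csv");
        File.WriteAllLines(path, lines);
        return path;
    }
}
```

The batch is written before the caller clears the buffer, so passing the same `List<string>` on every call is safe.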

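As you can see, by reading the stream line by line (instead of reading the whole CSV input file, as your code does), you have better control over memory.

Use File.ReadLines. It returns a lazily evaluated sequence (deferred execution, as in LINQ), so each line is read only when you enumerate it.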

foreach(var line in File.ReadLines(FilePath))
{
   // logic here.
}

From MSDN:

The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.

Now you can rewrite your method as follows.

    public static void SplitCSV(string FilePath, string FileName)
    {
        // Read the configured file size (in MB)
        int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);

        size *= 1024 * 1024;  // 1 MB unit

        int total = 0;
        int num = 0;
        string FirstLine = null;   // header, repeated at the top of every new file
        // GetFileName (not shown) builds the name of output file number num
        var writer = new StreamWriter(GetFileName(FileName, num));

        try
        {
            // Loop through all source lines
            foreach (var line in File.ReadLines(FilePath))
            {
                if (string.IsNullOrEmpty(FirstLine)) FirstLine = line;
                // Length of current line (in characters, not bytes)
                int length = line.Length;

                // See if adding this line would exceed the size threshold
                if (total + length >= size)
                {
                    // Create a new file and repeat the header
                    num++;
                    total = 0;
                    writer.Dispose();
                    writer = new StreamWriter(GetFileName(FileName, num));
                    writer.WriteLine(FirstLine);
                    length += FirstLine.Length;
                }

                // Write the line to the current file
                writer.WriteLine(line);

                // Add length of line to running size
                total += length;

                // Add size of newlines
                total += Environment.NewLine.Length;
            }
        }
        finally
        {
            // Flush and close the last file so its contents are not lost
            writer.Dispose();
        }
    }
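The method above calls `GetFileName` without showing it, and it measures lines with `line.Length`, which counts characters rather than bytes. Below is a hedged sketch of both pieces. `SplitHelpers`, the `_000` numbering pattern and `LineSizeInBytes` are hypothetical names, not part of the answer. `StreamWriter` writes UTF-8 by default, so the byte count is taken with the encoding the caller passes in.

```csharp
using System;
using System.IO;
using System.Text;

public static class SplitHelpers
{
    // Hypothetical: turns "export.csv" and 3 into "export_003.csv", keeping any directory part.
    public static string GetFileName(string fileName, int num)
    {
        string dir = Path.GetDirectoryName(fileName) ?? "";
        string stem = Path.GetFileNameWithoutExtension(fileName);
        return Path.Combine(dir, $"{stem}_{num:000}.csv");
    }

    // Size of one written line in bytes, including the newline that WriteLine appends.
    public static int LineSizeInBytes(string line, Encoding encoding)
    {
        return encoding.GetByteCount(line) + encoding.GetByteCount(Environment.NewLine);
    }
}
```

For plain ASCII data the character count and the UTF-8 byte count are the same, so the difference only matters when the CSV contains non-ASCII text.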