CSV file splitting with a specific size
Hi all, I have a function that creates multiple CSV files from a DataTable in smaller chunks, based on a size passed in through an app.config key/value pair.
The code below has two problems:
- I have hard-coded the chunk size as 1 KB; when I pass the value 20, it should create CSV files of 20 KB each. Currently it is creating files of about 5 KB for the same value.
- It does not create any file for the records left over at the end.
Please help me fix this. Thanks!
Code:
public static void CreateCSVFile(DataTable dt, string CSVFileName)
{
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024; //1 KB size
    string CSVPath = ConfigurationManager.AppSettings["CSVPath"];

    StringBuilder FirstLine = new StringBuilder();
    StringBuilder records = new StringBuilder();
    int num = 0;
    int length = 0;

    IEnumerable<string> columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
    FirstLine.AppendLine(string.Join(",", columnNames));
    records.AppendLine(FirstLine.ToString());
    length += records.ToString().Length;

    foreach (DataRow row in dt.Rows)
    {
        //Putting field values in double quotes
        IEnumerable<string> fields = row.ItemArray.Select(field =>
            string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));
        records.AppendLine(string.Join(",", fields));
        length += records.ToString().Length;

        if (length > size)
        {
            //Create a new file
            num++;
            File.WriteAllText(CSVPath + CSVFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv", records.ToString());
            records.Clear();
            length = 0;
            records.AppendLine(FirstLine.ToString());
        }
    }
}
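For reference, both symptoms come from this code itself: length += records.ToString().Length; re-adds the length of the entire accumulated buffer on every row, so the threshold is crossed far earlier than intended, and whatever is still in records when the loop ends is never written out. A minimal sketch of just those two fixes, keeping the rest of the posted method unchanged, could look like the following; the GetChunkPath helper and the character-count size check are assumptions for illustration, not part of the original code:

// Sketch only: the same method with the two fixes marked below.
// Assumes the same usings as the original (System.Data, System.IO, System.Linq, System.Text, System.Configuration).
public static void CreateCSVFileFixed(DataTable dt, string CSVFileName)
{
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]) * 1024; // KB -> approx. characters
    string CSVPath = ConfigurationManager.AppSettings["CSVPath"];

    string header = string.Join(",", dt.Columns.Cast<DataColumn>().Select(c => c.ColumnName));
    var records = new StringBuilder();
    records.AppendLine(header);
    int num = 0;

    foreach (DataRow row in dt.Rows)
    {
        IEnumerable<string> fields = row.ItemArray.Select(field =>
            string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));
        records.AppendLine(string.Join(",", fields));

        // Fix 1: check the buffer length itself instead of re-adding the whole buffer each pass.
        if (records.Length > size)
        {
            num++;
            File.WriteAllText(GetChunkPath(CSVPath, CSVFileName, num), records.ToString());
            records.Clear();
            records.AppendLine(header);
        }
    }

    // Fix 2: flush whatever is left over so the last records also get a file.
    if (records.Length > header.Length + Environment.NewLine.Length)
    {
        num++;
        File.WriteAllText(GetChunkPath(CSVPath, CSVFileName, num), records.ToString());
    }
}

// Hypothetical helper, only to keep the file naming in one place.
private static string GetChunkPath(string csvPath, string csvFileName, int num)
{
    return csvPath + csvFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + "_" + num.ToString("000") + ".csv";
}

Note that StringBuilder.Length counts characters, not bytes, so for non-ASCII data the files can come out somewhat larger than the configured size; the original code makes the same approximation.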
The solution is pretty simple... you don't need to hold all the lines in memory (the way string[] arr = File.ReadAllLines(FilePath); does).
Instead, open a StreamReader on the input file and read it line by line into a buffer. Whenever the buffer exceeds your "threshold size", flush it to disk into its own CSV file. The code should look something like this:
using (var sr = new System.IO.StreamReader(filePath))
{
    var linesBuffer = new List<string>();
    while (sr.Peek() >= 0)
    {
        linesBuffer.Add(sr.ReadLine());
        if (linesBuffer.Count > yourThreshold)
        {
            // TODO: implement function WriteLinesToPartialCsv
            WriteLinesToPartialCsv(linesBuffer);
            // Clear the buffer:
            linesBuffer.Clear();
            // Try forcing c# to clear the memory:
            GC.Collect();
        }
    }
}
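WriteLinesToPartialCsv is left as a TODO in the snippet above. A minimal sketch of what it could look like, using a static part counter so it matches the single-argument call; the field names and the output folder are assumptions, not part of the original answer:

// Assumed helper for the TODO above; partNumber and outputFolder are hypothetical names.
private static int partNumber = 0;
private static readonly string outputFolder = @"C:\temp\csv-parts"; // assumed output location

private static void WriteLinesToPartialCsv(List<string> linesBuffer)
{
    partNumber++;
    string path = System.IO.Path.Combine(outputFolder, "part_" + partNumber.ToString("000") + ".csv");
    System.IO.File.WriteAllLines(path, linesBuffer);
}

Also note that, as written, any lines still sitting in linesBuffer when the while loop ends are never written, so one final WriteLinesToPartialCsv(linesBuffer) call after the loop (when the buffer is not empty) is still needed; otherwise the "no file for the last records" symptom from the question comes back.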
As you can see, by reading the stream line by line (instead of reading the whole CSV input file at once, as your code does), you get much better control over memory.
Use File.ReadLines; as with LINQ, this means deferred execution, so lines are read lazily rather than loading the whole file at once.
foreach (var line in File.ReadLines(FilePath))
{
    // logic here.
}
From MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
Now you can rewrite your method as follows:
public static void SplitCSV(string FilePath, string FileName)
{
    // Read the specified file size from app.config
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024 * 1024; // 1 MB size

    int total = 0;
    int num = 0;
    string FirstLine = null; // header for each new file
    var writer = new StreamWriter(GetFileName(FileName, num));

    // Loop through all source lines
    foreach (var line in File.ReadLines(FilePath))
    {
        if (string.IsNullOrEmpty(FirstLine)) FirstLine = line;

        // Length of the current line
        int length = line.Length;

        // See if adding this line would exceed the size threshold
        if (total + length >= size)
        {
            // Start a new file
            num++;
            total = 0;
            writer.Dispose();
            writer = new StreamWriter(GetFileName(FileName, num));
            writer.WriteLine(FirstLine);
            length += FirstLine.Length;
        }

        // Write the line to the current file
        writer.WriteLine(line);

        // Add the length of the line (in characters) plus the newline to the running size
        total += length;
        total += Environment.NewLine.Length;
    }

    // Flush and close the last part file so the remaining lines are not lost
    writer.Dispose();
}
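GetFileName is referenced above but not shown. One possible implementation, assuming it reuses the CSVPath app setting from the question and a zero-padded part index; the naming pattern is an assumption, not part of the original answer:

// Assumed helper: builds the output path for part number 'num'.
// Reuses the CSVPath app setting from the question; the naming pattern is illustrative.
private static string GetFileName(string fileName, int num)
{
    string csvPath = ConfigurationManager.AppSettings["CSVPath"];
    return Path.Combine(csvPath, fileName + "_" + num.ToString("000") + ".csv");
}

With that in place, a call such as SplitCSV(@"C:\data\input.csv", "output") would produce output_000.csv, output_001.csv, and so on in the configured CSVPath folder.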