Open and read thousands of files as fast as possible
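I need to open and read thousands of files as fast as possible.
I ran some tests on 13,592 files and found that Method 1 is slightly faster than Method 2. The files are usually between 800 bytes and 4 kB. Is there anything I can do to speed up this I/O-bound process?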
Method 1:
Run 1: 3:05 (don't know what happened here)
Run 2: 1:55
Run 3: 2:06
Run 4: 2:02
Method 2:
Run 1: 2:04
Run 2: 2:08
Run 3: 2:04
Run 4: 2:12
The code is as follows:
using System;
using System.IO;
using System.Linq; // only needed if Method 2 is uncommented

public class FileOpenerUtil
{
    /// <summary>
    /// Reads a file into a single string, retrying until the file can be opened.
    /// </summary>
    /// <param name="fullFilePath">Full path of the file to read</param>
    /// <returns>The file content with '\n' line endings</returns>
    public static string ReadFileToString(string fullFilePath)
    {
        while (true)
        {
            try
            {
                // Method 1
                using (StreamReader sr = File.OpenText(fullFilePath))
                {
                    string fullMessage = "";
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        fullMessage += s + "\n";
                    }
                    return RemoveCarriageReturn(fullMessage);
                }
                // Method 2
                /*using (File.Open(fullFilePath, FileMode.Open, FileAccess.Read, FileShare.Read))
                {
                    Console.WriteLine("Output file {0} ready.", fullFilePath);
                    string[] lines = File.ReadAllLines(fullFilePath);
                    //Every new line under the previous line
                    string fullMessage = lines.Aggregate("", (current, s) => current + s + "\n");
                    return RemoveCarriageReturn(fullMessage);
                    //ninject kernel
                }*/
                // Method 3
            }
            catch (FileNotFoundException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
            catch (IOException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
            catch (UnauthorizedAccessException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
        }
    }

    /// <summary>
    /// Removes '\r' from a string
    /// </summary>
    /// <param name="message">The text that has to be changed</param>
    /// <returns>The changed text</returns>
    private static string RemoveCarriageReturn(string message)
    {
        return message.Replace("\r", "");
    }
}
The files I'm reading are .HL7 files, which look like this:
MSH|^~\&|OAZIS||||20150430235954||ADT^A03|23669166|P|2.3||||||ASCII
EVN|A03|20150430235954||||201504302359
PID|1||6001144000||LastName^FirstName^^^Mevr.|LastName^FirstName|19600114|F|||GStreetName Number^^City^^PostalCode^B^H||09/3444556^^PH~0476519246echtg^^CP||NL|M||28783409^^^^VN|0000000000|60011402843||||||B||||N
PD1||||003847^LastName^FirstName||||||||N|||0
PV1|1|O|FDAG^000^053^001^0^2|NULL||FDAG^000^053^001|003847^LastName^FirstName||006813^LastName^FirstName|1900|00||||||006813^LastName^FirstName|0|28783409^^^^VN|1^20150430|01|||||||||||||||1|1||D|||||201504301336|201504302359
OBX|1|CE|KIND_OF_DIS|RCM|1^1 Op medisch advies
OBX|2|CE|DESTINATION_DIS|RCM|1^1 Terug naar huis
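After opening the file, I parse the string with j4jayant's HL7 parser and close the file.
I have applied all the suggestions from the comments. Method 1 still seems to be the fastest.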
using System;
using System.IO;
using System.Linq; // only needed if Method 2 is uncommented
using System.Text;

public class FileOpenerUtil
{
    /// <summary>
    /// Reads a file into a single string, retrying until the file can be opened.
    /// </summary>
    /// <param name="fullFilePath">Full path of the file to read</param>
    /// <returns>The file content with '\n' line endings</returns>
    public static string ReadFileToString(string fullFilePath)
    {
        while (true)
        {
            try
            {
                // Method 1
                using (StreamReader sr = File.OpenText(fullFilePath))
                {
                    string s;
                    StringBuilder message = new StringBuilder();
                    while ((s = sr.ReadLine()) != null)
                    {
                        message.Append(s).Append("\n");
                    }
                    return RemoveCarriageReturn(message.ToString());
                }
                // Method 2
                /*
                string[] lines = File.ReadAllLines(fullFilePath);
                string fullMessage = lines.Aggregate("", (current, s) => current + s + "\n");
                return RemoveCarriageReturn(fullMessage);*/
                // Method 3
                /*
                string s = File.ReadAllText(fullFilePath);
                return RemoveCarriageReturn(s);*/
            }
            catch (FileNotFoundException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
            catch (IOException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
            catch (UnauthorizedAccessException ex)
            {
                Console.WriteLine("Output file {0} not yet ready ({1})", fullFilePath, ex.Message);
            }
        }
    }

    /// <summary>
    /// Removes '\r' from a string
    /// </summary>
    /// <param name="message">The text that has to be changed</param>
    /// <returns>The changed text</returns>
    private static string RemoveCarriageReturn(string message)
    {
        return message.Replace("\r", "");
    }
}
I used 50,000 files of varying sizes (500 to 1024 bytes).
Test 1: your Method 1, StreamReader sr = File.OpenText(fullFilePath); sr.ReadLine();
Seconds: 3.4658937968113
Test 2: your Method 2, File.ReadAllLines(fullFilePath)
Seconds: 5.5008349279222
Test 3: File.ReadAllText(fullFilePath);
Seconds: 3.30782645637133
Test 4: BinaryReader b = new BinaryReader; b.ReadString();
Seconds: 5.85779941381009
Test 5: Windows FileReader
(https://msdn.microsoft.com/en-us/library/2d9wy99d.aspx)
Seconds: 3.07036554759848
Test 6: StreamReader sr = File.OpenText(fullFilePath); sr.ReadToEnd();
Seconds: 3.31464109255517
Test 7: StreamReader sr = File.OpenText(fullFilePath); sr.ReadToEnd();
Seconds: 3.3364683664508
Test 8: StreamReader sr = File.OpenText(fullFilePath); sr.ReadLine();
Seconds: 3.40426888695317
Test 9: FileStream + BufferedStream + StreamReader
Seconds: 4.02871911079061
Test 10: Parallel.For using code File.ReadAllText(fullFilePath);
Seconds: 0.89543632235447
The best single-threaded results are Test 5 and Test 3.
Test 3 uses: File.ReadAllText(fullFilePath);
Test 5 uses the Windows FileReader
(https://msdn.microsoft.com/en-us/library/2d9wy99d.aspx)
If you can use threads, Test 10 is by far the fastest.
Example:
    int maxFiles = 50000;
    Parallel.For(0, maxFiles, x =>
    {
        // Use the loop index x: a shared counter incremented inside the lambda would be a data race
        Util.Method1("readtext_" + x + ".txt"); // your read method
    });
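For reference, a slightly fuller sketch of the same idea. It assumes all files sit in one directory and that the contents are wanted keyed by path. `ParallelFileReader`, `ReadAll`, and the search pattern are illustrative names, not something taken from the question.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class ParallelFileReader
{
    // Hypothetical helper: reads every matching file in parallel and
    // returns a path -> content map. Directory layout and pattern are assumptions.
    public static IDictionary<string, string> ReadAll(string directory, string pattern)
    {
        // ConcurrentDictionary is safe to write to from several threads at once.
        var results = new ConcurrentDictionary<string, string>();

        // Directory.EnumerateFiles yields paths lazily instead of building a full array first.
        Parallel.ForEach(Directory.EnumerateFiles(directory, pattern), path =>
        {
            // Same normalization as RemoveCarriageReturn in the question.
            results[path] = File.ReadAllText(path).Replace("\r", "");
        });

        return results;
    }
}
```

Called as, for example, `ParallelFileReader.ReadAll(folder, "*.hl7")`. Whether this helps on a given machine depends on the storage, so it is worth measuring rather than assuming.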
When the standby list is emptied with RAMMap:
Test 1: your Method 1, StreamReader sr = File.OpenText(fullFilePath); sr.ReadLine();
Seconds: 15.1785750622961
Test 2: your Method 2, File.ReadAllLines(fullFilePath)
Seconds: 17.650864469466
Test 3: File.ReadAllText(fullFilePath);
Seconds: 14.8985912878328
Test 4: BinaryReader b = new BinaryReader; b.ReadString();
Seconds: 18.1603815767866
Test 5: Windows FileReader
Seconds: 14.5059765845334
Test 6: StreamReader sr = File.OpenText(fullFilePath); sr.ReadToEnd();
Seconds: 14.8649786336991
Test 7: StreamReader sr = File.OpenText(fullFilePath); sr.ReadToEnd();
Seconds: 14.830567197641
Test 8: StreamReader sr = File.OpenText(fullFilePath); sr.ReadLine();
Seconds: 14.9965866575751
Test 9: FileStream + BufferedStream + StreamReader
Seconds: 15.7336450516575
Test 10: Parallel.For() using code File.ReadAllText(fullFilePath);
Seconds: 4.11343060325439
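The timings above do not come with the harness that produced them. One plausible way to take comparable measurements is sketched below, assuming a `Stopwatch` around a loop over the same file list. `ReadBenchmark` and `Measure` are hypothetical names, and this is not necessarily how the numbers above were obtained.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ReadBenchmark
{
    // Hypothetical helper: times one read strategy over all given files
    // and returns the elapsed wall-clock time in seconds.
    public static double Measure(IEnumerable<string> paths, Func<string, string> read)
    {
        long totalChars = 0; // use each result so the reads cannot be skipped
        var stopwatch = Stopwatch.StartNew();
        foreach (var path in paths)
        {
            totalChars += read(path).Length;
        }
        stopwatch.Stop();
        Console.WriteLine("Read {0} characters", totalChars);
        return stopwatch.Elapsed.TotalSeconds;
    }
}
```

For example, `ReadBenchmark.Measure(System.IO.Directory.EnumerateFiles(folder), System.IO.File.ReadAllText)` times Test 3. As the two result sets show, the state of the file cache changes the numbers by roughly a factor of four, so each strategy should be measured under the same cache conditions.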