How to read the last line of a text file using Delphi
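I need to read the last line of some very large text files (to get the timestamp from the data). TStringList is the simple approach, but it returns an out-of-memory error. I am trying to use Seek and BlockRead, but the characters in the buffer are all nonsense. Is this something to do with Unicode?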
Function TForm1.ReadLastLine2(FileName: String): String;
var
FileHandle: File;
s,line: string;
ok: 0..1;
Buf: array[1..8] of Char;
k: longword;
i,ReadCount: integer;
begin
AssignFile (FileHandle,FileName);
Reset (FileHandle); // or for binary files: Reset (FileHandle,1);
ok := 0;
k := FileSize (FileHandle);
Seek (FileHandle, k-1);
s := '';
while ok<>1 do begin
BlockRead (FileHandle, buf, SizeOf(Buf)-1, ReadCount); //BlockRead ( var FileHandle : File; var Buffer; RecordCount : Integer {; var RecordsRead : Integer} ) ;
if ord (buf[1]) <>13 then //Arg to integer
s := s + buf[1]
else
ok := ok + 1;
k := k-1;
seek (FileHandle,k);
end;
CloseFile (FileHandle);
// Reverse the order in the line read
setlength (line,length(s));
for i:=1 to length(s) do
line[length(s) - i+1 ] := s[i];
Result := Line;
end;
Based on www.delphipages.com/forum/showthread.php?t=102965
The test file is a simple CSV I created in Excel (it is not the 100 MB file I will ultimately need to read).
a,b,c,d,e,f,g,h,i,j,blank
A,B,C,D,E,F,G,H,I,J,blank
1,2,3,4,5,6,7,8,9,0,blank
Mary,had,a,little,lamb,His,fleece,was,white,as,snow
And,everywhere,that,Mary,went,The,lamb,was,sure,to,go
I just thought of a new solution.
Again, there may be better ones, but this is the best I have come up with.
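function GetLastLine(textFilePath: string): string;
var
  list: TStringList;
begin
  Result := '';
  list := TStringList.Create;
  try
    list.LoadFromFile(textFilePath); // loads the whole file into memory
    if list.Count > 0 then           // guard against an empty file
      Result := list[list.Count-1];
  finally
    list.Free;
  end;
end;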
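Your Char type is two bytes, so the buffer is 16 bytes. You then use BlockRead to read SizeOf(Buf)-1 bytes into it and check whether the character in the first 2 bytes equals #13.
The SizeOf(Buf)-1 is dodgy (where does that -1 come from?). The rest is valid, but only if your input file is UTF-16.
Also, each time you read 8 (or 16) characters, but you compare only one and then seek again. That is not very logical either.
If your encoding is not UTF-16, I suggest changing the buffer element type to AnsiChar and removing the -1.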
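The advice above could be sketched roughly as follows. This is only an illustration, not the poster's code: `ReadLastLineAnsi` and `SaveSample` are hypothetical names, the file is assumed to use a single-byte encoding and to be smaller than 2 GB, and the untyped file is opened with `Reset(F, 1)` so that `FileSize`, `Seek` and `BlockRead` all count bytes (a plain `Reset` on an untyped file uses 128-byte records).

```delphi
program BlockReadSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical demo helper: writes raw single-byte text to a file.
procedure SaveSample(const FileName: string; const Data: AnsiString);
var
  F: file;
begin
  AssignFile(F, FileName);
  Rewrite(F, 1);
  try
    if Length(Data) > 0 then
      BlockWrite(F, Data[1], Length(Data));
  finally
    CloseFile(F);
  end;
end;

// Hypothetical helper: walks backwards from the end of the file, one byte
// at a time. Assumes a single-byte encoding and a file smaller than 2 GB.
function ReadLastLineAnsi(const FileName: string): AnsiString;
var
  F: file;
  C: AnsiChar;            // exactly one byte, unlike Char in Unicode Delphi
  Offset, Got: Integer;
  OldMode: Byte;
  InLine: Boolean;
begin
  Result := '';
  AssignFile(F, FileName);
  OldMode := FileMode;
  FileMode := fmOpenRead; // make Reset open the file read-only
  try
    Reset(F, 1);          // record size 1: all counts below are bytes
  finally
    FileMode := OldMode;
  end;
  try
    Offset := FileSize(F);
    InLine := False;      // becomes True once we are past the trailing EOL
    while Offset > 0 do
    begin
      Dec(Offset);
      Seek(F, Offset);
      BlockRead(F, C, 1, Got);
      if Got <> 1 then
        Break;
      if C in [#10, #13] then
      begin
        if InLine then
          Break;          // this EOL ends the previous line: done
      end
      else
      begin
        InLine := True;
        Result := C + Result;
      end;
    end;
  finally
    CloseFile(F);
  end;
end;

begin
  SaveSample('blockread_demo.csv', 'a,b,c'#13#10'1,2,3'#13#10);
  Writeln(ReadLastLineAnsi('blockread_demo.csv'));
end.
```

Reading one byte per call is slow in general, but here it only touches the last line. For long lines, a larger buffer scanned in memory would be the better choice.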
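Following kopiks' suggestion, I worked out how to do it with TFileStream. It works with the simple test file, although it may need some tweaks when I use it on a variety of CSV files. Also, I make no claim that this is the most efficient approach.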
procedure TForm1.Button6Click(Sender: TObject);
Var
StreamSize, ApproxNumRows : Integer;
TempStr : String;
begin
if OpenDialog1.Execute then begin
TempStr := ReadLastLineOfTextFile(OpenDialog1.FileName,StreamSize, ApproxNumRows);
// TempStr := ReadFileStream('c:\temp\CSVTestFile.csv');
ShowMessage ('approximately '+ IntToStr(ApproxNumRows)+' Rows');
ListBox1.Items.Add(TempStr);
end;
end;
function TForm1.ReadLastLineOfTextFile(const FileName: String; var StreamSize, ApproxNumRows : Integer): String;
const
  MAXLINELENGTH = 256;
var
  Stream: TFileStream;
  BlockSize, CharCount : Integer;
  Hash13Found : Boolean;
  Buffer : array [0..MAXLINELENGTH] of AnsiChar;
begin
  Hash13Found := False;
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    StreamSize := Stream.Size;
    if StreamSize < MAXLINELENGTH then
      BlockSize := StreamSize
    else
      BlockSize := MAXLINELENGTH;
    // for CharCount := 0 to Length(Buffer)-1 do begin
    //   Buffer[CharCount] := #0; // zeroing the buffer can aid diagnostics
    // end;
    CharCount := 0;
    repeat
      Stream.Seek(-(CharCount+3), soFromEnd); // +3 misses out the #0,#10,#13 at the end of the file
      Stream.Read(Buffer[CharCount], 1);
      Result := String(Buffer[CharCount]) + Result;
      if Buffer[CharCount] = #13 then
        Hash13Found := True;
      Inc(CharCount);
    until Hash13Found or (CharCount = BlockSize);
  finally
    Stream.Free; // release the file handle; the first version leaked it
  end;
  ShowMessage(Result);
  ApproxNumRows := Round(StreamSize / CharCount);
end;
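You really do have to read the file in large chunks, from the tail towards the head.
Since it is too large to fit in memory, reading it line by line from start to end would be very slow. With ReadLn it would be twice as slow again.
You also have to be prepared for the last line either ending with an EOL or not.
Personally, I would also account for three possible EOL sequences:
- CR/LF, aka #13#10 = ^M^J: DOS/Windows style
- CR without LF, just #13 = ^M: classic MacOS files
- LF without CR, just #10 = ^J: UNIX style, including MacOS version 10
If you are sure your CSV files will only ever be generated by native Windows programs, it is safe to assume full CR/LF. But if other sources are possible (Java programs, non-Windows platforms, mobile apps), I would be less sure. Granted, a pure CR without LF is the least likely of all the cases.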
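To make the three EOL conventions concrete, here is a small sketch that works on a chunk already read from the end of the file. `LastLineOf` is a hypothetical name, and the chunk is assumed to be single-byte text that contains at least the whole last line. It strips at most one trailing EOL (LF, CR, or CR followed by LF) and then walks back to whichever EOL character ends the previous line.

```delphi
program EolSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the last line found in Tail, where Tail is
// assumed to hold the final bytes of a single-byte-encoded text file.
function LastLineOf(const Tail: AnsiString): AnsiString;
var
  Stop, Start: Integer;
begin
  Stop := Length(Tail);
  // Drop one trailing EOL: LF alone, CR alone, or CR followed by LF.
  if (Stop > 0) and (Tail[Stop] = #10) then
    Dec(Stop);
  if (Stop > 0) and (Tail[Stop] = #13) then
    Dec(Stop);
  // Walk back to the EOL character that ends the previous line.
  Start := Stop;
  while (Start > 0) and not (Tail[Start] in [#10, #13]) do
    Dec(Start);
  Result := Copy(Tail, Start + 1, Stop - Start);
end;

begin
  Writeln(LastLineOf('a,b'#13#10'c,d'#13#10)); // Windows style
  Writeln(LastLineOf('a,b'#10'c,d'));          // UNIX style, no trailing EOL
  Writeln(LastLineOf('a,b'#13'c,d'#13));       // classic MacOS style
end.
```

If the scan reaches the start of the chunk without finding an EOL, the caller cannot tell whether the line is complete, so it would need to read a larger chunk and try again.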
uses System.SysUtils, System.Types, System.IOUtils, System.Math, System.Classes;
type FileChar = AnsiChar; FileString = AnsiString; // for non-Unicode files
// type FileChar = WideChar; FileString = UnicodeString;// for UTF16 and UCS-2 files
const FileCharSize = SizeOf(FileChar);
// somewhere later in the code add: Assert(FileCharSize = SizeOf(FileString[1]));
function ReadLastLine(const FileName: String): FileString; overload; forward;
const PageSize = 4*1024;
// the minimal read atom of most modern HDD and the memory allocation atom of Win32
// since the chances your file would have lines longer than 4Kb are very small - I would not increase it to several atoms.
function ReadLastLine(const Lines: TStringDynArray): FileString; overload;
var i: integer;
begin
Result := '';
i := High(Lines);
if i < Low(Lines) then exit; // empty array - empty file
Result := Lines[i];
if Result > '' then exit; // we got the line
Dec(i); // skip the empty ghost line, in case last line was CRLF-terminated
if i < Low(Lines) then exit; // that ghost was the only line in the empty file
Result := Lines[i];
end;
// scan for EOLs in not-yet-scanned part
function FindLastLine(buffer: TArray<FileChar>; const OldRead : Integer;
const LastChunk: Boolean; out Line: FileString): boolean;
var i, tailCRLF: integer; c: FileChar;
begin
Result := False;
if Length(Buffer) = 0 then exit;
i := High(Buffer);
tailCRLF := 0; // test for trailing CR/LF
if Buffer[i] = ^J then begin // LF - single, or after CR
Dec(i);
Inc(tailCRLF);
end;
if (i >= Low(Buffer)) and (Buffer[i] = ^M) then begin // CR, alone or before LF
Inc(tailCRLF);
end;
i := High(Buffer) - Max(OldRead, tailCRLF);
if i - Low(Buffer) < 0 then exit; // no new data to read - results would be like before
if OldRead > 0 then Inc(i); // the CR/LF pair could be sliced between new and previous buffer - so need to start a bit earlier
for i := i downto Low(Buffer) do begin
c := Buffer[i];
if (c=^J) or (c=^M) then begin // found EOL
SetString( Line, @Buffer[i+1], High(Buffer) - tailCRLF - i);
exit(True);
end;
end;
// we did not find non-terminating EOL in the buffer (except maybe trailing),
// now we should ask for more file content, if there is still left any
// or take the entire file (without trailing EOL if any)
if LastChunk then begin
SetString( Line, @Buffer[ Low(Buffer) ], Length(Buffer) - tailCRLF);
Result := true;
end;
end;
function ReadLastLine(const FileName: String): FileString; overload;
var Buffer, tmp: TArray<FileChar>;
// dynamic arrays - eases memory management and protect from stack corruption
FS: TFileStream; FSize, NewPos: Int64;
OldRead, NewLen : Integer; EndOfFile: boolean;
begin
Result := '';
FS := TFile.OpenRead(FileName);
try
FSize := FS.Size;
if FSize <= PageSize then begin // small file, we can be lazy!
FreeAndNil(FS); // free the handle and avoid double-free in finally
Result := ReadLastLine( TFile.ReadAllLines( FileName, TEncoding.ANSI ));
// or TEncoding.Unicode for UTF-16 LE files
// warning - TFile is not share-aware, if the file is being written to by another app
exit;
end;
SetLength( Buffer, PageSize div FileCharSize);
OldRead := 0;
repeat
NewPos := FSize - Length(Buffer)*FileCharSize;
EndOfFile := NewPos <= 0;
if NewPos < 0 then NewPos := 0;
FS.Position := NewPos;
FS.ReadBuffer( Buffer[Low(Buffer)], (Length(Buffer) - OldRead)*FileCharSize);
if FindLastLine(Buffer, OldRead, EndOfFile, Result) then
exit; // done !
tmp := Buffer; Buffer := nil; // flip-flop: preparing to broaden our mouth
OldRead := Length(tmp); // need not to re-scan the tail again and again when expanding our scanning range
NewLen := Min( 2*Length(tmp), FSize div FileCharSize );
SetLength(Buffer, NewLen); // this may trigger EOutOfMemory...
Move( tmp[Low(tmp)], Buffer[High(Buffer)-OldRead+1], OldRead*FileCharSize);
tmp := nil; // free old buffer
until EndOfFile;
finally
FS.Free;
end;
end;
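PS. Note one more special case: if you use Unicode characters (two-byte characters) and are given a file of odd length (3 bytes, 5 bytes, etc.), you will never be able to scan the leading single byte (half a wide char). Perhaps you should add an extra guard there, such as Assert( 0 = FS.Size mod FileCharSize)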
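One possible shape for such a guard is sketched below. `HasWholeChars` and `RequireWholeChars` are hypothetical names. The second one raises an exception instead of asserting, because Assert calls are removed when the project is compiled with assertions disabled.

```delphi
program WholeCharGuard;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True when a file of SizeInBytes bytes consists only
// of whole characters of CharSize bytes each.
function HasWholeChars(SizeInBytes: Int64; CharSize: Integer): Boolean;
begin
  Result := (CharSize > 0) and (SizeInBytes mod CharSize = 0);
end;

// Hypothetical helper: same check, but it raises so that it also runs in
// builds compiled with assertions switched off.
procedure RequireWholeChars(SizeInBytes: Int64; CharSize: Integer);
begin
  if not HasWholeChars(SizeInBytes, CharSize) then
    raise Exception.CreateFmt(
      'File size %d is not a multiple of the character size %d',
      [SizeInBytes, CharSize]);
end;

begin
  Writeln(HasWholeChars(6, SizeOf(WideChar)));
  Writeln(HasWholeChars(5, SizeOf(WideChar)));
end.
```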
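PPS. As a rule of thumb, you are better off keeping these functions out of the form class. Why mix them? In general you should split concerns into small pieces. Reading a file has nothing to do with user interaction, so it is better offloaded to a separate unit. You will then be able to use that unit's functions from one form or ten forms, in the main thread or in a multithreaded application. They are like LEGO parts: small and independent, which gives you flexibility.
PPPS. Another approach here would be to use memory-mapped files. Google for MMF implementations for Delphi and for articles on the benefits and problems of the MMF approach. Personally I think rewriting the code above to use MMF would simplify it greatly, removing several "special cases" and the troublesome memory-copying flip-flop. OTOH it would require you to be very strict with pointer arithmetic.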
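A rough sketch of the memory-mapped idea, under stated assumptions: Windows only, single-byte text, and a file small enough to map in one view (which a 32-bit process may not manage for very large files). `LastLineInMemory` and `ReadLastLineMapped` are hypothetical names. The pointer arithmetic is kept in one small routine so it can be checked on its own.

```delphi
program MmfSketch;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes;

// Hypothetical helper: finds the last line in Size bytes of single-byte
// text starting at Base. All pointer arithmetic lives in this one routine.
function LastLineInMemory(Base: PAnsiChar; Size: NativeInt): AnsiString;
var
  Stop, Start: NativeInt;
begin
  Stop := Size;   // index one past the last byte of the line
  if (Stop > 0) and (Base[Stop - 1] = #10) then
    Dec(Stop);    // trailing LF
  if (Stop > 0) and (Base[Stop - 1] = #13) then
    Dec(Stop);    // trailing CR
  Start := Stop;
  while (Start > 0) and not (Base[Start - 1] in [#10, #13]) do
    Dec(Start);   // ends just after the previous line's EOL
  SetString(Result, Base + Start, Integer(Stop - Start));
end;

// Hypothetical helper: maps the whole file read-only and scans it in place.
function ReadLastLineMapped(const FileName: string): AnsiString;
var
  FS: TFileStream;
  hMap: THandle;
  View: Pointer;
begin
  Result := '';
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    if FS.Size = 0 then
      Exit;       // a zero-length file cannot be mapped
    hMap := CreateFileMapping(FS.Handle, nil, PAGE_READONLY, 0, 0, nil);
    if hMap = 0 then
      RaiseLastOSError;
    try
      // Offset 0 and length 0 map the file from its start to its end.
      View := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
      if View = nil then
        RaiseLastOSError;
      try
        Result := LastLineInMemory(PAnsiChar(View), NativeInt(FS.Size));
      finally
        UnmapViewOfFile(View);
      end;
    finally
      CloseHandle(hMap);
    end;
  finally
    FS.Free;
  end;
end;

begin
  if ParamCount = 1 then
    Writeln(ReadLastLineMapped(ParamStr(1)));
end.
```

Mapping does not load the whole file into memory; pages are brought in as they are touched, so scanning backwards from the end only reads the tail. For files too large for one view, the tail alone could be mapped instead, with the view offset aligned to the system allocation granularity.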