How to read a file with Unicode contents
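How can I read a file with Unicode contents using C/C++?
I used the ReadFile function to read a file with Unicode contents, but it does not give usable output. I want a buffer that holds the entire contents of the file.
I am using this code: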
#include <Windows.h>
int main()
{
HANDLE hndlRead;
OVERLAPPED ol = {0};
CHAR* szReadBuffer;
INT fileSize;
hndlRead = CreateFileW(L"file", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hndlRead != INVALID_HANDLE_VALUE)
{
fileSize = GetFileSize(hndlRead, NULL);
szReadBuffer = (CHAR*) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*2);
DWORD nb=0;
int nSize=fileSize;
if (szReadBuffer != NULL)
{
ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
}
}
return 0;
}
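Is there any way to read this file correctly?
Here are nb and szReadBuffer: [screenshot]
Here is the file content in Notepad++: [screenshot]
Your code works fine. It reads the rdp file verbatim into memory.
What is troubling you is the BOM (byte order mark) at the beginning of the rdp file.
If you look at the rdp file with a text editor (Notepad, for example), you will see: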
screen mode id:i:2
use multimon:i:0
desktopwidth:i:2560
desktopheight:i:1600
....
If you look at the rdp file with a hex editor, you will see:
0000 FFFE 7300 6300 7200 6500 6500 6E00 2000 ..s.c.r.e.e.n. .
0008 6D00 6F00 6400 6500 2000 6900 6400 3A00 m.o.d.e. .i.d...
....
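FF FE is the byte order mark. It indicates that the file is little-endian Unicode (UTF-16LE) text, so each character in this file occupies 2 bytes instead of 1.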
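The BOM check described above can be sketched in portable C. This is only an illustration: the names bom_kind and detect_bom are made up for this example and are not part of the Windows API, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE).

```c
#include <assert.h>
#include <stddef.h>

/* Encodings this sketch can recognise from a leading byte order mark.
   bom_kind and detect_bom are hypothetical names, not a library API. */
typedef enum {
    BOM_NONE,      /* no recognised BOM */
    BOM_UTF8,      /* EF BB BF */
    BOM_UTF16_LE,  /* FF FE */
    BOM_UTF16_BE   /* FE FF */
} bom_kind;

/* Inspect the first bytes of buf and report which BOM, if any, is present.
   If bom_len is non-NULL it receives the BOM length in bytes (0 if none). */
bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    bom_kind kind = BOM_NONE;
    size_t n = 0;

    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        kind = BOM_UTF8;
        n = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        kind = BOM_UTF16_LE;
        n = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        kind = BOM_UTF16_BE;
        n = 2;
    }

    if (bom_len != NULL)
        *bom_len = n;
    return kind;
}
```

Called on the bytes that ReadFile placed in the buffer, detect_bom((const unsigned char *)szReadBuffer, nb, &skip) would report BOM_UTF16_LE for the rdp file shown here, with skip set to 2.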
Once the file has been read into memory, you get this (0x00318479 is the address szReadBuffer points to): [image: memory view of the buffer]
- BTW 1: you should call CloseHandle(hndlRead) once the file has been read.
- BTW 2: you should not use HeapAlloc here; use malloc or calloc instead.
Corrected program:
#include <Windows.h>
int main()
{
HANDLE hndlRead;
WCHAR* szReadBuffer; // WCHAR instead of CHAR
INT fileSize;
hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hndlRead != INVALID_HANDLE_VALUE)
{
fileSize = GetFileSize(hndlRead, NULL);
szReadBuffer = (WCHAR*)calloc(fileSize + sizeof(WCHAR), 1); // + sizeof(WCHAR) for NUL string terminator
DWORD nb = 0;
int nSize = fileSize;
if (szReadBuffer != NULL)
{
ReadFile(hndlRead, szReadBuffer, nSize, &nb, NULL);
}
CloseHandle(hndlRead); // close what we have opened
WCHAR *textwithoutbom = szReadBuffer + 1; // skip BOM
// put breakpoint here and inspect textwithoutbom
free(szReadBuffer); // free what we have allocated
}
return 0;
}
As @MickaelWalz suggested, RDP files are now stored as Unicode.
Here is one way to read and display the contents of that file:
- Use a wchar_t * buffer instead of a CHAR * or BYTE * buffer.
- Check that ReadFile() succeeded: bRet is TRUE and nb == nSize.
- Start at the second WCHAR to skip the BOM (bytes FF FE, i.e. the character U+FEFF).
- Don't forget to close your file with CloseHandle(hndlRead);
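The same steps can be sketched without the Windows API. The version below swaps in standard C stdio (fopen/fread) for CreateFileW/ReadFile so that it runs on any platform; read_utf16le_file is a hypothetical helper name, and the sketch assumes the file is UTF-16LE and treats an odd byte count as an error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole UTF-16LE file into a NUL-terminated array of 16-bit code
   units. A leading BOM (bytes FF FE) is skipped. Returns NULL on failure.
   On success *out_units holds the number of code units, not counting the
   terminator. The caller releases the result with free().
   read_utf16le_file is a hypothetical name used for illustration. */
uint16_t *read_utf16le_file(const char *path, size_t *out_units)
{
    assert(path != NULL && out_units != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file size in bytes; UTF-16 needs an even byte count. */
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);
    if (size < 0 || size % 2 != 0 || fseek(fp, 0, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    unsigned char *raw = malloc((size_t)size + 1);
    uint16_t *text = calloc((size_t)size / 2 + 1, sizeof *text);

    /* Verify that the read delivered every byte that was asked for. */
    if (raw == NULL || text == NULL ||
        fread(raw, 1, (size_t)size, fp) != (size_t)size) {
        free(raw);
        free(text);
        fclose(fp);
        return NULL;
    }
    fclose(fp);  /* close what we opened */

    /* Skip the BOM if present, then assemble little-endian code units. */
    size_t start = (size >= 2 && raw[0] == 0xFF && raw[1] == 0xFE) ? 2 : 0;
    size_t n = 0;
    for (size_t i = start; i + 1 < (size_t)size; i += 2)
        text[n++] = (uint16_t)(raw[i] | (raw[i + 1] << 8));
    text[n] = 0;

    free(raw);
    *out_units = n;
    return text;
}
```

Assembling each code unit from two bytes, rather than casting the byte buffer to a 16-bit pointer, keeps the result independent of the host byte order. On Windows, where wchar_t is 16 bits wide, the returned units hold the same values as the WCHAR buffer in the programs here.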
#include <stdio.h>
#include <iostream>
#include <Windows.h>
int main()
{
HANDLE hndlRead;
OVERLAPPED ol = {0};
//BYTE* szReadBuffer;
INT fileSize;
wchar_t *szReadBuffer;
hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hndlRead != INVALID_HANDLE_VALUE)
{
fileSize = GetFileSize(hndlRead, NULL);
szReadBuffer = (wchar_t *) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*sizeof(wchar_t));
DWORD nb=0;
int nSize=fileSize;
BOOL bRet;
if (szReadBuffer != NULL)
{
bRet = ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
if ((bRet) && (nb == nSize)) {
printf("%02X,%02X... %02X\n",szReadBuffer[0],szReadBuffer[1],szReadBuffer[nb/sizeof(wchar_t)-1]); // nb counts bytes, so convert it to a WCHAR index
std::wcout << L"info " << (szReadBuffer+1) << L" " << nb << std::endl;
}
}
HeapFree(GetProcessHeap(), 0, szReadBuffer); // release the buffer allocated with HeapAlloc
CloseHandle(hndlRead);
}
return 0;
}