How to read a file with Unicode contents

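How can I read a file with Unicode contents using C/C++?

I used the ReadFile function to read a file with Unicode contents, but I don't get any usable output. I want a buffer that holds the entire contents of the file.

I am using this code: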

#include <Windows.h>

int main()
{
    HANDLE hndlRead;
    OVERLAPPED ol = {0};

    CHAR* szReadBuffer;
    INT fileSize;

    hndlRead = CreateFileW(L"file", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hndlRead != INVALID_HANDLE_VALUE)
    {
        fileSize = GetFileSize(hndlRead, NULL);
        szReadBuffer = (CHAR*) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*2);
        DWORD nb=0;
        int nSize=fileSize;
        if (szReadBuffer != NULL)
        {
            ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
        }
    }

    return 0;
}

Is there any way to read this file correctly?

Here are nb and szReadBuffer: [image]

This is the content of my file in Notepad++: [image]

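Your code works fine. It reads the rdp file verbatim into memory.

What is confusing you is the BOM (byte order mark) at the beginning of the rdp file.

If you look at the rdp file with a text editor (Notepad, for example), you will see: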

screen mode id:i:2
use multimon:i:0
desktopwidth:i:2560
desktopheight:i:1600
....

If you look at the rdp file with a hex editor, you will see:

0000 FFFE 7300 6300 7200 6500 6500 6E00 2000 ..s.c.r.e.e.n. .
0008 6D00 6F00 6400 6500 2000 6900 6400 3A00 m.o.d.e. .i.d.:.
....

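FF FE is the byte order mark (U+FEFF stored low byte first). It indicates that the file is a text file encoded as little-endian UNICODE (UTF-16LE), so each character takes 2 bytes instead of 1.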
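
As a rough illustration of that check, here is a minimal sketch in portable C that looks at the first bytes of a buffer and reports which byte order mark, if any, it starts with. The names bom_kind and detect_bom are made up for this example and are not part of the Windows API. The sketch also does not try to distinguish UTF-16LE from UTF-32LE, whose BOM (FF FE 00 00) starts with the same two bytes.

```c
#include <stddef.h>

/* Encodings this sketch can recognize from a leading byte order mark. */
typedef enum {
    BOM_NONE,      /* no recognized BOM */
    BOM_UTF8,      /* EF BB BF */
    BOM_UTF16_LE,  /* FF FE */
    BOM_UTF16_BE   /* FE FF */
} bom_kind;

/* Hypothetical helper: inspects the first bytes of a buffer.
   Assumption: UTF-32 is not handled, so FF FE 00 00 is reported as UTF-16LE. */
bom_kind detect_bom(const unsigned char *data, size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16_LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

For the rdp file above, passing the buffer filled by ReadFile (cast to const unsigned char *) and the byte count nb should report BOM_UTF16_LE.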

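Once the file has been read into memory, you get this (0x00318479 is the address szReadBuffer points to): [image]

  • BTW 1: you should call CloseHandle(hndlRead) once the file has been read.
  • BTW 2: you shouldn't use HeapAlloc, but rather malloc/calloc.

Corrected program: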

#include <stdlib.h>    // calloc, free
#include <Windows.h>

int main()
{
  HANDLE hndlRead;

  WCHAR* szReadBuffer;   // WCHAR instead of CHAR
  INT fileSize;

  hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

  if (hndlRead != INVALID_HANDLE_VALUE)
  {
    fileSize = GetFileSize(hndlRead, NULL);
    szReadBuffer = (WCHAR*)calloc(fileSize + sizeof(WCHAR), 1);  // + sizeof(WCHAR) for NUL string terminator
    DWORD nb = 0;
    int nSize = fileSize;
    if (szReadBuffer != NULL)
    {
      // ReadFile returns FALSE on failure; nb receives the number of bytes actually read
      if (ReadFile(hndlRead, szReadBuffer, nSize, &nb, NULL) && nb >= sizeof(WCHAR))
      {
        WCHAR *textwithoutbom = szReadBuffer + 1;  // skip BOM

        // put breakpoint here and inspect textwithoutbom
      }

      free(szReadBuffer);  // free what we have allocated
    }

    CloseHandle(hndlRead);   // close what we have opened
  }

  return 0;
}
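
The cast to WCHAR* above works because WCHAR on Windows is a 16-bit type with the same byte order as the file. Where that cannot be assumed (wchar_t is often 32 bits on other platforms), the bytes can be decoded explicitly. The following is only a sketch in portable C; decode_utf16le is a hypothetical name, and the caller is assumed to provide an output array with room for size / 2 elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes UTF-16LE bytes into 16-bit code units,
   skipping a leading FF FE byte order mark if present.
   `out` must have room for size / 2 elements.
   Returns the number of code units written; a trailing odd byte is ignored. */
size_t decode_utf16le(const unsigned char *bytes, size_t size, uint16_t *out)
{
    size_t i = 0;
    size_t n = 0;

    if (size >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;  /* skip the BOM */

    for (; i + 1 < size; i += 2)
        out[n++] = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));  /* low byte comes first */

    return n;
}
```

The function returns the number of code units, which is also what you need if you want to index the last character: the byte count reported by ReadFile is twice that number (plus the BOM).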

As @MickaelWalz suggested, the file format of RDP files is now Unicode.

Here is one way to read and display the contents of that file:

  • Use a wchar_t * buffer instead of a CHAR * or BYTE * buffer.
  • Check that ReadFile() succeeded: bRet is TRUE and nb == nSize.
  • Start at the second WCHAR to skip the Unicode byte order mark (bytes FF FE, which read back as the WCHAR value 0xFEFF on a little-endian machine).
  • Don't forget to close your file with CloseHandle(hndlRead);

#include <stdio.h>
#include <iostream>
#include <Windows.h>

int main()
{
    HANDLE hndlRead;
    OVERLAPPED ol = {0};

    //BYTE* szReadBuffer;
    INT fileSize;
    wchar_t *szReadBuffer;

    hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hndlRead != INVALID_HANDLE_VALUE)
    {
        fileSize = GetFileSize(hndlRead, NULL);
        szReadBuffer = (wchar_t *) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*sizeof(wchar_t));
        DWORD nb=0;
        int nSize=fileSize;
        BOOL bRet;
        if (szReadBuffer != NULL)
        {
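            bRet = ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
            if ((bRet) && (nb == (DWORD)nSize) && (nb >= 2 * sizeof(wchar_t))) {
                DWORD nChars = nb / sizeof(wchar_t);  // nb counts bytes, not wchar_t elements
                printf("%04X,%04X... %04X\n", (unsigned)szReadBuffer[0], (unsigned)szReadBuffer[1], (unsigned)szReadBuffer[nChars-1]);
                std::wcout << L"info " << (szReadBuffer+1) << L" " << nb << std::endl;
            }
            HeapFree(GetProcessHeap(), 0, szReadBuffer);  // release the buffer
        }
        CloseHandle(hndlRead);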
    }

    return 0;
}
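
If the text has to be passed on to code that expects UTF-8 rather than wide characters (an assumption, the question does not say so), the 16-bit code units can be converted by hand. This is a minimal sketch in portable C, not a replacement for a tested library routine. utf16_to_utf8 is a hypothetical name, and the caller is assumed to provide an output buffer of at least 3 * n + 1 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts n UTF-16 code units to a NUL-terminated
   UTF-8 string. `out` must have room for 3 * n + 1 bytes.
   Unpaired surrogates are replaced with U+FFFD.
   Returns the number of bytes written, not counting the terminating NUL. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* high surrogate: combine with the following low surrogate */
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* low surrogate without a preceding high one */
        }

        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    out[o] = '\0';
    return o;
}
```

On Windows, where wchar_t is 16 bits, the buffer read from the file could be passed in starting after the BOM, with n set to the number of characters that follow it. The system function WideCharToMultiByte with CP_UTF8 does the same job.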