How to read a 5 GB file sequentially using a variable window

I am processing Common Crawl WARC files. They are 5 GB uncompressed and contain text, XML, and WARC headers.

This is the code I am having particular trouble with:

wstring sub = buffer->substr(windowStart, windowSize);

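This gives me the error "expression must have a pointer to class type". I think this is because buffer is a pointer to a heap memory location of that size, so I cannot run any string operations on it. But shouldn't the -> operator get at what it points to, so that I can run things like substr?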

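I am using a simple buffer like this because, as I understand it, mapping the file into memory (MapViewOfFile etc.) is better suited to random access. If I only need to read sequentially, would mapping actually be slower?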

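I want to read the file sequentially. For speed, I read the file into RAM in chunks, then process the chunk in RAM before fetching the next one from disk. Say 1 MB per chunk, for example.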

I am not processing all of the XML; some will be skipped. I grab the text and some of the WARC headers and skip the rest.

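The idea is to use a sliding window over the file chunk in RAM. The window starts where it last stopped in the chunk and grows in a loop. Once it is large enough, a regular expression checks it for any matching tags, headers, or text. On a match, the code either skips just that tag, skips ahead some number of characters (in some cases 500 characters, when a particular type of WARC header is encountered), writes the tag out (if I want to keep it), and so on.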

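When the window matches, windowStart is set equal to windowEnd and the window starts expanding again to look for the next pattern. When the end of the buffer is reached, the code keeps track of any partial tag and refills the buffer from disk.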

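The main problem I am running into is how to take the window as it slides. The buffer is a pointer to a location in heap memory. For some reason I cannot use the period or -> operator on it, so I cannot use substr, regex, etc. I could make a copy, but do I really need to?

Here is my code so far: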

BOOL pageActive = FALSE;
BOOL xml = FALSE;
#define MAXBUFFERSIZE 1024
#define MAXTAGSIZE 64
DWORD windowStart = 0; DWORD windowEnd = 15; DWORD windowSize = 15; // buffer window containing tag candidate
wstring windowCopy;
DWORD bufferSize = MAXBUFFERSIZE;
_int64 fileRemaining;

HANDLE hFile;
DWORD  dwBytesRead = 0;
OVERLAPPED ol = { 0 };
LARGE_INTEGER dwPosition;

TCHAR* buffer;

hFile = CreateFile(
    inputFilePath,         // file to open
    GENERIC_READ,          // open for reading
    FILE_SHARE_READ | FILE_SHARE_WRITE,       // share for reading and writing
    NULL,                  // default security
    OPEN_EXISTING,         // existing file only
    FILE_ATTRIBUTE_NORMAL, // normal file    | FILE_FLAG_OVERLAPPED
    NULL);                 // no attr. template

if (hFile == INVALID_HANDLE_VALUE)
{
    DisplayErrorBox((LPWSTR)L"CreateFile");
    return 0;
}

LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);

_int64 fileSize = (__int64)size.QuadPart;
double gigabytes = fileSize * 9.3132e-10;
sendToReportWindow(L"file size: %lld bytes \(%.1f gigabytes\)\n", fileSize, gigabytes);

if(fileSize > MAXBUFFERSIZE)
{
    TCHAR* buffer = new TCHAR[MAXBUFFERSIZE]; buffer[0] = 0;
    //sendToReportWindow(L"buffer is MAXBUFFERSIZE\n");
}
else
{
    TCHAR* buffer = new TCHAR[fileSize]; buffer[0] = 0;
    //sendToReportWindow(L"buffer is fileSize + 1\n");
}
fileRemaining = fileSize;

sendToReportWindow(L"file remaining: %lld bytes\n", fileRemaining);

//TCHAR readBuffer[MAXBUFFERSIZE] = { 0 };

while (fileRemaining)                                       // outer loop. while file remaining, read file chunk to buffer
{

    if (bufferSize > fileRemaining)                         // as fileremaining gets smaller as file is processed, it eventually is smaller than the buffer
        bufferSize = fileRemaining;

    if (FALSE == ReadFile(hFile, buffer, bufferSize -1, &dwBytesRead, NULL))
    //if (FALSE == ReadFile(hFile, readBuffer, bufferSize -1, &dwBytesRead, NULL))
    {
        sendToReportWindow(L"file read failed\n");
        CloseHandle(hFile);
        return 0;
    }

    fileRemaining -= bufferSize;                            //fileRemaining is size of the file left after this buffer is processed
    sendToReportWindow(L"outer loop\n");

    // declare and clear span char array[maxTagSize]   // size of array is maximum tag size (64). This is for unused windows. Raw text is not considered a tag

    while (windowEnd < bufferSize)              //inner loop. while unused data remains in buffer   
    {
        windowSize = windowEnd - windowStart;

        // windowsize += span.size

//                The window start position remains fixed as the window size is slowly increased. Once it is large enough, some conditional below begin to look at it.If any triggers, they eat that window. Setting the new start position at the previous end position.
//                If the buffer ends mid - tag, the contents of the window are copy to the span array variable

            // Page state. Tags in header

//                If !pageActive
//                if windowSize > 7 (warc / 1.0)
//                    Convert chunk to string for regex ? (prepend span array from previous loop)
//                    If Regex chunk WARC - Type : response pageActive = true; wstart = wend, clear span
//                    Elseif regex chunk other warc - type      clear span; skip ahead 550 for start, 565 for end
//                    Continue

//                    // page is active
//
//                    if windowSize > 6
 //                        If regex chunk WARC / \d     pageActive = false; xml = false; wstart = wend, clear span; Continue

//                        If !xml
//                        If windowSize > 15 (warc date)
//                        Convert chunk to string for regex ? (prepend span array from previous loop)
//                        If regex  chunk warc date     output warc date; wstart = wend, clear span
//                        elseIf regex chunk warc uri   output warc uri; wstart = wend, clear span; skip ahead 300
//                        ElseIf end of window has \n“ < ”  Xml = true  // any window size where xml is not started
//                        continue              // whatever triggers in this !xml block, always continue    

//                    // page and xml are active
//                    // only send to output bare text when a [^\n]< or newline is reached
        // test where just outputs all the tags or text it finds
        // pull out any <.+> sequences or any >.+< sequences
        // multibyte conversion, build string of window
        //LPCCH readBuffer = { "ab" }; // = buffer[2];

        // std::string str2 = str.substr (3,5);    
        //wstring sub = (wstring)readBuffer.substr(0,5);          // substring of buffer
        wstring sub = buffer->substr(windowStart, windowSize);
        TCHAR converted[64] = { 0 };
        MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)&sub, -1, converted, MAXBUFFERSIZE);
        //MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)buffer, MAXBUFFERSIZE, converted, 1);             // convert between the utf encoding of the file to the utf encoding of windows?
        sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, converted);
        //sendToReportWindow((LPWSTR)buffer[windowStart]);
        windowStart = windowEnd;


//                    //Tags in body. Any chunk size

//                        Convert chunk to string for regex ? (prepend span array from previous loop)
//                        if regex chunk tag pattern            output pattern, wstart = wend, clear span
                        // nested tags? no
 //   windowEnd++;      // tests above did not bite. so increment end of window, increasing window size
    }   // inner loop: while windowEnd <buffersize

// end of buffer: load any unused window into span
//If windowEnd != windowStart       // window start did not get set to end by regex above
//Span = buffer(start – end)

//file progress indicator
//fileSize / fileRemaining x 0.01 // calculate percentage of file remaining with each buffer load
//print progress

//windowStart = 0; windowEnd = 1; windowSize = 1 // look at smaller pieces after first iteration (not in w header)
}   // outer loop. while fileRemaining

delete buffer;

Which give me the error, "expression must have a pointer to class type".

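TCHAR has no method such as substr. buffer is a TCHAR*, a raw pointer to characters rather than a pointer to a class object, so the -> operator has no members to reach.

Change it to: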

  wstring str(buffer);
  wstring sub = str.substr(windowStart, windowSize);
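Note that wstring str(buffer) copies everything up to the first null terminator, so the buffer has to be null-terminated, and the copy is repeated for every window. If that copy is a concern, a std::string_view (C++17) offers substr over the raw buffer without copying anything. The following is only a sketch: it assumes the file bytes are held in a plain char buffer and that the byte count reported by the read call is available. The helper name windowText is made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Returns a non-owning view of [windowStart, windowStart + windowSize) inside
// a raw buffer. Nothing is copied; the view is valid only while the buffer is.
// windowText is a hypothetical helper name, not part of any library.
std::string_view windowText(const char* buffer, std::size_t bytesRead,
                            std::size_t windowStart, std::size_t windowSize)
{
    std::string_view whole(buffer, bytesRead);       // explicit length, no terminator needed
    windowStart = std::min(windowStart, bytesRead);  // clamp so substr cannot throw
    return whole.substr(windowStart, windowSize);    // substr clamps the length itself
}
```

A view like this can be searched with find directly, or passed to std::regex_search through its begin() and end() iterators.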

Other code that needs to change:

MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)&sub, -1, converted, MAXBUFFERSIZE);       
sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, converted);

=> sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, sub.c_str()); //use string::c_str method

buffer = new TCHAR[MAXBUFFERSIZE]; buffer[0] = 0;   //remove TCHAR*
buffer = new TCHAR[fileSize]; buffer[0] = 0;    //remove TCHAR*

I am not processing all the xml, some will be skipped. grabbing the text and some of the warc headers, skipping the rest.

You can use string::find to grab the WARC headers (make sure the WARC header text you search for is unique).
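A minimal sketch of that idea, assuming the header block is already in memory as narrow characters and that lines end with "\r\n" or "\n". The helper name headerValue is hypothetical; WARC-Type and WARC-Target-URI are named fields from the WARC format.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns the value of the first header line that starts with `name`
// (for example "WARC-Target-URI:"), or std::nullopt if it is not present.
// headerValue is a hypothetical helper, not part of any library.
std::optional<std::string> headerValue(std::string_view block, std::string_view name)
{
    std::size_t pos = 0;
    while ((pos = block.find(name, pos)) != std::string_view::npos) {
        // Accept the match only at the start of a line.
        if (pos == 0 || block[pos - 1] == '\n') {
            const std::size_t begin = pos + name.size();
            std::size_t end = block.find('\n', begin);
            if (end == std::string_view::npos) end = block.size();
            std::string_view value = block.substr(begin, end - begin);
            // Trim leading spaces and a trailing '\r'.
            while (!value.empty() && value.front() == ' ') value.remove_prefix(1);
            if (!value.empty() && value.back() == '\r') value.remove_suffix(1);
            return std::string(value);
        }
        ++pos;  // match was in the middle of a line, keep looking
    }
    return std::nullopt;
}
```

Requiring the match to sit at the start of a line is one way to keep body text that happens to contain the same words from being mistaken for a header.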

e.g.: Check if a string contains a string in C++

By the way, whether you use Unicode characters or multi-byte characters, you need to stick to a single encoding throughout.
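On the part of the question about a tag that straddles two buffer loads: one possible approach is to keep the unconsumed tail of each chunk and prepend it to the next one. The sketch below is an illustration only. It uses find rather than a regular expression, looks only for "<...>" sequences, and the class name TagScanner is made up.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Collects complete "<...>" tags from data that arrives in arbitrary chunks.
// TagScanner is a hypothetical name. Text between tags is discarded here.
class TagScanner {
public:
    // Feed the next chunk; returns every tag completed by this chunk.
    std::vector<std::string> feed(std::string_view chunk)
    {
        pending_.append(chunk);            // leftover from last time + new data
        std::vector<std::string> tags;
        std::size_t start = 0;             // windowStart in the question's terms
        for (;;) {
            const std::size_t open = pending_.find('<', start);
            if (open == std::string::npos) { start = pending_.size(); break; }
            const std::size_t close = pending_.find('>', open);
            if (close == std::string::npos) { start = open; break; }  // partial tag
            tags.push_back(pending_.substr(open, close - open + 1));
            start = close + 1;             // window consumed: start = end
        }
        pending_.erase(0, start);          // keep only the unconsumed tail
        return tags;
    }

private:
    std::string pending_;                  // partial tag carried between chunks
};
```

A fuller version would probably cap the carried-over tail at the maximum tag size (MAXTAGSIZE in the question), so that a stray '<' in body text cannot make it grow without bound.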