Increasing the speed of reading a csv file C++

Question

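I created this code to read and filter my csv files. It works the way I want for small files. But I just tried a file with 200k lines, and it takes about 4 minutes, which is too long for my use case.

After some testing and fixing a few rather stupid mistakes, I got the time down to 3 minutes. I found that about half of the time is spent reading the file and the other half is spent generating the result vector.

Is there any way to improve the speed of my program? Especially the part that reads from the csv? I'm really out of ideas at the moment. I would appreciate any help.

EDIT: The filter selects data by a time range in a specific column, or by a time range plus a filter word, and outputs the data into a result vector of strings.

My CSV file looks like this:

The headers are:

ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum

The data looks like: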

523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4

std::ifstream input_file(strComplPath, std::ios::in);

int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }

    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
 

for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;

    // Check for faulty Data and replace it with an empty string 
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }

    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}

Answer 1

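I'm not sure exactly what the bottleneck in your program is (I copied your code from an earlier version of the question), but you have a lot of regexes and you mix reading records with post-processing. I suggest that you create a class to hold one of these records, called record, overload operator>> for record, and then use std::copy_if from the file with a filter that you can design independently of the reading. Do the post-processing after you have read the records that pass the filter.

I made a small test, and reading 200k records while filtering takes 2 seconds on my old spinning disk. I only filtered using time_lower_bound and time_upper_bound. Additional checks will of course make it a bit slower, but it should not take minutes.

Example: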

#include <algorithm>
#include <chrono>
#include <ctime>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// the suggested class to hold a record
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};

// A free function to read a time_point from an `istream`:
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};

    // C++20:
    // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);

    // C++11 to C++17 version:
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if(is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}

// The operator>> overload to read one `record` from an `istream`:
std::istream& operator>>(std::istream& is, record& r) {
    is >> r.ID;
    r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}

// An operator<< overload to print one `record`:
std::ostream& operator<<(std::ostream& os, const record& r) {
    std::ostringstream oss;
    oss << r.ID;
    {
        // I only made a C++11 to C++17 version for this one:
        std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
        std::tm ts = *std::localtime(&time);
        oss << ';' << ts.tm_mday << '.' << ts.tm_mon + 1 << '.'
            << ts.tm_year + 1900 << ' ' << ts.tm_hour << ':' << ts.tm_min << ';';
    }
    oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
        << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
    return os << oss.str();
}

// The reading and filtering part of `main` would then look like this:
int main() { // not "void main()"
    std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
    std::istringstream time_upper_bound_s("20.05.2021 09:40:00");

    // Your time boundaries as `std::chrono::system_clock::time_point`s -
    // again using the `to_tp` helper function:
    auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
    auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");

    // Verify that the boundaries were parsed ok:
    if(time_lower_bound == std::chrono::system_clock::time_point{} ||
       time_upper_bound == std::chrono::system_clock::time_point{}) {
        std::cerr << "failed to parse boundaries\n";
        return 1;
    }

    std::ifstream is("data"); // whatever your file is called
    if(is) {
        std::vector<record> recs; // a vector with all the records

        // create your filter
        auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
            // Only copy those `record`s within the set boundaries.
            // You can add additional conditions here too.
            return r.Timestamp >= time_lower_bound && r.Timestamp <= time_upper_bound;
        };

        // Copy those records that pass the filter:
        std::copy_if(std::istream_iterator<record>(is),
                     std::istream_iterator<record>{},
                     std::back_inserter(recs), filter);

        // .. post process `recs` here ...

        // print result
        for(auto& r : recs) std::cout << r;
    }
}
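The comment in the filter above notes that additional conditions can be added. As a minimal sketch of what the question's second mode (time range plus filter word) might look like, the functor below combines both checks. The names `log_entry` and `range_and_word_filter` are hypothetical, `log_entry` is a cut-down stand-in for `record`, and matching the word against the Description column is an assumption, since the question does not say which column the word applies to.

```cpp
#include <chrono>
#include <string>

// Hypothetical, cut-down stand-in for `record` (only the fields used here).
struct log_entry {
    std::chrono::system_clock::time_point Timestamp;
    std::string Description;
};

// Hypothetical filter: inclusive time range plus an optional filter word.
struct range_and_word_filter {
    std::chrono::system_clock::time_point lower;
    std::chrono::system_clock::time_point upper;
    std::string word; // empty means "no word filter"

    bool operator()(const log_entry& r) const {
        // Reject anything outside the time range first (cheap comparison).
        if (r.Timestamp < lower || r.Timestamp > upper) return false;
        // Assumption: the word is searched in Description.
        // A fixed word needs only a substring search, not a regex.
        return word.empty() || r.Description.find(word) != std::string::npos;
    }
};
```

An instance of the functor could be passed to std::copy_if in place of the lambda, with `record` substituted for `log_entry`.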

Answer 2

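Ted has already given an answer. I worked on a solution at the same time, so let me add it as well.

I created test data with 500k records, and all parsing is done in less than 3 seconds on my machine.

I also created classes.

The speed is gained by using std::move, by increasing the input buffer size, and by calling reserve on the std::vector.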
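As a rough illustration of two of these points (reserve and std::move), here is a minimal sketch that reads rows in the style of the question's loop. The function name `read_rows`, its parameters, and the column count of 8 (taken from the header line in the question) are assumptions for illustration only.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read all rows from `in`, splitting each line at `delimiter`.
// `expected_rows` is only a capacity hint; more or fewer rows still work.
std::vector<std::vector<std::string>> read_rows(std::istream& in,
                                                std::size_t expected_rows,
                                                char delimiter = ';') {
    std::vector<std::vector<std::string>> rows;
    rows.reserve(expected_rows); // avoid repeated reallocation of the outer vector

    std::string line;
    std::string field;
    while (std::getline(in, line)) {
        std::vector<std::string> items;
        items.reserve(8); // assumption: eight columns, as in the question's header

        std::istringstream fields(line);
        while (std::getline(fields, field, delimiter)) {
            // Move instead of copy; the next getline overwrites `field` anyway.
            items.push_back(std::move(field));
        }
        // Move the whole row into the result instead of copying it.
        rows.push_back(std::move(items));
    }
    return rows;
}
```

Compared with the question's `csv_contents[counter] = items;`, which copies every string of every row, the moves hand over the already allocated storage.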

Please see another solution below. I omitted the filtering; Ted has already shown that.

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <ctime>
#include <vector>
#include <chrono>
#include <sstream>
#include <algorithm>
#include <iterator>

constexpr size_t MaxLines = 600'000u;
constexpr size_t NumberOfLines = 500'000u;
const std::string fileName{ "test.csv" };

// Dummy routine for writing a test file
void createFile() {
    if (std::ofstream ofs{ fileName }; ofs) {
        std::time_t ttt = 0;
        for (size_t k = 0; k < NumberOfLines; ++k) {
            std::time_t time = static_cast<time_t>(ttt);
            ttt += 1000;
            ofs << k << ';'
#pragma warning(suppress : 4996)
                << std::put_time(std::localtime(&time), "%d.%m.%Y  %H:%M") << ';'
                << k << ';'
                << "UserID" << k << ';'
                << "Area" << k << ';'
                << "Description" << k << ';'
                << "Comment" << k << ';'
                << "Checksum" << k << '\n';
        }
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing\n\n";
}


// We will create a bigger input buffer for our stream
constexpr size_t ifStreamBufferSize = 100'000u;
static char buffer[ifStreamBufferSize];


// Object oriented Model. Class for one record
struct Record {

    // Data
    long id{};
    std::tm time{};
    long objectId{};
    std::string userId{};
    std::string area{};
    std::string description{};
    std::string comment{};
    std::string checkSum{};

    // Methods
    // Extractor operator
    friend std::istream& operator >> (std::istream& is, Record& r) {

        // Read one complete line
        if (std::string line; std::getline(is, line)) {

            // Here we will store the parts of the line after the split
            std::vector<std::string> parts{};

            // Convert line to istringstream for further extraction of line parts
            std::istringstream iss{ line };

            // One part of a line
            std::string part{};

            // Split
            while (std::getline(iss, part, ';')) {

                // Check for error (fields starting with '#' mark faulty data)
                if (part[0] == '#') {
                    is.setstate(std::ios::failbit);
                    break;
                }
                // add part
                parts.push_back(std::move(part));
            }
            // If all was OK
            if (is) {
                // If we have enough parts
                if (parts.size() == 8) {

                    // Convert parts to target data in record
                    r.id = std::strtol(parts[0].c_str(), nullptr, 10);

                    std::istringstream ss{parts[1]};
                    ss >> std::get_time(& r.time, "%d.%m.%Y  %H:%M");
                    if (ss.fail()) 
                        is.setstate(std::ios::failbit);

                    r.objectId = std::strtol(parts[2].c_str(), nullptr, 10);

                    r.userId = std::move(parts[3]);

                    r.area = std::move(parts[4]);

                    r.description = std::move(parts[5]);

                    r.comment = std::move(parts[6]);

                    r.checkSum = std::move(parts[7]);
                }
                else is.setstate(std::ios::failbit);
            }
        }
        return is;
    }
    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Record& r) {
        return os << r.id << "   "
#pragma warning(suppress : 4996)
            << std::put_time(&r.time, "%d.%m.%Y  %H:%M") << "   "  
            << r.objectId << "   " << r.userId << "   " << r.area << "   " << r.description << "   " << r.comment << "   " << r.checkSum;
    }
};

// Data will hold all records
struct Data {

    // Data part
    std::vector<Record> records{};

    // Constructor will reserve space to avoid reallocation
    Data() { records.reserve(MaxLines); }

    // Simple extractor. Will call Record's extractor
    friend std::istream& operator >> (std::istream& is, Data& d) {

        // Set bigger file buffer. This is a time saver
        is.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
        std::copy(std::istream_iterator<Record>(is), {}, std::back_inserter(d.records));
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Data& d) {
        std::copy(d.records.begin(), d.records.end(), std::ostream_iterator<Record>(os, "\n"));
        return os;
    }

};

int main() {
    // createFile();

    auto start = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);

    if (std::ifstream ifs{ fileName }; ifs) {

        Data data;

        // Start time measurement
        start = std::chrono::system_clock::now();

        // Read and parse complete data
        ifs >> data;

        // End of time measurement
        elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
        std::cout << "\nReading and splitting. Duration: " << elapsed.count() << " ms\n";

        // Some debug output
        std::cout << "\n\nNumber of read records:  " << data.records.size() << "\n\n";
        for (size_t k{}; k < 10 && k < data.records.size(); ++k)
            std::cout << data.records[k] << '\n';
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading\n\n";
}

Yes, I used "ctime".
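One caveat on the bigger input buffer: the effect of pubsetbuf on a file stream is implementation-defined, and on some standard libraries the request is only honored when it is made before the file is opened. A hedged sketch of that ordering follows. `count_lines_buffered` is a hypothetical helper, and the buffer size is whatever the caller picks. Whether it is actually faster needs to be measured on the target platform.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: count the lines of a file while reading through a
// caller-sized stream buffer. Returns 0 if the file cannot be opened.
std::size_t count_lines_buffered(const std::string& path, std::size_t buffer_size) {
    // Declared before the stream so that it outlives the stream.
    std::vector<char> buffer(buffer_size);

    std::ifstream in;
    // Request the buffer BEFORE opening the file; the effect is
    // implementation-defined, and the request may be ignored.
    in.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    in.open(path);

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++lines;
    }
    return lines;
}
```

The same ordering could be applied to the Data extractor above by setting the buffer on the std::ifstream in main before it is opened, rather than inside operator>>.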

c++

csv

io