Increasing the speed of reading a CSV file in C++

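I wrote this code to read and filter my CSV file. It works the way I want for small files. But I just tried a file with 200k lines, and it takes about 4 minutes, which is too long for my use case.

After some testing and fixing a few really silly mistakes, I got the time down to 3 minutes. I found that about half of the time is spent reading the file and the other half building the result vector.

Is there any way to speed up my program, especially the part that reads from the CSV? I'm really out of ideas right now. I would appreciate any help.

EDIT: The filter selects data either by a time range in a specific column, or by a time range plus a filter word, and writes the matching data into a result vector of strings.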

My CSV file looks like this:

The headers are:

ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum

The data is:

523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4

std::ifstream input_file(strComplPath, std::ios::in);

int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }

    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
 

for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;

    // Check for faulty Data and replace it with an empty string 
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }

    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}

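I'm not sure what exactly the bottleneck in your program is (I copied your code from an earlier version of the question), but you have a lot of regexes and you mix reading records with post-processing. I suggest creating a class to hold one of these records, called record, overloading operator>> for record, and then using std::copy_if on the file with a filter that you can design independently of the reading. Do the post-processing after you have read the records that pass the filter.

I made a small test, and reading 200k records while filtering takes 2 seconds on my old spinning disk. I only filtered on time_lower_bound and time_upper_bound. The additional checks will of course make it somewhat slower, but it should not take minutes.

Example: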

#include <algorithm>
#include <chrono>
#include <ctime>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// the suggested class to hold a record
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};
// A free function to read a time_point from an `istream`:
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};
    // C++20:
    // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);

    // C++11 to C++17 version:
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if(is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}
// The operator>> overload to read one `record` from an `istream`:
std::istream& operator>>(std::istream& is, record& r) {
    is >> r.ID;
    r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}
// An operator<< overload to print one `record`:
std::ostream& operator<<(std::ostream& os, const record& r) {
    std::ostringstream oss;
    oss << r.ID;
    { // I only made a C++11 to C++17 version for this one:
        std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
        std::tm ts = *std::localtime(&time);
        oss << ';' << ts.tm_mday << '.' << ts.tm_mon + 1 << '.'
            << ts.tm_year + 1900 << ' ' << ts.tm_hour << ':' << ts.tm_min << ';';
    }
    oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
        << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
    return os << oss.str();
}
// The reading and filtering part of `main` would then look like this:
int main() { // not "void main()"
    std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
    std::istringstream time_upper_bound_s("20.05.2021 09:40:00");

    // Your time boundaries as `std::chrono::system_clock::time_point`s - 
    // again using the `to_tp` helper function:
    auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
    auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");

    // Verify that the boundaries were parsed ok:
    if(time_lower_bound == std::chrono::system_clock::time_point{} ||
       time_upper_bound == std::chrono::system_clock::time_point{}) {
        std::cerr << "failed to parse boundaries\n";
        return 1;
    }

    std::ifstream is("data"); // whatever your file is called
    if(is) {
        std::vector<record> recs; // a vector with all the records

        // create your filter
        auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
            // Only copy those `record`s within the set boundaries.
            // You can add additional conditions here too.
            return r.Timestamp >= time_lower_bound &&
                   r.Timestamp <= time_upper_bound;
        };

        // Copy those records that pass the filter:
        std::copy_if(std::istream_iterator<record>(is),
                     std::istream_iterator<record>{}, std::back_inserter(recs),
                     filter);

        // .. post process `recs` here ...

        // print result
        for(auto& r : recs) std::cout << r;
    }
}
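
The question also mentions filtering by a time range plus a filter word, and the point above about regexes applies to the timestamp checks. As a rough sketch only, and not part of the answer's code: the timestamp can be turned into a sortable integer with a single sscanf call, so that the range check becomes two integer comparisons. The names timestamp_key, range_and_word_filter and make_filter are hypothetical, and searching the word in the Description column is an assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: converts "dd.mm.yyyy HH:MM" into a sortable integer
// such as 202105191215. Returns -1 when the text does not have that layout
// (for example "####"). Field ranges (month 1-12 etc.) are not validated,
// and anything after the minutes (such as ":00") is ignored.
long long timestamp_key(const std::string& text) {
    int day = 0, month = 0, year = 0, hour = 0, minute = 0;
    if (std::sscanf(text.c_str(), "%d.%d.%d %d:%d",
                    &day, &month, &year, &hour, &minute) != 5) {
        return -1;
    }
    long long key = year;
    key = key * 100 + month;
    key = key * 100 + day;
    key = key * 100 + hour;
    key = key * 100 + minute;
    return key;
}

// Hypothetical filter: time range, optionally combined with a filter word.
struct range_and_word_filter {
    long long lower;
    long long upper;
    std::string word;  // empty means "filter by time range only"

    bool operator()(const std::string& timestamp,
                    const std::string& description) const {
        const long long key = timestamp_key(timestamp);
        if (key < lower || key > upper) return false;  // also rejects -1
        return word.empty() || description.find(word) != std::string::npos;
    }
};

// Parses the boundaries once, outside of any per-line loop.
range_and_word_filter make_filter(const std::string& from,
                                  const std::string& to,
                                  std::string word = "") {
    const long long lower = timestamp_key(from);
    const long long upper = timestamp_key(to);
    assert(lower >= 0 && lower <= upper);  // both bounds must parse and be ordered
    return range_and_word_filter{lower, upper, std::move(word)};
}
```

Such a filter could be called from inside the lambda passed to std::copy_if. The boundaries are parsed once when the filter is built, rather than once per line.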

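Ted has already given an answer. I worked on a solution at the same time, so let me add it as well.

I created test data with 500k records, and on my machine all parsing finishes in under 3 seconds.

I also created classes.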

Speed is gained by using std::move, by increasing the input buffer size, and by calling reserve on the std::vector.
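
To show those three ideas in isolation, here is a minimal sketch, assuming a hypothetical helper read_lines and an arbitrary 1 MiB buffer size. Whether a custom stream buffer is honored at all is implementation-defined, so treat that part as a hint and measure it on your toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads every line of `path` into a vector.
// `expected_lines` is only a sizing hint for reserve().
std::vector<std::string> read_lines(const std::string& path,
                                    std::size_t expected_lines) {
    assert(!path.empty());

    // Declared before the stream so that it outlives it.
    std::vector<char> buffer(1 << 20);  // 1 MiB, an arbitrary choice

    std::ifstream input;
    // The effect of pubsetbuf is implementation-defined. Some standard
    // libraries only accept the buffer before the file is opened, so it is
    // requested here before open().
    input.rdbuf()->pubsetbuf(buffer.data(),
                             static_cast<std::streamsize>(buffer.size()));
    input.open(path);

    std::vector<std::string> lines;
    lines.reserve(expected_lines);  // avoids repeated reallocation while growing

    std::string line;
    while (std::getline(input, line)) {
        // Move instead of copy; getline overwrites the moved-from string.
        lines.push_back(std::move(line));
    }
    return lines;  // empty if the file could not be opened
}
```

A call such as read_lines("test.csv", 600000) would then return the raw lines, ready to be split into fields.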

See the alternative solution below. I omitted the filtering, since Ted has already shown it.

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <ctime>
#include <vector>
#include <chrono>
#include <sstream>
#include <algorithm>
#include <iterator>

constexpr size_t MaxLines = 600'000u;
constexpr size_t NumberOfLines = 500'000u;
const std::string fileName{ "test.csv" };

// Dummy routine for writing a test file
void createFile() {
    if (std::ofstream ofs{ fileName }; ofs) {
        std::time_t ttt = 0;
        for (size_t k = 0; k < NumberOfLines; ++k) {
            std::time_t time = static_cast<time_t>(ttt);
            ttt += 1000;
            ofs << k << ';'
#pragma warning(suppress : 4996)
                << std::put_time(std::localtime(&time), "%d.%m.%Y  %H:%M") << ';'
                << k << ';'
                << "UserID" << k << ';'
                << "Area" << k << ';'
                << "Description" << k << ';'
                << "Comment" << k << ';'
                << "Checksum" << k << '\n';
        }
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing\n\n";
}


// We will create a bigger input buffer for our stream
constexpr size_t ifStreamBufferSize = 100'000u;
static char buffer[ifStreamBufferSize];


// Object oriented Model. Class for one record
struct Record {

    // Data
    long id{};
    std::tm time{};
    long objectId{};
    std::string userId{};
    std::string area{};
    std::string description{};
    std::string comment{};
    std::string checkSum{};

    // Methods
    // Extractor operator
    friend std::istream& operator >> (std::istream& is, Record& r) {

        // Read one complete line
        if (std::string line; std::getline(is, line)) {

            // Here we will store the parts of the line after the split
            std::vector<std::string> parts{};

            // Convert line to istringstream for further extraction of line parts
            std::istringstream iss{ line };

            // One part of a line
            std::string part{};

            // Split
            while (std::getline(iss, part, ';')) {

                // Check for error
                if (part[0] == '#') {
                    is.setstate(std::ios::failbit);
                    break;
                }
                // add part
                parts.push_back(std::move(part));
            }
            // If all was OK
            if (is) {
                // If we have enough parts
                if (parts.size() == 8) {

                    // Convert parts to target data in record
                    r.id = std::strtol(parts[0].c_str(), nullptr, 10);

                    std::istringstream ss{parts[1]};
                    ss >> std::get_time(& r.time, "%d.%m.%Y  %H:%M");
                    if (ss.fail()) 
                        is.setstate(std::ios::failbit);

                    r.objectId = std::strtol(parts[2].c_str(), nullptr, 10);

                    r.userId = std::move(parts[3]);

                    r.area = std::move(parts[4]);

                    r.description = std::move(parts[5]);

                    r.comment = std::move(parts[6]);

                    r.checkSum = std::move(parts[7]);
                }
                else is.setstate(std::ios::failbit);
            }
        }
        return is;
    }
    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Record& r) {
        return os << r.id << "   "
#pragma warning(suppress : 4996)
            << std::put_time(&r.time, "%d.%m.%Y  %H:%M") << "   "  
            << r.objectId << "   " << r.userId << "   " << r.area << "   " << r.description << "   " << r.comment << "   " << r.checkSum;
    }
};

// Data will hold all records
struct Data {

    // Data part
    std::vector<Record> records{};

    // Constructor will reserve space to avoid reallocation
    Data() { records.reserve(MaxLines); }

    // Simple extractor. Will call Record's extractor
    friend std::istream& operator >> (std::istream& is, Data& d) {

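        // Set bigger file buffer. This can be a time saver, but the effect of
        // pubsetbuf is implementation-defined (some standard libraries ignore it
        // once the file is already open), so measure on your toolchain
        is.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);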
        std::copy(std::istream_iterator<Record>(is), {}, std::back_inserter(d.records));
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Data& d) {
        std::copy(d.records.begin(), d.records.end(), std::ostream_iterator<Record>(os, "\n"));
        return os;
    }

};

int main() {
    // createFile();

    auto start = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);

    if (std::ifstream ifs{ fileName }; ifs) {

        Data data;

        // Start time measurement
        start = std::chrono::system_clock::now();

        // Read and parse complete data
        ifs >> data;

        // End of time measurement
        elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
        std::cout << "\nReading and splitting. Duration: " << elapsed.count() << " ms\n";

        // Some debug output
        std::cout << "\n\nNumber of read records:  " << data.records.size() << "\n\n";
        for (size_t k{}; k < std::min<size_t>(10, data.records.size()); ++k)
            std::cout << data.records[k] << '\n';
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading\n\n";
}

And yes, I used "ctime".
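
The extractor above builds a std::istringstream for every line in order to split it. As a hedged alternative sketch, the same split can be done with std::string::find alone, which avoids constructing a stream per line. The helper name split_line is hypothetical, and whether this is actually faster should be measured on your data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one line at every `delimiter`.
// Empty fields are kept, so "a;;b" yields three parts.
std::vector<std::string> split_line(const std::string& line,
                                    char delimiter = ';') {
    std::vector<std::string> parts;
    parts.reserve(8);  // the CSV in the question has 8 columns

    std::size_t begin = 0;
    for (;;) {
        const std::size_t end = line.find(delimiter, begin);
        if (end == std::string::npos) {
            parts.push_back(line.substr(begin));  // last field
            break;
        }
        parts.push_back(line.substr(begin, end - begin));
        begin = end + 1;  // continue after the delimiter
    }

    // Even an empty line yields one (empty) field.
    assert(!parts.empty());
    return parts;
}
```

The size check from the extractor (parts.size() == 8) and the check for a leading '#' can then be applied to the returned vector as before.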