Increasing the speed of reading a csv file C++
I wrote this code to read and filter my csv file. It works the way I want for small files, but I just tried it with a 200k-line file and it took about 4 minutes, which is far too long for my use case. After some testing and fixing a few very silly mistakes, I got it down to 3 minutes. I found that roughly half the time is spent reading the file and half generating the result vector.
Is there any way to speed up my program, especially the csv-reading part? I'm out of ideas at this point and would appreciate any help.
Edit: The filter selects rows by a time range in a specific column, or by a time range plus a filter word, and writes the matching data into a result vector of strings.
My CSV file looks like this ->
The headers are:
ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum
The data is:
523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4
std::ifstream input_file(strComplPath, std::ios::in);
int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }
    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;
    // Check for faulty Data and replace it with an empty string
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }
    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}
I'm not sure exactly what the bottleneck in your program is (I copied your code from an earlier version of the question), but you have a lot of regexes and you mix reading the records with post-processing. I suggest that you create a class to hold one of these records, called record, overload operator>> for record, and then use std::copy_if with a filter that you can design independently of the reading. Post-process the records that pass the filter after reading.
I made a small test: reading 200k records while filtering takes 2 seconds on my old spinning disk. I only filtered on time_lower_bound and time_upper_bound; additional checks will of course slow it down a bit, but it shouldn't take minutes.
Example:
#include <algorithm>
#include <chrono>
#include <ctime>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// the suggested class to hold a record
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};

// A free function to read a time_point from an `istream`:
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};
    // C++20:
    // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);

    // C++11 to C++17 version:
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if(is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}

// The operator>> overload to read one `record` from an `istream`:
std::istream& operator>>(std::istream& is, record& r) {
    is >> r.ID;
    r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}

// An operator<< overload to print one `record`:
std::ostream& operator<<(std::ostream& os, const record& r) {
    std::ostringstream oss;
    oss << r.ID;
    { // I only made a C++11 to C++17 version for this one:
        std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
        std::tm ts = *std::localtime(&time);
        oss << ';' << ts.tm_mday << '.' << ts.tm_mon + 1 << '.'
            << ts.tm_year + 1900 << ' ' << ts.tm_hour << ':' << ts.tm_min << ';';
    }
    oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
        << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
    return os << oss.str();
}

// The reading and filtering part of `main` would then look like this:
int main() { // not "void main()"
    std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
    std::istringstream time_upper_bound_s("20.05.2021 09:40:00");

    // Your time boundaries as `std::chrono::system_clock::time_point`s -
    // again using the `to_tp` helper function:
    auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
    auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");

    // Verify that the boundaries were parsed ok:
    if(time_lower_bound == std::chrono::system_clock::time_point{} ||
       time_upper_bound == std::chrono::system_clock::time_point{}) {
        std::cerr << "failed to parse boundaries\n";
        return 1;
    }

    std::ifstream is("data"); // whatever your file is called
    if(is) {
        std::vector<record> recs; // a vector with all the records

        // create your filter
        auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
            // Only copy those `record`s within the set boundaries.
            // You can add additional conditions here too.
            return r.Timestamp >= time_lower_bound &&
                   r.Timestamp <= time_upper_bound;
        };

        // Copy those records that pass the filter:
        std::copy_if(std::istream_iterator<record>(is),
                     std::istream_iterator<record>{}, std::back_inserter(recs),
                     filter);

        // .. post process `recs` here ...

        // print result
        for(auto& r : recs) std::cout << r;
    }
}
Ted has already given an answer. I worked on a solution in parallel, so let me add it as well.
I created test data with 500,000 records, and on my machine everything is parsed in under 3 seconds. I also created classes.
Speed is gained by using std::move, increasing the input stream's buffer size, and calling reserve on the std::vector.
Please see the alternative solution below. I omitted the filtering; Ted has already shown that.
#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <ctime>
#include <vector>
#include <chrono>
#include <sstream>
#include <algorithm>
#include <iterator>

constexpr size_t MaxLines = 600'000u;
constexpr size_t NumberOfLines = 500'000u;

const std::string fileName{ "test.csv" };

// Dummy routine for writing a test file
void createFile() {
    if (std::ofstream ofs{ fileName }; ofs) {
        std::time_t ttt = 0;
        for (size_t k = 0; k < NumberOfLines; ++k) {
            std::time_t time = static_cast<time_t>(ttt);
            ttt += 1000;
            ofs << k << ';'
#pragma warning(suppress : 4996)
                << std::put_time(std::localtime(&time), "%d.%m.%Y %H:%M") << ';'
                << k << ';'
                << "UserID" << k << ';'
                << "Area" << k << ';'
                << "Description" << k << ';'
                << "Comment" << k << ';'
                << "Checksum" << k << '\n';
        }
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing\n\n";
}

// We will create a bigger input buffer for our stream
constexpr size_t ifStreamBufferSize = 100'000u;
static char buffer[ifStreamBufferSize];

// Object oriented model. Class for one record
struct Record {
    // Data
    long id{};
    std::tm time{};
    long objectId{};
    std::string userId{};
    std::string area{};
    std::string description{};
    std::string comment{};
    std::string checkSum{};

    // Methods
    // Extractor operator
    friend std::istream& operator >> (std::istream& is, Record& r) {
        // Read one complete line
        if (std::string line; std::getline(is, line)) {
            // Here we will store the parts of the line after the split
            std::vector<std::string> parts{};
            // Convert line to istringstream for further extraction of line parts
            std::istringstream iss{ line };
            // One part of a line
            std::string part{};
            // Split
            while (std::getline(iss, part, ';')) {
                // Check for faulty data
                if (!part.empty() && part[0] == '#') {
                    is.setstate(std::ios::failbit);
                    break;
                }
                // Add part
                parts.push_back(std::move(part));
            }
            // If all was OK
            if (is) {
                // If we have enough parts
                if (parts.size() == 8) {
                    // Convert parts to target data in record
                    r.id = std::strtol(parts[0].c_str(), nullptr, 10);
                    std::istringstream ss{ parts[1] };
                    ss >> std::get_time(&r.time, "%d.%m.%Y %H:%M");
                    if (ss.fail())
                        is.setstate(std::ios::failbit);
                    r.objectId = std::strtol(parts[2].c_str(), nullptr, 10);
                    r.userId = std::move(parts[3]);
                    r.area = std::move(parts[4]);
                    r.description = std::move(parts[5]);
                    r.comment = std::move(parts[6]);
                    r.checkSum = std::move(parts[7]);
                }
                else is.setstate(std::ios::failbit);
            }
        }
        return is;
    }

    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Record& r) {
        return os << r.id << " "
#pragma warning(suppress : 4996)
            << std::put_time(&r.time, "%d.%m.%Y %H:%M") << " "
            << r.objectId << " " << r.userId << " " << r.area << " "
            << r.description << " " << r.comment << " " << r.checkSum;
    }
};

// Data will hold all records
struct Data {
    // Data part
    std::vector<Record> records{};

    // Constructor will reserve space to avoid reallocation
    Data() { records.reserve(MaxLines); }

    // Simple extractor. Will call Record's extractor
    friend std::istream& operator >> (std::istream& is, Data& d) {
        // Set bigger file buffer. This is a time saver
        is.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
        std::copy(std::istream_iterator<Record>(is), {}, std::back_inserter(d.records));
        return is;
    }

    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Data& d) {
        std::copy(d.records.begin(), d.records.end(), std::ostream_iterator<Record>(os, "\n"));
        return os;
    }
};

int main() {
    // createFile();
    auto start = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);

    if (std::ifstream ifs{ fileName }; ifs) {
        Data data;
        // Start time measurement
        start = std::chrono::system_clock::now();
        // Read and parse complete data
        ifs >> data;
        // End of time measurement
        elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
        std::cout << "\nReading and splitting. Duration: " << elapsed.count() << " ms\n";

        // Some debug output
        std::cout << "\n\nNumber of read records: " << data.records.size() << "\n\n";
        for (size_t k{}; k < 10; ++k)
            std::cout << data.records[k] << '\n';
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading\n\n";
}
Yes, I used "ctime".