Boost 1.59 not decompressing all bzip2 streams
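I've been trying to decompress some .bz2 files on the fly, line by line so to speak. The files I'm dealing with are massive when uncompressed (in the region of 100 GB), so I wanted a solution that saves disk space.
Files compressed with vanilla bzip2 decompress without problems, but for files compressed with pbzip2 only the first bz2 stream found is decompressed. This bug tracker entry relates to the problem: https://svn.boost.org/trac/boost/ticket/3853 but I believe it was fixed after version 1.41. I've checked the bzip2.hpp file and it contains the 'fixed' version, and I've also checked that the Boost version used in the program is 1.59.
The code is here: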
cout << "Warning bzip2 support is a little buggy!" << endl;
// Open the file here
trans_file.open(files[i].c_str(), std::ios_base::in | std::ios_base::binary);
// Set up boost bzip2 decompression
boost::iostreams::filtering_istream in;
in.push(boost::iostreams::bzip2_decompressor());
in.push(trans_file);
std::string str;
// Begin reading, one decompressed line at a time
while (std::getline(in, str))
{
    std::stringstream stream(str);
    stream >> id_f >> id_i >> aif;
    /* Do stuff with values here */
}
Any advice would be great. Thanks!
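You're right.
It appears changeset #63057 only fixes part of the issue.
The corresponding unit test does work, though. But it uses the copy algorithm (and, if that's relevant, it works on a composite<> instead of a filtering_istream).
I'd open this as a defect or a regression. Include a file that exhibits the problem, of course. For me it reproduces using just /etc/dictionaries-common/words compressed with pbzip2 (default options).
I have the test.bz2 here: http://7f0d2fd2-af79-415c-ab60-033d3b494dc9.s3.amazonaws.com/test.bz2
Here's my test program: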
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/bzip2.hpp>
#include <boost/iostreams/stream.hpp>
#include <fstream>
#include <iostream>
#include <string>
namespace io = boost::iostreams;
void multiple_member_test(); // from the unit tests in changeset #63057
int main() {
    //multiple_member_test();
    //return 0;
    std::ifstream trans_file("test.bz2", std::ios::binary);
    // Set up boost bzip2 decompression
    io::filtering_istream in;
    in.push(io::bzip2_decompressor());
    in.push(trans_file);
    // Begin reading; with a pbzip2 file this stops after the first stream
    std::string str;
    while (std::getline(in, str))
    {
        std::cout << str << "\n";
    }
}
#include <boost/iostreams/compose.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/range/iterator_range.hpp>
#include <algorithm>
#include <cassert>
#include <sstream>
#include <vector>
void multiple_member_test() // from the unit tests in changeset #63057
{
    std::string data(20ul << 20, '*');
    std::vector<char> temp, dest;
    // Write compressed data to temp, twice in succession
    io::filtering_ostream out;
    out.push(io::bzip2_compressor());
    out.push(io::back_inserter(temp));
    io::copy(boost::make_iterator_range(data), out);
    out.push(io::back_inserter(temp));
    io::copy(boost::make_iterator_range(data), out);
    // Read compressed data from temp into dest
    io::filtering_istream in;
    in.push(io::bzip2_decompressor());
    in.push(io::array_source(&temp[0], temp.size()));
    io::copy(in, io::back_inserter(dest));
    // Check that dest consists of two copies of data
    assert(data.size() * 2 == dest.size());
    assert(std::equal(data.begin(), data.end(), dest.begin()));
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2));
    dest.clear();
    // Same check, this time decompressing through a composite<> sink
    io::copy(
        io::array_source(&temp[0], temp.size()),
        io::compose(io::bzip2_decompressor(), io::back_inserter(dest)));
    // Check that dest consists of two copies of data
    assert(data.size() * 2 == dest.size());
    assert(std::equal(data.begin(), data.end(), dest.begin()));
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2));
}
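If you want to confirm that a particular file really is multi-stream before filing the report, a rough check is to count stream headers in the raw compressed bytes. The sketch below is my own illustration, not part of Boost or libbz2: the helper names `count_bzip2_streams` and `count_bzip2_streams_in_file` are made up. It assumes that each stream starts on a byte boundary with "BZh", a level digit, and then the block magic 0x314159265359 (or the end-of-stream magic 0x177245385090 for an empty stream). It is only a heuristic, since the same ten bytes could in principle occur by chance inside compressed data.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Heuristic (hypothetical helper): count byte-aligned bzip2 stream headers
// in a buffer of raw compressed bytes. A header is assumed to be "BZh",
// a level digit '1'..'9', then either the block magic or, for an empty
// stream, the end-of-stream magic.
std::size_t count_bzip2_streams(const std::string& buf)
{
    static const std::string block_magic("\x31\x41\x59\x26\x53\x59", 6);
    static const std::string eos_magic("\x17\x72\x45\x38\x50\x90", 6);

    std::size_t count = 0;
    for (std::size_t pos = buf.find("BZh");
         pos != std::string::npos;
         pos = buf.find("BZh", pos + 1))
    {
        if (pos + 10 > buf.size())
            break; // not enough bytes left for a full header
        const char level = buf[pos + 3];
        if (level < '1' || level > '9')
            continue; // "BZh" not followed by a valid level digit
        if (buf.compare(pos + 4, 6, block_magic) == 0 ||
            buf.compare(pos + 4, 6, eos_magic) == 0)
        {
            ++count;
        }
    }
    return count;
}

// Convenience wrapper (hypothetical): reads the whole file into memory,
// so it is only suitable for small test files, not 100 GB data sets.
std::size_t count_bzip2_streams_in_file(const std::string& path)
{
    std::ifstream f(path.c_str(), std::ios::binary);
    std::string buf((std::istreambuf_iterator<char>(f)),
                    std::istreambuf_iterator<char>());
    return count_bzip2_streams(buf);
}
```

Calling `count_bzip2_streams_in_file("test.bz2")` should report more than one stream for a pbzip2 file of any real size and exactly one for a file written by plain bzip2. If the count is greater than one and the getline loop still stops early, that is good evidence for the bug report.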