How to read a subset of records from a warc file
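I am trying to parse Common Crawl .warc files in Python.
Because the files are large, I want to start with a sample/subset made up of the first few records.
How can I truncate a file so that it contains only the first X lines, while preserving the existing newlines/carriage returns?
Here is what I have already tried:
1. head -n 250 oldfile > newfile
This removes some of the returns that are needed to parse the file. If I try to use this file in my Hadoop job (reading it with the warc package), this is the error I get: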
Traceback (most recent call last):
File "test.py", line 46, in <module>
TagGrabber.run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
mr_job.execute()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
super(MRJob, self).execute()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
self.run_job()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
runner.run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
self._run()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
self._invoke_step(step_num, 'mapper')
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
self.per_step_runner_finish(step_num)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
self._wait_for_process(proc_dict, step_num)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
(proc_dict['args'], returncode, ''.join(tb_lines)))
Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
Traceback (most recent call last):
File "test.py", line 46, in <module>
TagGrabber.run()
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
mr_job.execute()
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
self.run_mapper(self.options.step_num)
File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "/var/cc-mrjob/mrcc.py", line 33, in mapper
for i, record in enumerate(f):
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
record = self.read_record()
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
header = self.read_header(fileobj)
File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/1.0\n'
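2. Same as #1, but using the tail command.
3. Same as #1, but followed by tr or sed to replace any missing newlines or ^M (carriage return) characters. The warc package still complains that the expected carriage returns or newlines are not in place.
4. unix2dos oldfile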
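It is hard to get the line endings right, because .warc files may also contain binary data. Truncating can also produce a corrupt .warc file because, for example, the Python library trusts the Content-Length headers to be valid.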
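If a smaller file on disk is really needed, one option is to cut on record boundaries instead of line boundaries. The following is only a minimal sketch, not part of the warc package: copy_first_records is a hypothetical helper, and it assumes an uncompressed WARC/1.0 file in which every record has a Content-Length header and is followed by the usual two CRLF pairs. It opens nothing in text mode, so binary payloads and the original CR/LF bytes are copied unchanged.

```python
import io


def copy_first_records(src, dst, n):
    """Copy up to n complete WARC records from src to dst, byte for byte.

    src and dst are binary file objects (hypothetical helper, not a warc API).
    Returns the number of records actually copied.
    """
    copied = 0
    while copied < n:
        version = src.readline()
        if not version:
            break  # clean end of input, fewer than n records available
        if not version.startswith(b"WARC/"):
            raise ValueError("Bad version line: %r" % version)

        header_lines = [version]
        length = None
        while True:
            line = src.readline()
            if not line:
                raise ValueError("unexpected end of file inside a record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # a blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")

        # Read the block by its declared size, never by counting lines.
        block = src.read(length)
        if len(block) != length:
            raise ValueError("record block is shorter than Content-Length")
        trailer = src.read(4)  # assumed to be the CRLF CRLF record separator

        dst.write(b"".join(header_lines))
        dst.write(block)
        dst.write(trailer)
        copied += 1
    return copied
```

It would be called with both files opened in binary mode ('rb' and 'wb'). A gzip-compressed .warc.gz would first have to be decompressed, for example by opening it through the gzip module.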
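The warc Python library reads only one record at a time from the .warc file (it avoids reading the whole file into memory at once), so you can process just a subset with itertools.islice. For example: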
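import warc
from itertools import islice

N = 10  # number of records to process
warc_file = warc.open('/path/to/file.warc')
# islice stops after N records, so the rest of the file is never read
for record in islice(warc_file, N):
    do_stuff_with(record)  # placeholder for your own processing function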
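To see why this is cheap even for a very large file, here is a small self-contained sketch of how islice consumes an iterator. It does not use the warc package: fake_records is a hypothetical stand-in for a record reader, and it logs every item it actually produces.

```python
from itertools import islice


def fake_records(total, log):
    """Stand-in for a WARC reader: yields one record at a time and logs each read."""
    for i in range(total):
        log.append(i)  # note that record i was actually read
        yield "record-%d" % i


def take_first(records, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))


reads = []
first = take_first(fake_records(1000000, reads), 3)
print(first)       # the three records that were requested
print(len(reads))  # how many records the reader had to produce
```

Only the requested records are pulled from the generator. The same holds for any object that yields records lazily, which is what the loop above relies on.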