Extract Data Dump From Freebase in Python
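Using the Freebase triples data dump downloaded from the website (freebase-rdf-latest.gz), what would be the best process for opening and reading this file to extract information, say, about companies and businesses? (In Python)
As far as I know, there are packages that can do this: open the gz file in Python and read the RDF file. I am not sure how to accomplish this...
My failed attempt in Python 3.6: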
import gzip
with gzip.open('freebase-rdf-latest.gz','r') as uncompressed_file:
    for line in uncompressed_file.read():
        print(line)
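Afterwards I could parse the XML structure to get the information, but I cannot even read the file.
The problem is that the call to `read()` makes the gzip module decompress the entire file at once and hold the decompressed data in memory. For a file this large, a more practical approach is to decompress the file a little at a time and stream the results.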
#!/usr/bin/env python3
import zlib


def stream_unzipped_bytes(filename):
    """
    Generator function, reads gzip file `filename` and yields
    uncompressed bytes.
    This function answers your original question, how to read the file,
    but its output is a generator of bytes so there's another function
    below to stream these bytes as text, one line at a time.
    """
    with open(filename, 'rb') as f:
        wbits = zlib.MAX_WBITS | 16  # 16 requires gzip header/trailer
        decompressor = zlib.decompressobj(wbits)
        fbytes = f.read(16384)
        while fbytes:
            yield decompressor.decompress(decompressor.unconsumed_tail + fbytes)
            fbytes = f.read(16384)
        # emit anything the decompressor is still holding back
        tail = decompressor.flush()
        if tail:
            yield tail


def stream_text_lines(gen):
    """
    Generator wrapper function, `gen` is a bytes generator.
    Yields one line of text at a time.
    """
    buf = b''
    for chunk in gen:
        buf += chunk
        lines = buf.splitlines(keepends=True)
        if not lines:
            # nothing decompressed yet, wait for more data from gen
            continue
        # yield all but the last line, because this may still be incomplete
        # and waiting for more data from gen
        for line in lines[:-1]:
            yield line.decode()
        # keep the (possibly incomplete) last line for the next round
        buf = lines[-1]
    # yield the final data so the last output is not lost
    if buf:
        yield buf.decode()


# Sample usage, using the stream_text_lines generator to stream
# one line of RDF text at a time
bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
for line in stream_text_lines(bytes_generator):
    # do something with `line` of text
    print(line, end='')
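For what it's worth, a shorter route may also work: `gzip.open` returns a file object, and iterating over it in text mode (`'rt'`) decompresses lazily, one line at a time, so it is only `.read()` that pulls everything into memory. The sketch below uses that to pick out triples of interest. It assumes each line of the dump is tab-separated as subject, predicate, object and a trailing `.`; the helper names `iter_triples` and `iter_matching` and the `'business.'` search string are made up for illustration and would need checking against the actual dump.

```python
import gzip


def iter_triples(path, encoding="utf-8"):
    """Yield (subject, predicate, object) tuples from a gzipped triples file.

    Assumption: fields are separated by tabs and followed by a final ".".
    """
    # Mode "rt" adds a text layer on top of the decompressor; iterating
    # the file object reads and decompresses incrementally, line by line.
    with gzip.open(path, "rt", encoding=encoding) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            # Skip lines that do not have at least three fields.
            if len(parts) >= 3:
                yield parts[0], parts[1], parts[2]


def iter_matching(path, needle):
    """Yield only the triples whose predicate or object contains `needle`."""
    for subj, pred, obj in iter_triples(path):
        if needle in pred or needle in obj:
            yield subj, pred, obj
```

Looping over `iter_matching('freebase-rdf-latest.gz', 'business.')` would then visit only the matching triples, without ever holding the whole dump in memory.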