Can't read back tiny Avro file created with the Python fastavro package
I have a small file to serialize.
It is a text file containing only this line:
INSERT INTO `pagelinks` VALUES (11442565,0,'Présent_de_narration',2600),(10265670,0,'Président',2600);
I do it with this small program, using the fastavro 0.17.9 package (Python 3.5.2 on Ubuntu 16.04 LTS):
import sys, re
from fastavro import writer

schema = {
    "namespace": "com.projet4.pagelinks",
    "type": "record",
    "name": "pagelink",
    "fields": [
        {"name": "page_id", "type": "int"},
        {"name": "page_title", "type": "string"}
    ]
}

insert_regex = re.compile('''INSERT INTO `pagelinks` VALUES (.*)\;''')
row_regex = re.compile("""(.*),(.*),'(.*)',(.*)""")

for line in sys.stdin:
    avro_file = open("pagelinks.avro", 'wb')
    match = insert_regex.match(line.strip())
    if match is not None:
        data = match.groups(0)[0]
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # (12,0,'Anti-statism',0)
                # # page_id, pl_namespace, pl_title, pl_from_namespace
                if row_match.groups()[1] == '0':
                    page_id, pl_title = row_match.groups()[0], row_match.groups()[2]
                    print(int(page_id), pl_title)
                    writer(avro_file, schema, [{"page_id":int(page_id), "page_title":pl_title}])
I launch the program with this command line:
cat pagelinks_nano.sql | ./parse_links_fastavro_test.py
It seems to succeed and the Avro file is created. Then I try to read it back:
import fastavro
with open("pagelinks.avro", 'rb') as avro_file:
    reader = fastavro.reader(avro_file)
    print("Embedded Schema :\n\n",reader.schema,"\n\nLines :")
    for pagelink in reader:
        print(pagelink)
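Here comes the problem.
The file opens, the schema is printed, and so is the first row,
but then the program crashes with this message: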
Embedded Schema :
{'name': 'pagelink', 'type': 'record', 'namespace': 'com.projet4.pagelinks', 'fields': [{'name': 'page_id', 'type': 'int'}, {'name': 'page_title', 'type': 'string'}]}
Lines :
{'page_id': 11442565, 'page_title': 'Présent_de_narration'}
Traceback (most recent call last):
  File "./reading.py", line 5, in <module>
    for pagelink in reader:
  File "fastavro/_read.pyx", line 645, in _iter_avro
  File "fastavro/_read.pyx", line 548, in fastavro._read.skip_sync
ValueError: expected sync marker not found
Is this a fastavro problem or an encoding problem?
Any help would be much appreciated :o
Thanks, everyone.
I found the solution by browsing the official fastavro GitHub repository:
https://github.com/tebeka/fastavro/issues/12
"Currently fastavro supports only "one shot" writes. However, records can be any iterable, including a generator that creates records one by one. I'll consider appending."
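For anyone who wants to check whether their own file has the same problem, here is a rough diagnostic. It rests on an assumption I have not verified in the fastavro source: that each writer() call in this version writes a complete container header, with its own sync marker, so calling it once per row leaves several headers in one file. The helper names count_avro_headers and looks_concatenated are made up for this sketch; only the four magic bytes come from the Avro container format.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def count_avro_headers(data):
    """Count occurrences of the Avro magic bytes in raw file contents.

    This is a heuristic: the same four bytes could in principle also
    appear inside record data.
    """
    return data.count(AVRO_MAGIC)


def looks_concatenated(path):
    """Return True if the file seems to hold more than one container header."""
    with open(path, "rb") as avro_file:
        return count_avro_headers(avro_file.read()) > 1
```

If looks_concatenated("pagelinks.avro") returns True, the file was most likely produced by more than one writer() call, which would explain why the reader stops after the first block.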
Buffering the values and writing them only once, after the loop ends, did the trick:
# Imports and `schema` are the same as in the question's script.
insert_regex = re.compile(r'''INSERT INTO `pagelinks` VALUES (.*)\;''')
row_regex = re.compile(r"""(.*),(.*),'(.*)',(.*)""")
avro_content = []

for line in sys.stdin:
    match = insert_regex.match(line.strip())
    if match is not None:
        data = match.groups(0)[0]
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # (12,0,'Anti-statism',0)
                # # page_id, pl_namespace, pl_title, pl_from_namespace
                if row_match.groups()[1] == '0':
                    page_id, pl_title = row_match.groups()[0], row_match.groups()[2]
                    avro_content.append({"page_id":int(page_id), "page_title":pl_title})

# Open the output once and call writer() a single time, after all rows are collected.
with open("pagelinks.avro", 'wb') as avro_file:
    writer(avro_file, schema, avro_content)
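The quoted comment also says that records can be a generator. As a variation on the buffered version, a generator could feed the rows to a single writer() call without keeping them all in a list first. This is only a sketch under that reading of the quote: iter_pagelinks and write_pagelinks are names I made up, the schema and regular expressions are copied from the question, and fastavro is imported inside the function only so the parsing part can be tried without it installed.

```python
import re

SCHEMA = {
    "namespace": "com.projet4.pagelinks",
    "type": "record",
    "name": "pagelink",
    "fields": [
        {"name": "page_id", "type": "int"},
        {"name": "page_title", "type": "string"},
    ],
}

INSERT_RE = re.compile(r"INSERT INTO `pagelinks` VALUES (.*);")
ROW_RE = re.compile(r"(.*),(.*),'(.*)',(.*)")


def iter_pagelinks(lines):
    """Yield one record dict per namespace-0 row found in the SQL dump lines."""
    for line in lines:
        match = INSERT_RE.match(line.strip())
        if match is None:
            continue
        # Drop the outer parentheses, then split the tuples apart.
        for row in match.group(1)[1:-1].split("),("):
            row_match = ROW_RE.match(row)
            if row_match is None:
                continue
            page_id, namespace, title, _from_namespace = row_match.groups()
            if namespace == "0":
                yield {"page_id": int(page_id), "page_title": title}


def write_pagelinks(lines, path="pagelinks.avro"):
    """Write all parsed rows to `path` with a single fastavro writer() call."""
    from fastavro import writer  # imported lazily; see the note above

    with open(path, "wb") as avro_file:
        # One call, so the file is written in "one shot" as the quote describes.
        writer(avro_file, SCHEMA, iter_pagelinks(lines))
```

From the script it would be called as write_pagelinks(sys.stdin), with the same cat pagelinks_nano.sql | ... command line as before.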