Can't read back tiny avro file created with python fastavro package

I have a small file to serialize. It is a text file containing only this line:

INSERT INTO `pagelinks` VALUES (11442565,0,'Présent_de_narration',2600),(10265670,0,'Président',2600);

I do this with the following small program, using the fastavro 0.17.9 package (Python 3.5.2 on Ubuntu 16.04 LTS):

import sys, re
from fastavro import writer

schema = {
    "namespace": "com.projet4.pagelinks",
    "type": "record",
    "name": "pagelink",
    "fields": [
        {"name": "page_id", "type": "int"},
        {"name": "page_title", "type": "string"}
    ]
}

insert_regex = re.compile('''INSERT INTO `pagelinks` VALUES (.*)\;''')
row_regex = re.compile("""(.*),(.*),'(.*)',(.*)""")
for line in sys.stdin:
    avro_file = open("pagelinks.avro", 'wb')
    match = insert_regex.match(line.strip())
    if match is not None:
        data = match.groups(0)[0]
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # (12,0,'Anti-statism',0)
                # # page_id, pl_namespace, pl_title, pl_from_namespace
                if row_match.groups()[1] == '0':
                      page_id, pl_title = row_match.groups()[0], row_match.groups()[2]
                      print(int(page_id), pl_title)
                      writer(avro_file, schema, [{"page_id":int(page_id), "page_title":pl_title}])

I launch the program with this command line:

cat pagelinks_nano.sql | ./parse_links_fastavro_test.py

It seems to succeed and the avro file is created. Then I try to read it back:

import fastavro
with open("pagelinks.avro", 'rb') as avro_file:
    reader = fastavro.reader(avro_file)
    print("Embedded Schema :\n\n",reader.schema,"\n\nLines :")
    for pagelink in reader:
        print(pagelink)

Here is the problem.

The file opens, the schema is shown, and so is the first record, but then the program crashes with this message:

Embedded Schema :

 {'name': 'pagelink', 'type': 'record', 'namespace': 'com.projet4.pagelinks', 'fields': [{'name': 'page_id', 'type': 'int'}, {'name': 'page_title', 'type': 'string'}]} 

Lines :
{'page_id': 11442565, 'page_title': 'Présent_de_narration'}
Traceback (most recent call last):
  File "./reading.py", line 5, in <module>
    for pagelink in reader:
  File "fastavro/_read.pyx", line 645, in _iter_avro
  File "fastavro/_read.pyx", line 548, in fastavro._read.skip_sync
ValueError: expected sync marker not found

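Is this a fastavro problem or an encoding problem?

Any help would be greatly appreciated :o

Thanks, everyone

I found the solution by browsing the official fastavro GitHub: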

https://github.com/tebeka/fastavro/issues/12

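"Currently fastavro supports only "one shot" write. However, records can be any iterable, including a generator that creates the records one by one. I'll consider appending."

Buffering the values and writing them only once, after the loop ends, did the trick: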

import sys, re
from fastavro import writer

schema = {
    "namespace": "com.projet4.pagelinks",
    "type": "record",
    "name": "pagelink",
    "fields": [
        {"name": "page_id", "type": "int"},
        {"name": "page_title", "type": "string"}
    ]
}

insert_regex = re.compile(r"INSERT INTO `pagelinks` VALUES (.*);")
row_regex = re.compile(r"(.*),(.*),'(.*)',(.*)")
avro_content = []
for line in sys.stdin:
    match = insert_regex.match(line.strip())
    if match is not None:
        data = match.group(1)
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # (12,0,'Anti-statism',0)
                # # page_id, pl_namespace, pl_title, pl_from_namespace
                if row_match.groups()[1] == '0':
                    page_id, pl_title = row_match.groups()[0], row_match.groups()[2]
                    avro_content.append({"page_id": int(page_id), "page_title": pl_title})

# Open the output file once, after parsing, and write every record in a single call.
with open("pagelinks.avro", 'wb') as avro_file:
    writer(avro_file, schema, avro_content)
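
The quoted comment also says the records can be a generator, so the intermediate list is not strictly required. Below is a minimal sketch of that variant, assuming the same schema and regular expressions as above. The names `SCHEMA`, `iter_pagelinks` and `write_pagelinks` are made up for this illustration and are not part of fastavro; only `fastavro.writer` is the library's own call.

```python
import re

SCHEMA = {
    "namespace": "com.projet4.pagelinks",
    "type": "record",
    "name": "pagelink",
    "fields": [
        {"name": "page_id", "type": "int"},
        {"name": "page_title", "type": "string"},
    ],
}

INSERT_REGEX = re.compile(r"INSERT INTO `pagelinks` VALUES (.*);")
ROW_REGEX = re.compile(r"(.*),(.*),'(.*)',(.*)")


def iter_pagelinks(lines):
    """Yield one record dict per namespace-0 row found in the SQL lines."""
    for line in lines:
        match = INSERT_REGEX.match(line.strip())
        if match is None:
            continue
        # Strip the outer parentheses, then split the row tuples apart.
        for row in match.group(1)[1:-1].split("),("):
            row_match = ROW_REGEX.match(row)
            if row_match is None:
                continue
            # page_id, pl_namespace, pl_title, pl_from_namespace
            page_id, pl_namespace, pl_title, _ = row_match.groups()
            if pl_namespace == "0":
                yield {"page_id": int(page_id), "page_title": pl_title}


def write_pagelinks(lines, path="pagelinks.avro"):
    """Hypothetical helper: write all parsed records with one writer() call."""
    # Imported here so the parsing helper stays usable without fastavro.
    from fastavro import writer

    with open(path, "wb") as avro_file:
        # writer() pulls records from the generator one at a time.
        writer(avro_file, SCHEMA, iter_pagelinks(lines))
```

In the script, calling `write_pagelinks(sys.stdin)` would replace the loop and the final `writer(...)` call. The point is the same as in the buffered version: the file is opened once and `writer` is called once, so the file contains a single header.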