Output a CHANGING OrderedDict to CSV

Question

I have pored over this post, but the answers don't seem to fit my needs. Then again, I'm new to Python, so that may be the problem.

Here are a few lines from output.csv:
Case Parties Address
25 THOMAS ST., PORTAGE, IN
67 CHESTNUT ST., MILLBROOK, NJ
1 EMPIRE DR., AUSTIN, TX, 11225
111 WASHINGTON ST. #404, VALPARAISO, AK
89 E. JERICHO TPKE., SCARSDALE, AZ

Code from the original post

import usaddress
import csv

with open('output.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        print(data)

(OrderedDict([('AddressNumber', u'4167'), ('StreetNamePreType', u'Highway'), ('StreetName', u'319'), ('StreetNamePostDirectional', u'E'), ('PlaceName', u'Conway'), ('StateName', u'SC'), ('ZipCode', u'29526-5446')]), 'Street Address')

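Much like the previous post, I need to output the parsed data to CSV. As far as I can tell, I need to do the following:

Provide the headers as a list for reference. (They're listed here in 'Details'.)
Using usaddress.tag(), parse source_csv into "data" while keeping the corresponding "keys."
Map key:data to header_reference
Export to output_csv with a single header row.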

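I'm using the Python module usaddress to parse a large CSV (200k+). The module outputs the parsed data as an OrderedDict. The post mentioned above only works when every record maps all of its fields to the same headers. However, one of the many benefits of usaddress is that it parses out data even when some fields are absent. So, for example, "123 Fake St, Maine, PA" maps perfectly to address,city,state headers. But "123 Jumping Block, Suite 600, Maine, PA" would put "Suite 600" in the "city" column, because the values are matched statically by position. If I parse the latter on its own, usaddress gives me address, occupancy identifier (e.g. "suite #"), city, state headers.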

The parser's online tool gives me exactly the output format I need, but it only handles 500 rows at a time.

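My code doesn't seem to know what each data point is until it has been routed through the module; a chicken-or-the-egg situation. How do I write rows to a CSV file when each row may have a different subset of columns?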

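For reference, the error I get when I try the closest solution (provided by isosceleswheel) is ValueError: I/O(...). The traceback references lines 107 and 90 of the [=61=] library, both of which have to do with fieldnames.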

with open('output.csv') as csvfile:
    reader = csv.DictReader(csvfile)

with open('myoutputfile', 'w') as o:  # this will be the new file you write to
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        header = ','.join(data.keys()) + '\n'  # this will make a string of the header separated by comma with a newline at the end
        data_string = ','.join(data.values()) + '\n' # this will make a string of the values separated by comma with a newline at the end
        o.write(header + data_string)  # this will write the header and then the data on a new line with each field separated by commas

Answer 1

You want to parse each address separately and store the results in a list. Then you can use a Pandas DataFrame to align the output. Something like this:

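import pandas as pd
import usaddress

data = ['Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637',
        'State & Lake, Chicago']

# tag() returns (OrderedDict of label -> value, address type); keep only the dict
tagged_addresses = [usaddress.tag(line)[0] for line in data]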

address_df = pd.DataFrame(tagged_addresses)

print(address_df)

  AddressNumber BuildingName IntersectionSeparator PlaceName SecondStreetName StateName StreetName StreetNamePostType StreetNamePreDirectional ZipCode
0          5757  Robie House                   NaN   Chicago              NaN        IL   Woodlawn             Avenue                    South   60637
1           NaN          NaN                     &   Chicago             Lake       NaN      State                NaN                      NaN     NaN
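If pandas is not an option, the same column alignment can be sketched with the standard library alone: collect the parsed dicts, use the union of their keys as the header, and let csv.DictWriter leave the missing cells empty. The dicts in `parsed` below are hand-written stand-ins for what usaddress.tag(addr)[0] returns, not real parser output, and the in-memory buffer is only there to keep the sketch self-contained.

```python
import csv
import io

# Hand-written stand-ins for the dicts returned by usaddress.tag(addr)[0].
parsed = [
    {"AddressNumber": "5757", "StreetName": "Woodlawn",
     "PlaceName": "Chicago", "StateName": "IL"},
    {"StreetName": "State", "IntersectionSeparator": "&",
     "SecondStreetName": "Lake", "PlaceName": "Chicago"},
]

# The union of all keys, in first-seen order, becomes the header row.
fieldnames = []
for row in parsed:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# Swap the buffer for open("parsed.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(parsed)  # keys missing from a row are written as restval

csv_text = buffer.getvalue()
print(csv_text)
```

This needs every parsed row in memory before the header can be written, which is the price of not knowing the columns in advance.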

Answer 2

See this GitHub issue for a solution.

Because we know all the possible labels in usaddress, we can use them to define the fields in the output.
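A minimal sketch of that idea, assuming the label list is fixed before any row is parsed, so a large file can be written one row at a time. The `LABELS` list here is a hand-picked subset for illustration only; the complete list should come from the library itself (I believe it is exposed as usaddress.LABELS, but treat that as an assumption and check your installed version). `write_tagged` is a hypothetical helper, and the sample rows are hand-written stand-ins for tagged output.

```python
import csv
import io

# Illustrative subset only; take the complete label list from usaddress.
LABELS = ["AddressNumber", "StreetName", "StreetNamePostType",
          "OccupancyType", "OccupancyIdentifier",
          "PlaceName", "StateName", "ZipCode"]


def write_tagged(tagged_rows, out_file, labels=LABELS):
    """Write an iterable of label -> value dicts under one fixed header."""
    writer = csv.DictWriter(out_file, fieldnames=labels, restval="")
    writer.writeheader()
    count = 0
    for tagged in tagged_rows:
        writer.writerow(tagged)  # labels absent from this row stay empty
        count += 1
    return count


# Hand-written stand-ins for tagged addresses, not real parser output.
rows = [
    {"AddressNumber": "123", "StreetName": "Fake", "StreetNamePostType": "St",
     "PlaceName": "Maine", "StateName": "PA"},
    {"AddressNumber": "123", "StreetName": "Jumping Block",
     "OccupancyType": "Suite", "OccupancyIdentifier": "600",
     "PlaceName": "Maine", "StateName": "PA"},
]

out = io.StringIO()
written = write_tagged(iter(rows), out)
print(out.getvalue())
```

With the default extrasaction, csv.DictWriter raises ValueError on a key that is not in fieldnames, which is a useful check that the label list really is complete.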

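I can't comment on the answer because I don't have enough reputation, but I'd advise against using the usaddress parse method for this task. The tag method parses an address, then concatenates consecutive address tokens when they share a label, and raises an error if there are non-consecutive tokens with the same label; it is best to capture those tagging errors in the output.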
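One way to capture tagging errors in the output is to record them in a column of their own instead of letting one bad address stop the run. The sketch below takes the tagger and the error class as parameters so it runs without the library installed; `fake_tag`, `FakeRepeatedLabelError`, and `tag_all` are all hypothetical names that only mimic the (dict, address type) return shape shown in the question.

```python
class FakeRepeatedLabelError(Exception):
    """Stand-in for the error a tagger raises on repeated, non-adjacent labels."""


def fake_tag(address):
    """Hypothetical tagger that mimics the (dict, address_type) return shape."""
    if "BAD" in address:
        raise FakeRepeatedLabelError(address)
    number, street = address.split(" ", 1)
    return {"AddressNumber": number, "StreetName": street}, "Street Address"


def tag_all(addresses, tagger, error_type):
    """Tag each address, recording failures in an 'error' field."""
    rows = []
    for address in addresses:
        row = {"original": address, "error": ""}
        try:
            tagged, address_type = tagger(address)
        except error_type as exc:
            row["error"] = type(exc).__name__  # keep the row, flag the failure
        else:
            row.update(tagged)
            row["address_type"] = address_type
        rows.append(row)
    return rows


results = tag_all(["25 THOMAS", "BAD INPUT"], fake_tag, FakeRepeatedLabelError)
for row in results:
    print(row)
```

With the real library this would presumably be called as tag_all(addresses, usaddress.tag, usaddress.RepeatedLabelError); verify the exception name against the version you have installed. The "original", "error", and "address_type" keys would then need to be added to the output fieldnames alongside the address labels.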

python

csv

street-address