Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts

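The title, more completely: Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts (variable number of tagged parts and variable number of occurrences of tags).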

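I know more about address parsing than about Python, which may be the root of the problem. How to do this may be obvious. The usaddress library presumably returns its results in this form on purpose, and that form may well be useful.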

I'm using usaddress, which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods," and it seems to work quite well. Here are the usaddress source and website.

So I ran it on a file such as:

2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD 
4239 SW HWY 101, UNIT 19 
1315 NE HARBOR RIDGE 
4850 SE 51ST ST 
1501 SE EAST DEVILS LAKE RD 
1525 NE REGATTA WAY 
6458 NE MAST AVE 
4009 SW HWY 101 
814 SW 9TH ST 
1665 SALMON RIVER HWY 
3500 NE WEST DEVILS LAKE RD, UNIT 18 
1912 NE 56TH DR 
3334 NE SURF AVE 
2734 SW DUNE CT
2558 NE 33RD ST 
2600 NE 33RD ST 
5617 NW JETTY AVE 

I would like to convert these results into something more table-like (ultimately a CSV or a database).

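I wasn't sure what data types are returned. Reading the documentation tells me that the tag method returns a tuple containing an OrderedDict with tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list of tuples (apparently with tags). Searching for how to convert a Python list with tagged parts to a table was unsuccessful.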

Searching for how to convert a tuple containing an OrderedDict didn't turn up much either. This is the closest that I found. I also found that pandas is good at various formatting tasks, although it was not clear to me how to apply pandas to this. Many of the closest questions I've found, like the opposite question or one with named tuples, have low scores.

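I also made some exploratory attempts to see what would work (below). I was able to see some ways of accessing the data, and using zip as in this Matrix Transpose question got me closer to a table, because the data and the tag names are now separate, though not uniform. Is there a way to get these results, which come either as a list of tagged parts or as a tuple containing an OrderedDict with tagged parts, into a table? Is there a fairly direct way to do that from the returned results?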

The parse method is as follows:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()

## close the file
f.close()

This produces the following results (note that when a tag has multiple parts, the tag is repeated).

[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
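For reference, the repeated tags in the parse output above can be merged into one value per tag, which is roughly the shape of the tag output. The following is a minimal sketch, not the library's own logic: `merge_parsed` is a hypothetical helper name, and the sample pairs are copied from the output above rather than produced by calling usaddress.

```python
from collections import OrderedDict


def merge_parsed(pairs):
    """Join tokens that share a tag into one space-separated value per tag."""
    merged = OrderedDict()
    for token, tag in pairs:
        token = token.strip(',')  # drop the trailing comma seen in 'RD,' and '101,'
        if tag in merged:
            merged[tag] = merged[tag] + ' ' + token
        else:
            merged[tag] = token
    return merged


# (token, tag) pairs as printed above for "1241 NE EAST DEVILS LAKE RD"
sample = [
    ('1241', 'AddressNumber'),
    ('NE', 'StreetNamePreDirectional'),
    ('EAST', 'StreetName'),
    ('DEVILS', 'StreetName'),
    ('LAKE', 'StreetName'),
    ('RD', 'StreetNamePostType'),
]

row = merge_parsed(sample)
print(list(row.items()))
```

Note that this sketch merges repeated tags even when they are not adjacent in the address, which may not be what you want for unusual addresses.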

The tag method is as follows:

## Get a library
import usaddress

## Open the file with read only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()

## close the file
f.close()

This produces the following output, which does a better job of combining multiple parts that share the same tag:

[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]

Just use the csv.DictWriter class together with your tag approach:

from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops; 
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys()) # keep track of all field names we see

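# Note 3: newline='' is what the csv module expects for output files in Python 3
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() gives the columns a stable order; a bare set has no defined order
    writer = DictWriter(out_file, fieldnames=sorted(fields))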
    writer.writeheader()
    writer.writerows(tagged_lines)

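Note that this is inefficient for large files, because it holds all of your input in memory at once; the only reason for doing so is that the set of field names (i.e. the CSV column headers) isn't known in advance.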

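If you know the full set, you can do it in a single streaming pass, writing the tagged output as you read each line. Alternatively, you can make one pass over the file to collect the set of headers, then a second pass to do the conversion.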
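Both alternatives can be sketched roughly as follows. This is a sketch under assumptions, not a drop-in solution: `tag_line` is a hypothetical stand-in that returns hard-coded dicts (copied from the tag output in the question) so the example runs without usaddress installed; in real use you would replace its body with `usaddress.tag(line)[0]`. The names `stream_convert` and `collect_fields` are illustrative, and `io.StringIO` stands in for the real input and output files.

```python
import csv
import io

# Hard-coded results copied from the question's tag output (stand-in data).
SAMPLE_TAGS = {
    '2244 NE 29TH DR': {
        'AddressNumber': '2244', 'StreetNamePreDirectional': 'NE',
        'StreetName': '29TH', 'StreetNamePostType': 'DR'},
    '4009 SW HWY 101': {
        'AddressNumber': '4009', 'StreetNamePreDirectional': 'SW',
        'StreetNamePreType': 'HWY', 'StreetName': '101'},
    '1665 SALMON RIVER HWY': {
        'AddressNumber': '1665', 'StreetName': 'SALMON RIVER',
        'StreetNamePostType': 'HWY'},
}


def tag_line(line):
    """Hypothetical stand-in for usaddress.tag(line)[0]."""
    return SAMPLE_TAGS[line.strip()]


def stream_convert(in_file, out_file, fields):
    """Single pass: write each tagged row as soon as its line is read."""
    writer = csv.DictWriter(out_file, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for line in in_file:
        if line.strip():  # skip blank lines
            writer.writerow(tag_line(line))  # missing fields are written as ''


def collect_fields(in_file):
    """First pass of the two-pass variant: gather headers in first-seen order."""
    fields = []
    for line in in_file:
        if line.strip():
            for key in tag_line(line):
                if key not in fields:
                    fields.append(key)
    return fields


text = '2244 NE 29TH DR\n4009 SW HWY 101\n1665 SALMON RIVER HWY\n'

# Two-pass variant: only the header list is kept in memory between passes.
fields = collect_fields(io.StringIO(text))
out = io.StringIO()
stream_convert(io.StringIO(text), out, fields)
print(out.getvalue())
```

If the field set is known up front, you can skip `collect_fields` and pass that list straight to `stream_convert`. Keep in mind that DictWriter by default raises a ValueError when a row contains a key that is not in `fieldnames`, so the "known" list really does need to be complete (or you can pass `extrasaction='ignore'` to drop unexpected keys). With a real file, the second pass means reopening the file or calling `seek(0)` on it.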