Convert tuple containing an OrderedDict with tagged parts to table with columns named from tagged parts
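Fuller title: Convert a tuple containing an OrderedDict with tagged parts into a table with columns named from the tagged parts (variable number of tagged parts and variable number of tag occurrences).
I know more about address parsing than about Python, which may be the root of the problem. How to do this may be obvious. The usaddress library intentionally returns its results in this form, which is probably useful.
I am using usaddress, which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods," and it seems to work quite well. Here are the usaddress source and website.
So I ran it on a file such as: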
2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD
4239 SW HWY 101, UNIT 19
1315 NE HARBOR RIDGE
4850 SE 51ST ST
1501 SE EAST DEVILS LAKE RD
1525 NE REGATTA WAY
6458 NE MAST AVE
4009 SW HWY 101
814 SW 9TH ST
1665 SALMON RIVER HWY
3500 NE WEST DEVILS LAKE RD, UNIT 18
1912 NE 56TH DR
3334 NE SURF AVE
2734 SW DUNE CT
2558 NE 33RD ST
2600 NE 33RD ST
5617 NW JETTY AVE
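I would like to convert these results into something more like a table (ultimately a CSV or a database).
I am not sure which data types are returned. Reading the documentation tells me that the tag method returns a tuple containing an OrderedDict with tagged parts. The parse method seems to return a slightly different type. This question helped me determine that it is a list and a tuple (apparently with tags). Searching for how to convert a python list with tagged parts to a table was unsuccessful.
Searching for how to convert a tuple containing an OrderedDict did not turn up much either. This is the closest that I found. I also found that pandas is good at various formatting tasks, although it was not clear to me how to apply pandas to this. Many of the closest questions I've found, like the opposite question or one with named tuples, have low scores.
I also made some exploratory attempts to see what would work (below). I was able to see some ways of accessing the data, and, following this Matrix Transpose question, using zip gets closer to a table because the data and the named tags are now separated, although the rows are not uniform. Is there a way to take these results, whether in a tagged list or in a tuple containing an OrderedDict with tagged parts, and put them into a table? Is there a fairly direct way to get there from the returned results?
The parse method is as follows: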
## Get a library
import usaddress
## Open the file with read-only permission
f = open('address_sample.txt')
## Read the first line
line = f.readline()
## If the file is not empty keep reading one line at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()
## close the file
f.close()
which produces these results (note that when a tag covers multiple parts, the tag is repeated):
[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
The tag method is as follows:
## Get a library
import usaddress
## Open the file with read-only permission
f = open('address_sample.txt')
## Read the first line
line = f.readline()
## If the file is not empty keep reading one line at a time
## until the file is empty
while line:
    ## Try the tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()
## close the file
f.close()
which produces the following output; it does a better job of combining multiple parts that share the same tag:
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]
Just use the csv.DictWriter class together with your tag approach:
from csv import DictWriter
import usaddress
tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops;
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys())  # keep track of all field names we see
# newline='' is what the csv module docs recommend for files given to a writer
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() turns the set into a list, so the column order is stable
    writer = DictWriter(out_file, fieldnames=sorted(fields))
    writer.writeheader()
    writer.writerows(tagged_lines)
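In case it is unclear what DictWriter does with rows that lack some of the columns, here is a minimal sketch that needs no usaddress at all. The two rows are copied from the tag output in the question; the in-memory buffer and the first-seen column order are my own choices for illustration, not something usaddress or the code above prescribes.

```python
import csv
import io

# Two rows copied from the tag output in the question: the first has no
# StreetNamePreType, the second has no StreetNamePostType.
rows = [
    {"AddressNumber": "2244", "StreetNamePreDirectional": "NE",
     "StreetName": "29TH", "StreetNamePostType": "DR"},
    {"AddressNumber": "4009", "StreetNamePreDirectional": "SW",
     "StreetNamePreType": "HWY", "StreetName": "101"},
]

# Build the header as the union of all keys, in first-seen order.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # in-memory stand-in for an output file
# restval is the value written for any column a row does not have.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row is written under the shared header, and any tag a given address did not produce is left as an empty cell, which is what makes the variable number of tags per address a non-issue.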
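Note that collecting every tagged line before writing is inefficient for large files, because it holds your entire input in memory at once; the only reason to do it is that the set of field names (i.e. the CSV column headers) is not known in advance.
If you do know the complete set, you can do the job in a single streaming pass, writing the tagged output as you read each line. Alternatively, you can make one pass over the file to build the set of headers and then a second pass to do the conversion.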
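The two-pass variant could look roughly like the sketch below. The function names collect_fieldnames and write_csv are made up for this example, and the tag argument is an assumption: any callable that takes one line and returns a dict of tag name to value. Skipping blank lines is also my own addition.

```python
import csv
from typing import Callable, Dict, List

# Hypothetical type alias: a function mapping one input line to {tag: value}.
Tagger = Callable[[str], Dict[str, str]]


def collect_fieldnames(in_path: str, tag: Tagger) -> List[str]:
    """First pass: gather every tag name, in first-seen order."""
    seen: Dict[str, None] = {}
    with open(in_path) as in_file:
        for line in in_file:
            if line.strip():  # skip blank lines
                seen.update(dict.fromkeys(tag(line)))
    return list(seen)


def write_csv(in_path: str, out_path: str, tag: Tagger,
              fieldnames: List[str]) -> None:
    """Second pass: tag each line again and write it out immediately."""
    with open(in_path) as in_file, \
            open(out_path, "w", newline="") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()
        for line in in_file:
            if line.strip():
                writer.writerow(tag(line))
```

With usaddress installed, you would presumably pass something like lambda line: usaddress.tag(line)[0] as tag, call collect_fieldnames first, and hand its result to write_csv. Only one tagged line is in memory at a time; the price is that every line gets tagged twice, so this trades CPU time for memory. If you already know the full list of headers, skip the first function and call write_csv directly.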