Unflatten JSON objects with indices / deserialize n-triples to hierarchical Excel
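I have parsed JSON+LD (structured) data using the tool Screaming Frog. The format in which this tool exports the data is unworkable, because the parent/child relationships (cross-references) are not on a single row in Excel. Edit: this serialization format is called n-triples. Below is a sample output, with the index relationships colour-coded (sorry, I'm not allowed to post images yet):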
Subject    Predicate                   Object
subject27  schema.org/aggregateRating  subject28
subject27  schema.org/offers           subject29
subject27  schema.org/operatingSystem  ANDROID
subject27  type                        schema.org/SoftwareApplication
subject28  schema.org/ratingCount      15559
subject28  schema.org/ratingValue      3.597853422
subject28  type                        schema.org/AggregateRating
subject29  schema.org/price            0
subject29  type                        schema.org/Offer
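Below is an example of the desired final output. Each nesting level (up to 4 levels deep) should be mapped to its own pair of columns, with the parent path information repeated on every row.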
Predicate L1                Object L1                       Predicate L2            Object L2
type                        schema.org/SoftwareApplication
schema.org/operatingSystem  ANDROID
schema.org/aggregateRating  subject28                       schema.org/ratingCount  15559
schema.org/aggregateRating  subject28                       schema.org/ratingValue  3.597853422
schema.org/aggregateRating  subject28                       type                    schema.org/AggregateRating
schema.org/offers           subject29                       schema.org/price        0
schema.org/offers           subject29                       type                    schema.org/Offer
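I have been looking at existing unflattening solutions, but these either use path information stored in a single column (with each "lowest level value" getting its own "row"), or they do not rebuild the original data based on the indices.
I was hoping to get this done with a combination of for loops and SQL JOINs, but I feel there must be a more elegant solution. This could be in Python, PHP, JS or SQL, or a combination; perhaps even adding each "subject" to a MongoDB document and then applying a merge operation on that?
Edit: updated the title to optimize this post for SEO. The serialization format of the RDF and JSON+LD data I am working with is called N-triples. Read more here: https://medium.com/wallscope/understanding-linked-data-formats-rdf-xml-vs-turtle-vs-n-triples-eb931dbe9827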
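For reference, the JOIN idea can be sketched for two nesting levels with a single self-join. This is only a rough illustration, assuming SQLite through Python's built-in sqlite3 module; the table name `triples` and the helper `two_level_rows` are made up for this example, and every additional level would need another LEFT JOIN.

```python
import sqlite3

# Sample triples from the export above, all values kept as strings.
triples = [
    ("subject27", "schema.org/aggregateRating", "subject28"),
    ("subject27", "schema.org/offers", "subject29"),
    ("subject27", "schema.org/operatingSystem", "ANDROID"),
    ("subject27", "type", "schema.org/SoftwareApplication"),
    ("subject28", "schema.org/ratingCount", "15559"),
    ("subject28", "schema.org/ratingValue", "3.597853422"),
    ("subject28", "type", "schema.org/AggregateRating"),
    ("subject29", "schema.org/price", "0"),
    ("subject29", "type", "schema.org/Offer"),
]

def two_level_rows(triples):
    """Hypothetical helper: join each root triple to its child triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    rows = conn.execute(
        """
        SELECT t1.predicate, t1.object, t2.predicate, t2.object
        FROM triples AS t1
        -- a child is any triple whose subject is the parent's object
        LEFT JOIN triples AS t2 ON t2.subject = t1.object
        -- roots are subjects that never appear as an object
        WHERE t1.subject NOT IN (SELECT object FROM triples)
        """
    ).fetchall()
    conn.close()
    return rows

rows = two_level_rows(triples)
for row in rows:
    print(row)
```

The row order is not guaranteed without an ORDER BY, and the fixed number of joins is exactly why this does not feel elegant for deeper nesting.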
This is probably all kinds of ugly, and certainly not Pythonic in many ways, but it gets the job done on your sample data:
import re

def group_items(items, prop):
    group = {}
    for item in items:
        key = item[prop]
        if key not in group:
            group[key] = []
        group[key].append(item)
    return group

with open('input.txt', encoding='utf8') as f:
    # analyze column widths on the example of the header row
    # this allows for flexible column widths in the input data
    header_row = next(f)
    columns = re.findall(r'\S+\s*', header_row.rstrip('\n'))
    i = 0
    cols = []
    headers = []
    for c in columns:
        headers.append(c.strip())
        cols.append([i, i + len(c)])
        i += len(c)
    cols[-1][1] = 100000   # generous data length for last column

    # extract one item per line, using those column widths
    items = []
    for line in f:
        item = {}
        for c, col in enumerate(cols):
            item[headers[c]] = line[col[0]:col[1]].strip()
        items.append(item)

# group items to figure out which ones are at the root
items_by_subject = group_items(items, 'Subject')
items_by_object = group_items(items, 'Object')

# root keys are those subjects that are not anyone else's object
root_keys = set(items_by_subject.keys()) - set(items_by_object.keys())
root_items = [items_by_subject[k] for k in root_keys]

# recursive function to walk the tree and determine the leafs
leafs = []
def unflatten(items, parent=None, level=1):
    for item in items:
        item['Parent'] = parent
        item['Level'] = level
        key = item['Object']
        if key in items_by_subject:
            unflatten(items_by_subject[key], item, level+1)
        else:
            leafs.append(item)

# ...which needs to be called for each group of root items
for group in root_items:
    unflatten(group)

# this is not limited to 4 levels
max_level = max(item['Level'] for item in leafs)

# recursive function to fill in parent data
# (output is always passed in, so no mutable default argument is needed)
def fill_data(item, output):
    parent = item['Parent']
    if parent is not None:
        fill_data(parent, output)
    output['Predicate L%s' % item['Level']] = item['Predicate']
    output['Object L%s' % item['Level']] = item['Object']

# ...which needs to be called once per leaf
result = []
for leaf in reversed(leafs):
    output = {}
    for l in range(1, max_level + 1):
        output['Predicate L%s' % l] = None
        output['Object L%s' % l] = None
    fill_data(leaf, output)
    result.append(output)

# output result
for item in result:
    print(item)
Given your sample input as input.txt, the output is as follows:
{'Predicate L1': 'type', 'Object L1': 'schema.org/SoftwareApplication', 'Predicate L2': None, 'Object L2': None}
{'Predicate L1': 'schema.org/operatingSystem', 'Object L1': 'ANDROID', 'Predicate L2': None, 'Object L2': None}
{'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'type', 'Object L2': 'schema.org/Offer'}
{'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'schema.org/price', 'Object L2': '0'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'type', 'Object L2': 'schema.org/AggregateRating'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingValue', 'Object L2': '3.597853422'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingCount', 'Object L2': '15559'}
I'll leave putting this into a file of some sort as an exercise.
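If a CSV file that Excel can open is good enough, one possible way to do that exercise is the standard library's csv.DictWriter. The sketch below is only an illustration: `write_rows` is a made-up helper name, and `sample` is just two of the result rows shown above, typed in by hand so the snippet runs on its own.

```python
import csv
import io

def write_rows(rows, fileobj):
    """Hypothetical helper: write a list of result dicts as CSV."""
    # Column order follows the key order of the first row.
    fieldnames = list(rows[0].keys()) if rows else []
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    # The csv module writes None as an empty cell.
    writer.writerows(rows)

# Two rows copied from the output above, as stand-ins for `result`.
sample = [
    {'Predicate L1': 'type', 'Object L1': 'schema.org/SoftwareApplication',
     'Predicate L2': None, 'Object L2': None},
    {'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29',
     'Predicate L2': 'schema.org/price', 'Object L2': '0'},
]

buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, open it with `open('output.csv', 'w', newline='', encoding='utf8')` (the file name is just an example) and pass that file object along with `result`.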